PerfMonitor is a tool for quickly and easily collecting ETW performance data and generating useful reports. It is based on the Event Tracing for Windows (ETW) feature of the operating system, which can collect information machine wide on a variety of useful events (see Advanced Data Collection for what data is already built into the OS and CLR). ETW is the powerful technology that the Windows performance group uses almost exclusively to track and understand the performance of Windows, and it is the basis for their XPerf tool. The data collected by PerfMonitor is the same data that XPerf uses, so PerfMonitor can be used to simplify data collection and simple reporting, while XPerf can be used for more advanced analysis of the same data. PerfMonitor also has the ability to decode symbolic information associated with .NET Runtime code, making PerfMonitor more valuable for .NET investigations. The source code for PerfMonitor, as well as the TraceEvent data parsing library on which it is based, is available at bcl.codeplex.com, so user-defined data analysis is also relatively easy.
See also FAQ, TroubleShooting, ETW Background, Understanding the Analysis Report.
PerfMonitor was designed to be easy to deploy and use. To deploy PerfMonitor, simply copy PerfMonitor.exe to the computer you wish to use it on. No additional files or installation step is needed. You can get the latest copy of PerfMonitor.exe from the PerfMonitor download page. Simply click on the PerfMonitor.bin.zip file to download it, click through the dialog boxes, and then copy the exe to your hard drive. You are then ready to go!
Open an Administrator command window (Start -> All Programs -> Accessories -> Command Prompt -> Right Click -> Run as Administrator), change your directory to where you placed PerfMonitor.exe, and then type the command
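A plausible invocation (this sketch assumes the 'runAnalyze' command and the CpuTest.exe sample program referenced later in this document; substitute your own program name):

```
perfMonitor runAnalyze CpuTest.exe
```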
This command tells perfMonitor to
You can then browse the resulting report.
The source code for CpuTest is useful when analyzing this data. See Understanding the Analysis Report for more on the report.
The 'runAnalyze' command used in the quick start was designed to make a common set of operations easy. However, it is just one possible 'path' from data collection to analysis. PerfMonitor allows these steps to be done individually, which allows for more advanced scenarios. The first step in most profiling scenarios is simply to turn data collection on and off. This is what the 'start' and 'stop' operations do
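For example (a sketch; the output file name is optional and defaults as described below):

```
perfMonitor start myTrace.etl
    ... run the scenario of interest ...
perfMonitor stop
```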
The start command turns on (machine wide) logging, and the stop command turns logging off. Only one session can be on at a time. If you try to start a session when one is already started, it closes the existing session and opens a new one. The effect of turning on a session is immediate and takes effect on live processes as well as new processes that are created. A file name can be specified (the suggested extension is .ETL), but it defaults to 'perfMonitorOutput.etl' if not provided. ETW allows a huge array of events to be turned on, which can be a daunting task, so perfMonitor simplifies this by choosing a good set of default events that are useful in a broad variety of performance investigations but are not too expensive to collect. This includes OS kernel operations associated with processes, threads, DLLs, page faults, disk I/O, and CPU sampling, as well as CLR events associated with garbage collections, the CLR thread pool, lock contention, exceptions, and compilation of managed code to native code. While PerfMonitor chooses a good set of default events to collect, through various options it gives you full control over what events are collected; see Advanced Data Collection for more.
Once logging has been started, it will continue until it is explicitly stopped. Thus scripts that use 'start' should have very robust error handling to ensure that the 'stop' command is executed in all cases (including a user hitting Ctrl-C). For this reason you should try to use the 'run' or 'collect' commands instead of 'start' and 'stop' whenever possible.
Due to a limitation in the ETW implementation, kernel events cannot be collected into the same ETL file as all other events. PerfMonitor works around this by making two separate files (e.g. *.etl and *.kernel.etl). Thus perfMonitor typically generates two files. Whenever PerfMonitor is manipulating an ETL file called X.etl, it looks for any 'shadow' file like X.*.etl and combines the files into one when doing its processing. Thus in many scenarios you can ignore the fact that PerfMonitor generates multiple files. You can also merge these files into a single file (see the perfMonitor merge command below).
The combination of starting ETW logging, running a command, and then stopping, is so common the 'run' command was created to do precisely this. Thus the command
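(sketch; Foo.exe is a stand-in for the program being profiled)

```
perfMonitor run Foo.exe
```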
Is equivalent to
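(sketch; 'run' wraps this start/program/stop sequence, with Foo.exe again standing in for the program being profiled)

```
perfMonitor start
Foo.exe
perfMonitor stop
```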
Note that data collection is still machine wide, so you will get information about the entire machine at the time that Foo.exe was running. One important value of the 'run' command is that it tries quite hard to ensure that the 'stop' command is issued under any error condition or if the user interrupts the command (Ctrl-C).
When you wish to profile a process that is already running or cannot be started 'by hand', you can't use the 'run' command. While you could use 'start' and 'stop', we wish to discourage that (it is too easy to leave ETW on). Instead there is a 'collect' command, which turns on logging and waits for the user to type 'S' to stop collection. Thus
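(sketch; the file name is optional, as with 'start')

```
perfMonitor collect myTrace.etl
```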
Will start logging and wait for the user to type 'S' and the Enter key. You can abort the collection with Ctrl-C.
When data is collected with 'collect' (or when perfMonitor 'stop' is used), by default PerfMonitor does a 'ClrRundown'. A 'ClrRundown' makes a request to every process running the .NET Runtime to dump symbolic information so that stack traces involving managed code can be decoded. This is only needed if the process of interest is managed AND it has not exited. This is why it is not necessary to do a rundown when you use the 'run' command (since the process already did a rundown when it exited).
Rundown does take time and file space, so if you know that the process of interest is either not a .NET program or will have died before collection has ended you can specify the /noRundown flag to 'collect' to avoid this overhead.
As already mentioned, due to limitations in ETW, data from PerfMonitor typically consists of two files: an .ETL file and a .Kernel.ETL file. In addition, these files do not contain all the information necessary to decode the events on another machine (the missing information includes the EXACT version of executable files, as well as the mappings of hard drives to drive letters). Thus the raw information is not well suited to being transferred to another machine for analysis. The perfMonitor 'merge' command is designed to fix this. By typing
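(sketch; the file name defaults as described below)

```
perfMonitor merge fileName.etl
```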
PerfMonitor will combine fileName.ETL (defaults to perfMonitor.etl) and fileName.kernel.ETL into just fileName.ETL (deleting the originals if successful). Moreover it will include all information needed to decode the ETL file on another machine. Thus after merging, the single file can be transferred to another machine for off-line analysis. The merged file also works equally well on the machine where the data was collected (it just takes an extra step to create).
The ETL files created by PerfMonitor are identical to files that would be created by the Windows XPerf tool. Thus a useful scenario is to use PerfMonitor to collect information on a remote machine (because it is one EXE that is easy to deploy anywhere), and then use XPerf to analyze the resulting data on another machine.
Typically ETW will generate 2-4 Meg of data for every second of collection (although this can vary widely depending on the events being logged and what applications are doing during the log). It is also true that log files larger than 500 Meg or so are quite unwieldy. Thus it is best if data is collected for less than 100 seconds. This can be a problem if you are trying to understand a performance problem that is intermittent. This is where the /Circular:sizeMB option is useful. By specifying /Circular:100 it caps the file size at 100 Meg (probably enough for 25-50 seconds of trace), so logging can be left on indefinitely. When the problem occurs, as long as you can stop the logging within about 25 seconds, you should have data on the performance anomaly.
Once you have ETL files created by PerfMonitor, you can run a variety of analyses on them by using the 'analyze' command
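(sketch; the /process: qualifier and file name placeholders are explained next)

```
perfMonitor /process:nameOrPID analyze fileName.etl
```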
Will perform a set of useful performance evaluations on the data in fileName.ETL (defaults to perfMonitor.etl) for the process designated by nameOrPID. NameOrPID is either the process name (defined to be the name of the EXE without path or extension) or the (decimal) process ID. The 'analyze' command will work if you don't specify a process to focus on, but this is typically not what you want. You can discover what processes were started in a given ETL file by doing
Which gives a list of processes that started while data was being collected. This can then be used to determine the correct /process: qualifier to use.
As a convenience, the command
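(sketch; Foo.exe is a stand-in for your program)

```
perfMonitor runAnalyze Foo.exe
```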
Will perform the run command (start data collection, run the command, stop), and then do an 'analyze' command, using the process name from the command as the focus process. This was the command used in the quick start, and is quite common.
The report that is generated by the 'analyze' command is meant to be self-explanatory; however, a certain amount of performance background is valuable in interpreting the data. That background is presented here. This section assumes some familiarity with ETW; please see ETW Background if you are new to ETW.
The main report has three sections
See also Understanding CPU Performance Data, Understanding GC Performance Data, Understanding Just In Time Compiler Data, FAQ, TroubleShooting, ETW Background, Home.
The data shown by default in the PerfMonitor stack viewer are stack traces taken every millisecond on each processor on the system. Every millisecond, on every CPU, whatever process is running on that CPU is stopped and the operating system 'walks the stack' associated with the running code, collecting the return address of every method on the stack. Stackwalking may not be perfect. It is possible that the OS can't find the next frame (leading to broken stacks) or that an optimizing compiler has removed a method call (see missing frames), which can make analysis more difficult. However for the most part the scheme works well and has low overhead (typically less than 5% slowdown), so monitoring can be done on 'production' systems.
Because the samples are taken every millisecond per processor, each sample represents 1 millisecond of CPU time. However, exactly where the sample is taken is effectively random, so it is really 'unfair' to 'charge' the full millisecond to the routine that happened to be running at the time the sample was taken. While this is true, it is also true that as more samples are taken this 'unfairness' decreases as the square root of the number of samples. If a method has just 1 or 2 samples, it could be just random chance that they happened in that particular method, but a method with 10 samples is likely to have truly used between 7 and 13 samples (30% error). Routines with 100 samples are likely to be within 90 and 110 (10% error). For 'typical' analysis this means you want at least 1000 and preferably more like 5000 samples (there are diminishing returns after 10K). By collecting a few thousand samples you ensure that even moderately 'warm' methods will have at least 10 samples and 'hot' methods will have 100s, which keeps the error acceptably small. Because PerfMonitor does not allow you to vary the sampling frequency, this means that you need to run the scenario for at least several seconds (for CPU bound tasks), and 10-20 seconds for less CPU bound activities.
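The error estimates above follow the usual square-root (Poisson counting) rule of thumb: for a method with N samples, the expected relative error is roughly

```latex
\text{relative error} \approx \frac{\sqrt{N}}{N} = \frac{1}{\sqrt{N}},
\qquad
N = 10 \Rightarrow \approx 32\%,\quad
N = 100 \Rightarrow 10\%,\quad
N = 10{,}000 \Rightarrow 1\%
```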
Even with 1000s of samples, there is still 'noise' that is typically at least 3%. This error gets larger as the methods being investigated have fewer samples. This makes it problematic to use sample based profiling to compare two traces to track down small regressions (say 3%): the noise is likely to be at least as large as the 'signal' (the diff) you are trying to track down. Increasing the number of samples will help; however, you should always keep in mind the sampling error when comparing small differences between two traces.
Before doing a detailed CPU analysis, it is important to ensure that optimizing CPU will have an effect. If your program is disk or network bound, then optimizing CPU will have little effect. This is why the top level 'rollups' of CPU consumption are important to look at first. If the total CPU samples add up to 400 msec, then that is an upper bound on the effect CPU optimization can have. You should spend your time only on costs that are large enough to justify optimization.
Because a stack trace is collected for each sample, every node has both an exclusive metric (the number of samples that were collected in that particular method) and an inclusive metric (the number of samples that were collected in that method or any method that it called). Exclusive times are useful in the bottom-up view to quickly determine the 'hot' methods that are the most important to optimize. Inclusive times are useful in the top-down and caller-callee views to understand how time is spent in methods that are not in the bottom-up view.
All programs call into operating system functions to do 'primitive' operations (e.g. read from a file, update the screen, ...). However these functions typically use a large number of private internal methods to implement the functionality. This detail is typically not useful to an application developer trying to optimize HIS OWN code, and only serves to make the report hard to read. PerfMonitor avoids this problem by grouping all functions that are part of the operating system. A method is defined to be part of the operating system if it is implemented in a DLL that is under the c:\windows directory. Once a call is made into an OS function, any further calls are grouped into the existing frame until a method outside the OS is called. This feature can be turned off with the /noOSGrouping qualifier if desired.
This grouping greatly simplifies the CPU reports PerfMonitor generates. It does, however, mean that the meaning of 'exclusive' time for OS functions is different from that for user functions. For OS functions, exclusive time means the sample might have occurred in the function itself, but it might also have occurred in any OS function that function called.
By default PerfMonitor collects all the information needed to map the information collected at runtime (return addresses) into names of managed methods. Unmanaged code, however, relies on information stored in program database files (PDBs) to provide this mapping, and by default PerfMonitor does not try to map return addresses of unmanaged code to symbolic names. Typically PerfMonitor can determine the module of the return address but needs the PDB to resolve the function name, resulting in a name like module!?. If even the module cannot be determined, the address is displayed as ?!?.
PerfMonitor can be made to look up return addresses in unmanaged code using the /SymbolsForDlls:dllList qualifier, where dllList is a comma separated list of DLL names (without path or extension). For this to work, the _NT_SYMBOL_PATH environment variable must be set so that the necessary PDB files can be found. See Symbol Resolution for complete details. You may also use the /NoOSGrouping qualifier if you are interested in the detailed stacks associated with DLLs under c:\windows.
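For example (a sketch; the DLL names here are purely illustrative):

```
perfMonitor /SymbolsForDlls:ntdll,kernel32 analyze myTrace.etl
```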
Performance investigations can either be 'top-down' (starting with the Main program and how the time spent there is divided among the methods it calls) or 'bottom-up' (starting with the 'leaf' methods where samples were actually taken and looking for methods that used a lot of time). Both techniques are useful; however, 'bottom-up' is usually a better way to start because methods at the bottom tend to be simpler and thus easier to understand, and it is easier to have intuition about how much CPU they should be using. Top-down analysis will tell you how the 'large' components of your program use CPU. This can be useful for finding large structural problems (e.g. a phase of your program taking much longer than expected).
By default PerfMonitor causes the runtime to log an event at the beginning and end of each .NET garbage collection, as well as every time 100K of objects are allocated. As with all events, the precise time is logged, so the amount of time spent in the GC can be known. Most applications spend less than 10% of the total CPU time in the GC itself. If your application is over this percentage, it usually means that your allocation pattern is such that you are causing many expensive Gen 2 GCs to occur.
If the GC heap is a large percentage of the total memory used, then GC heap optimization using a profiler like ClrProfiler is worthwhile. See Memory Usage Auditing For .NET Applications for more on memory optimization.
During SOME GCs the application itself has to be suspended so the GC can update object references. Thus the application will pause when a GC happens. If these pause times are longer than 100 ms or so, they can impact the user experience. The GC statistics track the maximum and average pause time to allow you to confirm that this bad GC behavior is not happening.
PerfMonitor tracks detailed information about which methods were Just In Time (JIT) compiled. This data is mostly useful for optimizing startup (because that is when most methods get JIT compiled). If large numbers of methods are being compiled, it can noticeably affect startup time. This report tells you exactly how much time is being spent (in fact exactly which methods were compiled and exactly when). If JIT time is high, the NGen tool can be used to precompile the code, and thus eliminate most of this overhead at startup.
Event Tracing for Windows (ETW) is a general purpose, efficient, scalable event logging service built into the Windows operating system. Effectively all Windows components, as well as user-defined components, can define logging 'providers' that can be turned on and off selectively at any time to provide interesting information about what that component is doing. Each event is automatically tagged with the provider it came from, the exact (to 100ns) time it happened, and the process and thread that generated it, as well as event specific information. Events can also be optionally tagged with the exact stack trace. Much of the power of ETW comes from the fact that the operating system providers come built in with a variety of useful events, including process and thread creation, 1 msec CPU sampling (with stack trace), context switches, disk I/O (including file name), DLL loading, virtual memory allocation, and network I/O. On top of these, other Windows components such as the .NET Runtime add their own providers that add component-specific information such as garbage collection events and Just In Time compilation (see advanced data collection for more). The result is a very detailed view of the system as a whole, which is sufficient to diagnose most performance issues.
The data collection is machine wide (you automatically get events from every process in the system), which, among other things, means that data collection logistics are simpler (you don't need to know which process is important). The overhead associated with collection varies depending on the kind of events logged and what the application does. The default events turned on by PerfMonitor will slow the application down by less than 10% (more typically less than 5%). In addition, the Windows development group uses ETW logging extensively, so these logging paths are well tested (you are not running in an unusual mode). These characteristics make it reasonable to leave event collection on 'in production' to 'catch' unusual performance events. The '/Circular' option is particularly valuable for this.
The size of the data files again depends on the events that were turned on and what the applications are doing. For the default events, perfMonitor will typically generate between 2 and 4 Meg of log per second. Thus data collection times less than 100 seconds are reasonable. If your scenario is longer than that, you should consider using a circular buffer to limit the amount of data collected.
PerfMonitor data collection is based on Event Tracing for Windows (ETW). This is a general facility for logging information in a low overhead way. It is used extensively throughout the Windows OS, and in particular it is used by both the Windows OS kernel and the .NET CLR runtime. By default PerfMonitor picks a default set of these events that have high value for the kinds of analysis PerfMonitor can visualize. However, PerfMonitor can also be used simply as a data collector, at which point it can be useful to turn on other events. This is what the /KernelEvents:, /ClrEvents:, and /Provider: qualifiers do
All ETW events log the following information
By far, the ETW events built into the Windows kernel are the most fundamental and useful. Almost any data collection will want to turn at least some of these on. See Kernel ETW Events for more information on particular events. Here we will only provide an overview. It is useful to break kernel events into three groups:
The default group is the group that PerfMonitor turns on by default. The most verbose of these events is the 'Profile' event, which triggers a stack trace every millisecond for each CPU on the machine (so you know what your CPU is doing). Thus on a 4 processor machine you will get 4000 samples (with stack traces) every second of trace time. This can add up. Assume you will get at least 1 Meg of file size per second of trace. If you need to run very long traces (100s of seconds), you should strongly consider using the circular buffer mode to keep the logs under control.
The following Kernel events are not on by default because they can be relatively verbose or are for more specialized performance investigations.
The final set of kernel events are typically useful for people writing device drivers or trying to understand why hardware or low level OS software is misbehaving
In addition to the kernel events, if you are running .NET Runtime code you are likely to want the CLR ETW events turned on as well. PerfMonitor turns a number of these on by default. See CLR ETW Events for more information on these events.
The kernel and CLR ETW events are by far the most important because the data they collect applies to most programs. This is why PerfMonitor turns these providers on by default, and has special qualifiers (/kernelEvents and /clrEvents) to make specifying particular events easy. However, sometimes it is useful to also have events that tell you 'higher level' information. There are literally hundreds of ETW providers built into Windows already, and it is also possible for you to instrument your own code with ETW events. PerfMonitor can control these other providers too. This is what the /providers: qualifier is for.
To turn a ETW provider on, you need three pieces of information
The /providers qualifier is a comma separated list of these specifications, where each component above is separated by a ':'. For example the Silverlight provider happens to be assigned the GUID AA087E0E-0B35-4e28-8F3A-440C3F51EEF1. Thus the specification
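(a sketch using the GUID above, with the keyword bitset and level discussed next; the 'collect' verb is one plausible pairing)

```
perfMonitor /providers:AA087E0E-0B35-4e28-8F3A-440C3F51EEF1:0xFFFFFFF:5 collect
```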
Starts data collection with all the normal CLR and kernel providers, but also turns on the Silverlight provider, passing it the bitset 0xFFFFFFF and a verbose level (5). This is a relatively advanced scenario, as you need to know what each bit in the bitset means, and how to decode the events once they are collected.
You can find the list of providers available in the system by issuing the following 'logman' Windows command
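(the standard form of this logman query is)

```
logman query providers
```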
However this list is very large and overwhelming. It is typically more interesting to find out the set of providers that are available for your particular application. You can do this with the command
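(sketch; PID_OF_PROCESS is explained next)

```
logman query providers -pid PID_OF_PROCESS
```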
Where PID_OF_PROCESS is the process ID for your application. This restricts the providers to just those that are currently active in that process. This is a far smaller list, and much more likely to be relevant to you. Once you have an interesting provider, you can learn what keywords and levels it has by issuing
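(sketch; substitute the provider name or GUID you found in the previous step)

```
logman query providers PROVIDER_NAME_OR_GUID
```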
This will give details on what keywords and levels can be given. You now have the GUID, keyword, and level information that you can then pass to the /providers: qualifier of PerfMonitor.
In version 4.5 of the .NET Runtime, the System.Diagnostics.Tracing.EventSource class was added, which makes it very easy to create your own ETW provider and generate your own events. PerfMonitor has support for turning these providers on and off very easily. The first piece of this support is the ability to discover what EventSources a given application (or DLL) contains. For example
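(a sketch; the command name 'listSources' is an assumption here, inferred from the related 'printSources' command described below, so check perfMonitor's built-in help for the exact verb)

```
perfMonitor listSources MyApp.exe
```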
Will list all EventSources in 'MyApp.exe' (or any DLL that is statically reachable from 'MyApp.exe'). For example the output might be
Event sources in eventProvider.exe.

  EventSource Name       EventSource Guid
  -------------------------------------------------------------------------
  MinimalEventSource   : 7370a335-9b9a-5ba0-63c2-8ce045c35503
  MyCompany-MyApp      : 74cf8ddb-febf-5c1c-67c4-c721006bace0
  AnotherEventSource   : a217696f-5896-584d-31d5-8c90e8d3483e
The /dumpManifests qualifier can be used to specify a directory into which to dump all the EventSource manifests found. For example
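(a sketch; both the 'listSources' verb and the qualifier spelling should be checked against perfMonitor's built-in help)

```
perfMonitor /dumpManifests:. listSources MyApp.exe
```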
Will dump three .manifest.xml files into '.' (the current directory), one for each of the event sources. This gives detailed information about the keywords available in the EventSource, as well as all the events and their payload schemas.
Once you have determined which EventSource you are interested in, perfMonitor supports the syntax @filename#EventSourceName to specify EventSources in the /providers: qualifier. For example, you can turn on the 'MinimalEventSource' by specifying
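(sketch, using the @filename#EventSourceName syntax just described)

```
perfMonitor /providers:@MyApp.exe#MinimalEventSource run MyApp.exe
```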
Which runs 'MyApp.exe' with the MinimalEventSource turned on. If you don't specify the # syntax, PerfMonitor assumes you want all EventSources turned on. Thus
Turns on all three EventSources in MyApp.exe and runs the application. When using the 'run' command, you can even exclude the file name (it is assumed to be the exe that is run). Thus
Is equivalent to the previous example. The command above will turn on the kernel and CLR events as well as the EventSources. You would need to specify the /clrEvents:None and /kernelEvents:Process,Image qualifiers to suppress all but the least verbose events. This is common enough that PerfMonitor has a special command called 'monitorRun' that does just this. Thus
Is really shorthand for
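As a sketch, the pair of commands being equated is (MyApp.exe is a stand-in for your program):

```
perfMonitor monitorRun MyApp.exe

perfMonitor /clrEvents:None /kernelEvents:Process,Image /providers:@MyApp.exe run MyApp.exe
```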
Once you have created the ETL file, you can print out the formatted text from EventSources using the printSources command
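(sketch; the file name defaults to PerfMonitorOutput.etl as noted below)

```
perfMonitor printSources PerfMonitorOutput.etl
```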
Will pretty print the events in the PerfMonitorOutput.etl file to standard output. Use the /outputFile=FILE qualifier to send the output elsewhere. Monitoring and then printing the EventSources' output as text is common enough that there is a shortcut 'monitorPrint' command
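(sketch)

```
perfMonitor monitorPrint MyApp.exe
```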
Will monitor the app (turn on all its EventSources), and then print the resulting output. If XML output is desired instead, you can use the 'monitorDump' shortcut.
Finally, perfMonitor also supports real-time 'listening' to EventSources. For example
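(a sketch; the 'listen' verb is an assumption inferred from the description here, so check perfMonitor's built-in help for the exact command)

```
perfMonitor listen MyApp.exe
```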
Will listen to all the EventSources in 'MyApp' and print them out in real time. Ctrl-C ends the session.
At collection time, when a CPU sample or a stack trace is taken, it is represented by an address in memory. This memory address needs to be converted to symbolic form to be useful for analysis. This happens in two steps.
If the first step fails (uncommon), then the address is given the symbolic name ?!? (unknown module and method). However if the second step fails (more common) then you can at least know the module and the address is given the symbolic name module!?.
Code that does not belong to any DLL must have been dynamically generated. If this code was generated by the .NET Runtime by compiling a .NET method, it should have been decoded by PerfMonitor. However, if you specified /NoRundown or the log file is otherwise incomplete, it is possible that the information necessary to decode the address has been lost. More commonly, however, there are a number of 'anonymous' helper methods that are generated by the runtime, and since these have no name, there is not much to do except leave them as ?!?. These helpers are typically uninteresting (they don't have much exclusive time) and can be ignored during analysis. They typically occur at the boundary of managed and unmanaged code.
Code that was not generated at runtime is always part of the body of a DLL, and thus the DLL name can always be determined. Precompiled managed code lives in native (NGEN) images, which have '.ni' in their name, and the information needed to decode it should be in the ETL file PerfMonitor collected. If you see unknown function names in modules that have '.ni' in them, it implies that something went wrong with CLR rundown (see ?!? methods). For unmanaged code (DLLs that do not have '.ni'), the addresses need to be looked up in the symbolic information associated with that DLL. This symbolic information is stored in program database files (PDBs), and it can be fairly expensive (10s of seconds or more) to resolve a large trace. Because of this, PerfMonitor by default only fully decodes 'high value' DLLs including
This is a good set of defaults for most managed code development. However, if your performance problem deals with unmanaged code, you will want other DLLs decoded. Currently the only way to do this is by specifying /SymbolsForDlls:dll1,dll2 ... when the data is analyzed, where the DLLs do NOT have their file name extension or path. (TODO: Improve the experience). The first time you use the /SymbolsForDlls qualifier, you will also want to use /PrimeSymbols, as explained below.
When a trace is opened for analysis, PerfMonitor does the symbolic lookup for unmanaged DLLs. Click the 'log' button in the lower right corner to see detailed diagnostic information about symbol lookup. An important part of this lookup is finding the program database (PDB) files. Like most diagnostic tools, PerfMonitor uses the _NT_SYMBOL_PATH environment variable for a semicolon separated list of places to look for PDB files. If you have unmanaged executables that you built yourself, this environment variable should include paths to the PDBs that your compiler generated.
The _NT_SYMBOL_PATH variable can have a special syntax of SRV*cacheDir*symbolServer, where cacheDir is a directory on the local machine and symbolServer is a network or web path. This directs PerfMonitor to look up the PDB on the remote symbol server, copy the PDB to cacheDir, and then decode the symbols. If the PDB is archived in a symbol server, looking up the PDB is easy (you point at a symbol server), but it can also be VERY slow (10s of seconds or more because of network latencies). The most important use of symbol servers is to get the PDBs for the Windows operating system DLLs.
Because symbol servers can be so slow, and often you don't need the extra symbol information anyway (e.g. if you are doing a pure managed investigation), by default PerfMonitor does NOT look up symbols in NON-LOCAL locations (it will look in all machine-local locations, including symbol server cache locations). However, this does NOT work well the FIRST time. The FIRST time you must use the /PrimeSymbols option when launching PerfMonitor (TODO: improve the experience). This tells PerfMonitor to take the time to look up the PDBs on symbol servers (and prime the local symbol server cache). After doing this once, you should not have to do it again (until OS DLLs get updated).
By far, the most common unmanaged DLLs of interest are the DLLs that Microsoft ships as part of the operating system. Thus if you don't specify a _NT_SYMBOL_PATH, PerfMonitor uses the following 'standard' one
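(this value is reconstructed from the cache directory and symbol server named just below)

```
SRV*%TEMP%\symbols*http://msdl.microsoft.com/download/symbols
```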
This tells PerfMonitor to look up PDBs at the standard Microsoft PDB server http://msdl.microsoft.com/download/symbols and cache them locally in %TEMP%\symbols. Thus by default you can always find the PDBs for standard Microsoft DLLs.
Thus typically all you need to do to get good symbols is:
Run PerfMonitor /PrimeSymbols once when doing your first analysis. By default this fetches the PDBs for the important OS and CLR DLLs from the default Microsoft symbol server.
Afterward, just run PerfMonitor normally. It will use the cached symbols and will be (reasonably) fast (because it finds them locally on your computer).
If you are investigating performance problems in unmanaged DLLs or EXEs that did not come from Microsoft (e.g. you built them yourself), you have to set _NT_SYMBOL_PATH to include the location of these PDBs before launching PerfMonitor.
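Putting these steps together for a non-Microsoft DLL, a first-run session might look like the sketch below. C:\YourApp\bin is a hypothetical placeholder for wherever your own PDBs live, and '...' stands for whatever PerfMonitor command line you would otherwise run:

```shell
REM Include your own PDBs in addition to the default Microsoft symbol server.
set _NT_SYMBOL_PATH=C:\YourApp\bin;SRV*%TEMP%\symbols*http://msdl.microsoft.com/download/symbols

REM First analysis only: allow lookups on remote symbol servers
REM (this primes the local cache so later runs stay fast).
PerfMonitor /PrimeSymbols ...
```

On subsequent runs you can drop /PrimeSymbols; the PDBs fetched the first time are found in the local cache.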
Here are some tips for troubleshooting common issues.
Because of the expense of looking up symbols for unmanaged DLLs, PerfMonitor requires a small amount of work by the user to decode them. First, by default PerfMonitor only decodes a handful of important operating system DLLs. You must use /SymbolsForDlls:dll1,dll2 to specify more. Second, even for these important DLLs, PerfMonitor NEVER tries to find symbols anywhere except on the local machine unless the /PrimeSymbols option is given. Thus this option must be given the FIRST time you use PerfMonitor and wish to have symbols. Finally, if the DLL was not shipped by Microsoft, the default symbol search path is almost certain NOT to work; you need to tell PerfMonitor where to find the symbols by setting the _NT_SYMBOL_PATH variable.
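For instance, to get symbols for two non-default DLLs on a first run, the two options can be combined as sketched below (myLib1 and myLib2 are hypothetical DLL names, and '...' stands for the rest of your PerfMonitor command line):

```shell
REM Request symbol decoding for extra DLLs, and allow remote
REM symbol server lookup since this is the first run.
PerfMonitor /SymbolsForDlls:myLib1,myLib2 /PrimeSymbols ...
```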
Thus if this is the first time you have used PerfMonitor, the symbols are important to your analysis, and the DLLs were created by Microsoft, then you can run PerfMonitor with the /PrimeSymbols option.
If that does not work, look carefully at the *.convertlog.txt file that is generated, and see Symbol Resolution for more complete information.
When a sample is taken, the ETW system attempts to take a stack trace. For a variety of reasons this can fail before a complete stack is captured. PerfMonitor uses the heuristic that all stacks should end in a frame in a particular OS DLL (ntdll), which is responsible for creating threads. If a stack does not end there, PerfMonitor assumes it is broken and injects a pseudo-node called 'BROKEN' between the thread and the part of the stack that was captured (at the very least it will have the address where the sample was taken). Thus BROKEN stacks should always be direct children of some frame representing an OS thread.
When the number of BROKEN stacks is small (say < 3% of total samples), they can simply be ignored. This is the common case. However, the more broken stacks there are, the less useful a 'top-down' analysis (using the CallTree view) is, because some non-trivial fraction of the samples are not placed in their proper position, giving you skewed results near the top of the stack. A 'bottom-up' analysis (where you look first at the methods where samples occurred) is not affected by broken stacks (although as that analysis moves 'up the stack', it can be affected).
Broken stacks occur for the following reasons:
If you are profiling a 64-bit process, there is a pretty good chance that you are being affected by scenario (2) above. There are three work-arounds to broken stacks in that instance.
C:\Windows\Microsoft.NET\Framework\v2.0.50727\NGen install YourApp.exe.
You will have to repeat this every time your application is recompiled.

Missing stack frames are different from a broken stack because it is frames in the 'middle' of the stack that are missing. Typically only one or maybe two methods are missing. There are three basic reasons for missing frames.
While missing frames can be confusing and thus slow down analysis, they rarely truly block it. Missing frames are the price paid for profiling unmodified code in a very low overhead way.
PerfMonitor emits a ? for any program address that it cannot resolve to a symbolic name. See Troubleshooting Symbols and Symbol Resolution for more.
If the stack trace taken at sample time does not terminate in the OS DLL that starts threads, the stack is considered broken. See Broken Stacks for more.
The algorithm used to crawl the stack is not perfect. In some cases there is not sufficient information on the stack to quickly find the caller. Also, compilers perform inlining, tail calls, and other optimizations that literally remove the frame completely at runtime. The good news is that while sometimes confusing, it is usually pretty easy to fill in the gaps. See Missing Frames for more.
```csharp
/// This class is simply an example of a CPU bound computation. It does it recursively to
/// make the example more interesting. The main entry point is the RecSpin routine.
class Spinner
{
    public static int aStatic = 0;

    // Spin for 'timeSec' seconds.  We do only 1 second in this
    // method, doing the rest in the helper.
    public static void RecSpin(int timeSec)
    {
        if (timeSec <= 0)
            return;
        --timeSec;
        SpinForASecond();
        RecSpinHelper(timeSec);
    }

    // RecSpinHelper is a clone of RecSpin.  It is repeated
    // to simulate mutual recursion (a more interesting example).
    static void RecSpinHelper(int timeSec)
    {
        if (timeSec <= 0)
            return;
        --timeSec;
        SpinForASecond();
        RecSpin(timeSec);
    }

    // SpinForASecond repeatedly calls DateTime.Now for
    // 1 second.  It also does some work of its own in this
    // method so we get some exclusive time to look at.
    static void SpinForASecond()
    {
        DateTime start = DateTime.Now;
        for (; ; )
        {
            if ((DateTime.Now - start).TotalSeconds > 1)
                break;

            // Do some work in this routine as well.
            for (int i = 0; i < 10; i++)
                aStatic += i;
        }
    }
}
```