Fast capture stack trace on windows / 64-bit / mixed mode Fast capture stack trace on windows / 64-bit / mixed mode windows windows

Fast capture stack trace on windows / 64-bit / mixed mode


9-1-2015 - I've located original function which gets called by process hacker, and that one was

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dllOutOfProcessFunctionTableCallback

it's source code - which was here:https://github.com/dotnet/coreclr/blob/master/src/debug/daccess/fntableaccess.cpp

From there I have owner of most of changes in that source code - Jan Kotas (jkotas@microsoft.com) and contacted him about this problem.

From: Jan Kotas <jkotas@microsoft.com>To: Tarmo Pikaro <tapika@yahoo.com> Sent: Friday, January 8, 2016 3:27 PMSubject: RE: Fast capture stack trace on windows 64 bit / mixed mode......The mscordacwks.dll is called mscordaccore.dll in CoreCLR / github repro. The VS project files are auto-generated for it during the build (\coreclr\bin\obj\Windows_NT.x64.Debug\src\dlls\mscordac\mscordaccore.vcxproj).You should be able to build and debug CoreCLR to understand how it works....From: Jan Kotas <jkotas@microsoft.com>To: Tarmo Pikaro <tapika@yahoo.com> Sent: Saturday, January 9, 2016 2:02 AMSubject: RE: Fast capture stack trace on windows 64 bit / mixed mode...> I've tried to replace > C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll dll loading > with C:\Prototyping\dotNet\coreclr-master\bin\obj\Windows_NT.x64.Debug\src\dlls\mscordac\Debug\mscordaccore.dll> loading (just compiled), but if previously I could get mixed mode stack trace correctly:> ...mscordacwks.dll is tightly coupled with the runtime. You cannot mix and match them between runtimes.What I meant is that you can use CoreCLR to understand how this works.

But then he recommended this solution which was working for me:

int CaptureStackBackTrace3(int FramesToSkip, int nFrames, PVOID* BackTrace, PDWORD pBackTraceHash){    CONTEXT ContextRecord;    RtlCaptureContext(&ContextRecord);    UINT iFrame;    for (iFrame = 0; iFrame < nFrames; iFrame++)    {        DWORD64 ImageBase;        PRUNTIME_FUNCTION pFunctionEntry = RtlLookupFunctionEntry(ContextRecord.Rip, &ImageBase, NULL);        if (pFunctionEntry == NULL)            break;        PVOID HandlerData;        DWORD64 EstablisherFrame;        RtlVirtualUnwind(UNW_FLAG_NHANDLER,            ImageBase,            ContextRecord.Rip,            pFunctionEntry,            &ContextRecord,            &HandlerData,            &EstablisherFrame,            NULL);        BackTrace[iFrame] = (PVOID)ContextRecord.Rip;    }    return iFrame;}

This code snipet still is missing backtrace hash calculation, but this is something can can be added afterwards.

It's very import also to note that when debugging this code snipet you should use native debugging, not mixed mode (C# project by default use mixed mode), because it somehow disturbs stack trace in debugger. (Something to figure out how and why such distortion happens)

There is still one missing piece of puzzle - how to make symbol resolution fully resistant to FreeLibrary / Jit code dispose, but this is something I need to figure out still.

Please note that RtlVirtualUnwind will most probably work only on 64-bit architecture, not on arm or 32-bit.

One more funny thing is that there exists function RtlCaptureStackBackTracewhich somehow resembles windows api function CaptureStackBackTrace - but they somehow differ - at least by naming. Also if you check RtlCaptureStackBackTrace - it calls eventually RtlVirtualUnwind - you can check it from Windows Research Kernel source codes

RtlCaptureStackBackTrace>RtlWalkFrameChain>RtlpWalkFrameChain>RtlVirtualUnwind

But what I have tested RtlCaptureStackBackTrace does not works correctly.Unlike function RtlVirtualUnwind above.

It's a kinda magic. :-)

I'll continue this questionnaire with phase 2 question - in here:

Resolve managed and native stack trace - which API to use?


x64 Stack walking is complicated as you have already found out. A simple alternative is to simply not do it but leave the hard things to the OS ETW stackwalker. That works and it is much faster than you will ever get.

You can take advantage of it by emitting your own ETW event. Before that you need to start an ETW session for your event provider and enable stack walking for your provider. There is a catch on Windows 7 where it does not work unless the managed stack frames are all NGenned because the x64 ETW Stackwalker will stop if he finds a stack frame which is not in any loaded module which is true for JITed code.

Starting with Windows 8 the ETW Stackwalker will always walk the first MB of the stack for stack frames which fixes the JIT problem. The JIT compiler emits Unwind Infos for the generated code if ETW tracing is on and registers it via RtlAddGrowableFunctionTable which makes it possible to walk the stack fast from within the kernel in the first place. Things work differently when ETW tracing is not enabled for compatibility reasons.

If you are after malloc/free new/delete memory leaks you can also use the OS bultin capabilities of Heap allocation tracing which already exists since Windows 7. See xperf -help start and https://randomascii.wordpress.com/2015/04/27/etw-heap-tracingevery-allocation-recorded/ for more information about heap allocation tracing. You can enable it for an already running process with no problems. The downside is that for any real world application the generated data is huge. But if you are after big allocations only then it can help to track only VirtualAlloc calls which can also be enabled with minimal overhead.

Managed code since .NET 4.5 has also its own ETW allocation tracing provider with full stack walking even on x64 Windows 7 because it does a full managed stackwalk by itself. More infos can be found in the CoreClr Sources at:ETW::SamplingLog::SendStackTrace in https://github.com/dotnet/coreclr/blob/master/src/inc/eventtracebase.hfor many more details.

That is only a rough outline what is possible. To really get all of the necessary details would take a whole book I fear. And I am still learning new things every day.

Here is a heapalloc.cmd script which you can use to track heap allocations. By default it records into a 500MB ring buffer if your leak builds up over longer periods of time recording all allocation stacks without condensing them at runtime will not work out with WPA. But you could post process a huge ETL file and write your own viewer for it.

@echo off setlocal enabledelayedexpansionREM consider using a different drive for ETL output to prevent slowing down REM your application and to prevent lost buffersset OUTDIR=C:\TEMPset OUTFILENAME=HeapTracing.etlREM Final output fileset OUTFILE=!OUTDIR!\!OUTFILENAME!set CLRUNDOWNFILE=!OUTDIR!\clr_HeapDCend.etlset KERNELFILE=!OUTDIR!\kernel.etlset CLRSESSIONFILE=!OUTDIR!\clrHeapSession.etlset HEAPUSERFILE=!OUTDIR!\HeapUserSession.etlREM Default is allocation and realloc to track memory leaksREM HeapFree is the other option to track double free callsset HEAPTRACINGFLAGS=HeapAlloc+HeapRealloc if "%3" NEQ "" (echo Overriding Heap Tracing Flags with: %3set HEAPTRACINGFLAGS=%3)if "%1" EQU "-start" (     call :StartTracing -PidNewProcess %2    goto :Exit ) if "%1" EQU "-attachPid" (     call :StartTracing -Pids %2    goto :Exit ) if "%1" EQU "-startNext" (    reg add "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /t REG_DWORD /d 1 /f    if not %errorlevel% == 0 goto failure    call :StartTracing -Pids 0    goto :Exit)if "%1" EQU "-stop" (     set XPERF_CreateNGenPdbs=1    xperf -start ClrRundownSession -on e13c0d23-ccbc-4e12-931b-d9cc2eee27e4:0x118:5+a669021c-c450-4609-a035-5af59af4df18:0x118:5 -f "!CLRUNDOWNFILE!" -buffersize 256 -minbuffers 256 -maxbuffers 512     call :WaitUntilRundownCompleted "!CLRUNDOWNFILE!"    xperf -stop -stop ClrSession ClrRundownSession HeapSession | findstr /V identifiable 2> NUL    echo Merging profiles    REM Reset symbol path to create the pdbs files in the output directory with in the directory with the same name like our etl file    set TMPSYMBOLPATH=!_NT_SYMBOL_PATH!    REM Each tool is using a different pdb cache folder. If you are using them side by side     REM you have to wait a long time to refresh the pdb cache. To spare the waiting time we use     REM the pdb cache folder from WPR    mkdir C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 2> NUL    set _NT_SYMBOL_PATH=srv*C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache     mklink /D "!OUTFILE!.NGENPDB" C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache  2> NUL    echo Managed PDBs are stored at: !OUTFILE!.NGENPDB. If you want to transfer the etl do not forget to copy this directory with the pdbs as well.     echo Merging ETL files and generating native pdbs    xperf -merge  "!KERNELFILE!" "!CLRSESSIONFILE!" "!CLRUNDOWNFILE!" "!HEAPUSERFILE!" "!OUTFILE!"    set _NT_SYMBOL_PATH=!TMPSYMBOLPATH!    echo !OUTFILE! was created    if "%2" NEQ "" reg delete "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /f 2> NUL    goto :Exit ) goto Usage::StartTracingxperf -start ClrSession -on  Microsoft-Windows-DotNETRuntime:5 -f "!CLRSESSIONFILE!" -buffersize 128 -minbuffers 256 -maxbuffers 512 xperf -on PROC_THREAD+LOADER+latency+virt_alloc -stackwalk VirtualAlloc  -f "%KERNELFILE%"xperf -start HeapSession -heap %1 %2 -BufferSize 1024 -MinBuffers 128 -MaxBuffers 1024 -stackwalk %HEAPTRACINGFLAGS% -f "!HEAPUSERFILE!" -FileMode Circular -MaxFile 1024exit /BREM Wait until writing to ETL file has stopped by checking its file size:WaitUntilRundownCompleted:StillWriting    for %%F in (%1) do set "size=%%~zF"    timeout /T 1  > nul    for %%F in (%1) do set "size2=%%~zF"    if "!size!" EQU "" goto :EndWriting    if "!size!" NEQ "!size2!" goto StillWriting:EndWritingtimeout /T 1  > nulexit /B:Usage    echo Usage:     echo HeapAlloc.cmd -start [executable] or -stop    echo               -start [executable] Start a trace session     echo               -startNext [executable] Start heap tracing for all subsequent calls to executable.    echo               -attachPid ddd Start a trace session for specified process    echo               -stop  [executable] Stop a trace session     echo Examples    echo     HeapAlloc.cmd -startNext devenv.exe    echo     HeapAlloc.cmd -stop      devenv.exe    echo To attach to a running process    echo     HeapAlloc.cmd -attachPid dddd    echo     HeapAlloc.cmd -stop     echo You must call -stop for your executable if you have used -start or startNext because heap allocation tracing will enabled until you stop it!goto :Exit :failure    echo Error occuredgoto :Exit:Exit


25.1.2016 Writing as separate issue, as complementary information.

For stack unique id CaptureStackBackTrace uses simple sum of all instruction pointers - idea is borrowed from: "Windows_Research_Kernel(sources)\WRK-v1.2\base\ntos\rtl\amd64\stkwalk.c":

    size_t hashValue = 0;    for (int i = 0; i < nFrames; i++)        hashValue += PtrToUlong(BackTrace[i]);    *pBackTraceHash = (DWORD)hashValue;

I'm not sure about last conversion - some specify last parameter as DWORD, some as ulong64, but it's not relevant. The main problem with this calculation is that it's not unique enough. For case of recursive function calls - if you have call order:

func1func2func3

Stack trace for:

func1func3func2

Will be identical.

What I have debugged - for memory leak detection I'm getting 62876 false hits - unique stack id calculation is not reliable enough.

I've bit switched formula to:

static DWORD crc32_tab[] ={    0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,    0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,    0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,    0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,    0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,    0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,    0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,    0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,    0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,    0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,    0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,    0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,    0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,    0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,    0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,    0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,    0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,    0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,    0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,    0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,    0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,    0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,    0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,    0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,    0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,    0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,    0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,    0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,    0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,    0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,    0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,    0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,    0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,    0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,    0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,    0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,    0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,    0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,    0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,    0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,    0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,    0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,    0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d};if (pBackTraceHash){    size_t hashValue = 0;    for( int idxFrame = 0; idxFrame < (int)iFrame; idxFrame++ )    {        unsigned char* p = (unsigned char*)&BackTrace[idxFrame];        for( int i = 0; i < sizeof(void*); i++ )            hashValue = crc32_tab[ ((hashValue ^ *p++) & 0xFF) ] ^ (hashValue >> 8);    }    *pBackTraceHash = (DWORD)hashValue;}

This algorithm does not give false hits, but slows down execution a little bit.

Also memory leak statistics differs:Unreliable algorithm: Total amount of leaked memory: 48'874'764 / in 371 allocation poolsCrc32 based algorithm: Total amount of leaked memory: 48'874'764 / in 614 allocation pools

Like you see - statistics combines (pools) similar call stack together - less fragmentation, but original call stack is lost. (Incorrect statistics)

May be someone could give me some faster algorithm for this ?