How Can a Stack Trace Point to the Wrong Line (the "return" Statement) - 40 Lines Off How Can a Stack Trace Point to the Wrong Line (the "return" Statement) - 40 Lines Off asp.net asp.net

How Can a Stack Trace Point to the Wrong Line (the "return" Statement) - 40 Lines Off


Can PDBs be off more than 2 or 3 lines?

You give the statement that you have never seen PDBs off more than a few lines. 40 lines seem to be too much, especially when the decompiled code doesn't look much of a difference.

However, that's not true and can be proven by a 2 liner: create a String object, set it to nulland call ToString(). Compile and run. Next, insert a 30 line comment, save the file, but do not recompile. Run the application again. The application still crashes, but gives a 30 lines difference in what it reports (line 14 vs. 44 in the screenshot).

It is not related at all to the code that gets compiled. Such things can easily happen:

  • code reformat, which e.g. sorts the methods by visibility, so the method moved up 40 lines
  • code reformat, which e.g. breaks long lines at 80 characters, usually this moves things down
  • optimize usings (R#) which removes 30 lines of unneeded imports, so the method moved up
  • insertion of comments or newlines
  • having switched to a branch while the deployed version (matching the PDB) is from trunk (or similar)

PDBs off by 30 lines

How can this happen in your case?

If it's really as you say and you seriously reviewed your code, there are two potential issues:

  • EXE or DLL does not match to PDBs, which can easily be checked
  • PDBs do not match to source code, which is harder to identify

Multithreading can set objects to null when you least expect it, even if it has been initialized before. In such a case, NullReferenceExceptions can not only be 40 lines away, it can even be in a totally different class and therefore file.

How to continue

Capture a dump

I'd first try to get a dump of the situation. This allows you to capture the state and look at everything in detail without the need of reproducing it on your developer machine.

For ASP.NET, see the MSDN blog Steps to Trigger a User Dump of a Process with DebugDiag when a Specific .net Exception is Thrown or Tess' blog.

In any case, always capture a dump including full memory. Also remember to collect all necessary files (SOS.dll and mscordacwks.dll) from the machine where the crash occured. You can use MscordacwksCollector (Disclaimer: I'm the author of it).

Check the symbols

See if the EXE/DLL really matches to your PDBs. In WinDbg, the following commands are helpful

!sym noisy.reload /flm!lmi <module>

Outside WinDbg, but still using debugging tools for Windows:

symchk /if <exe> /s <pdbdir> /av /od /pf

3rd party tool, ChkMatch:

chkmatch -c <exe> <pdb>

Check the source code

If PDBs match to DLLs, the next step is to check whether the source code belongs to the PDBs. This is best possible if you commit PDBs to version control together with the source code. If you did that, you can search the matching PDBs in source control and then get the same revision of source code and PDBs.

If you didn't do that, you're unlucky and you should probably not use source code but work with PDBs only. In case of .NET, this works out pretty well. I'm debugging a lot in 3rd party code with WinDbg without receiving the source code and I can get pretty far.

If you go with WinDbg, the following commands are useful (in this order)

.symfix c:\symbols.loadby sos clr!threads    ~#s!clrstack!pe

Why code is so important on StackOverflow

Also, I looked at the code of the View() method, and there is no way for it to throw a NullReferenceException

Well, other people have made similar statements before. It's easy to overlook something.

The following is a real world example, just minimized and in pseudo code. In the first version, the lock statement didn't exist yet and DoWork() could be called from several threads. Quite soon, the lock statement was introduced and everything went well. When leaving the lock, someobj will always be a valid object, right?

var someobj = new SomeObj(); private void OnButtonClick(...){    DoWork();}var a = new object();   private void DoWork(){    lock(a) {        try {            someobj.DoSomething();            someobj = null;            DoEvents();                     }        finally        {            someobj = new SomeObj();        }    }   }

Until one user reported the same bug again. We were sure that the bug was fixed and this was impossible to happen. However, this was a "double-click user", i.e. someone who does a double click on anything that can be clicked.

The DoEvents() call, which was of course not in such a prominent place, caused the lock to be entered again by the same thread (which is legal). This time, someobj was null, causing a NullReferenceException in a place where it seemed impossible to be null.

That second time, it was return boolValue ? RedirectToAction("A1","C1") : RedirectToAction("A2", "C2"). The boolValue was an expression which could not have thrown the NullReferenceException

Why not? What is boolValue? A property with a getter and setter? Also consider the following (maybe a bit off) case, where RedirectToAction takes only constant parameters, looks like a method, throws an exception but is still not on the callstack. This is why it is so important to see code on StackOverflow...

Screenshot: method with constant parameters not on callstack


I have seen this kind of behavior in production code once. Although the details are a little vague (It was approximately 2 years ago, and although I can find the email, I don't have access to the code anymore, and neither the dumps etc.)

FYI, this is what I wrote to the team (very small parts from the large mail) -

// Code at TeamProvider.cs:line 34Team securedTeam = TeamProvider.GetTeamByPath(teamPath); // Static method call.

"No way null reference exception can happen here."

Later, after more dump diving

"Findings -

  1. The issue was happening in DBI because it did not have root/BRH team. UI is not handling the null returned by CLib gracefully, and hence the exception.
  2. The stack trace shown on UI was misleading, and was due to the fact that Jitter and CPU can optimize / reorder instructions, causing stack traces to "lie".

Digging into a process dump revealed the issue, and has been confirmed that DBI indeed did not have above mentioned team."


I think, the thing to note here is the statement in bold above, in contrast with your analysis and statement -

"I just looked at the decompiled source, and it does not appear to have been rearranged.", or

"Production build running on my local machine shows the correct line number."

The idea is that optimizations can happen at different levels.. and those done at compile time are just some of them. Today, especially with Managed environment like .Net, there are actually relatively fewer optimizations done while emitting IL (Why should 10 compilers for 10 different .Net languages try to do same set of optimizations, when the emitted Intermediate Language code will be further transformed into machine code, either by ngen, or Jitter).

Hence what you have observed, can only be confirmed by looking at the jitted machine code (aka assembly) from a dump from the production machine.


One question I can see coming is - Why would Jitter emit different code on Production machine, compared to your machine, for the same build?

Answer - I don't know. I am not a Jit expert, but I do believe it can... because as I said above.. Today these things are way more sophisticated compared to technologies used 5-10 years ago. Who knows, what all factors.. like "memory, number of CPUs, CPU load, 32 bit vs 64 bit, Numa vs Non-Numa, Number of times a method has executed, how small or big a method is, Who calls it, what does it call, how many times, access patterns for memory locations, etc etc" does it look at while making these optimizations.

For your case, so far only you can reproduce it, and only you have access to your jitted code in production. Hence, (If I may say so :)) this is the best answer anyone can come up with.


EDIT:An important difference between the Jitter on one machine vs the other, can also be the version of the jitter itself. I'd imagine that as several patches and KBs are released for .net framework, who knows what differences of optimization behavior jitter with even minor version differences may have.

In other words, it is not sufficient to assume that both machines have same Major version of the framework (lets say .Net 4.5 SP1). Production may not have patches which are released every day, but your dev / private machine may have patch released last Tuesday.


EDIT 2: Proof of concept - i.e. Jitter optimizations can lead to lying stack traces.

Run following code yourself, Release build, x64, Optimizations on, all TRACE and DEBUG turned off, Visual Studio Hosting Process turned off. Compile from visual studio, but run from explorer. And try to guess which line the stack trace will tell you the exception is at?

class Program{    static void Main(string[] args)    {        string bar = ReturnMeNull();        for (int i = 0; i < 100; i++)        {            Console.WriteLine(i);        }        for (int i = 0; i < bar.Length; i++)        {            Console.WriteLine(i);        }        Console.ReadLine();        return;    }    [MethodImpl(MethodImplOptions.NoInlining)]    static string ReturnMeNull()    {        return null;    }}

Unfortunately, after few attempts, I still cannot reproduce exact problem you have seen (i.e. error on return statement), because only you have access to exact code, and any specific code pattern it may have. Or, once again, it is some other Jitter optimization, which is not documented, and hence hard to guess.


Just a thought, but the one thing that I can think of is that perhaps there is a possibility that your build definition/configuration is pushing out an out of synch compiled version of your application dll(s) and this is why you see the discrepancy on your machine when you look up the line number from the stacktrace.