‘Frames below may be incorrect’, or: Stack Walking Requires Symbols

Here’s the symptom – you stop and inspect a stack:

[Screenshot: call stack window, with the warning on its second line]

Note the message at the second line:

Frames below may be incorrect and/or missing. No symbols loaded for XXXXX.dll.

Chances are you read it once years ago and ignored it ever since.  If so, that’s a shame – because the debugger is dead serious about it.

You load some missing symbols (either from the Modules window, or by right-clicking a line in the Call Stack window), and the stack changes. Often the code that was displayed (for the topmost frame where code is available) turns out to be misleading – it’s nowhere to be found on the updated stack!

[Screenshot: the same call stack after loading the missing symbols]

This happens a lot, e.g. when uncaught exceptions are thrown outside your own code, when you pause execution, or when switching to a different thread while stopped at a breakpoint.  Essentially – it happens whenever stack walking has to start from an optimized module without loaded symbols (in particular, MS modules like ntdll above).

First Corollary

Loading MS-symbols is NOT optional!

Hopefully every dev in the civilized world knows what these are and how to get them (MS made it considerably easier since VS2008), so I will not rehash. What is not widely appreciated is that without them there’s a good chance you’re looking at wrong stacks.

Second Corollary

Apparently stack walking is harder than it seems, and depends in some way on debug information.

That was news to me, and called for some research.

Naive Stack Walking

[A full x86-stack-layout tutorial is an undertaking way beyond the extent of my spare time. Try e.g. here for a nice read. The following is a very rough and minimal description]

Ignoring most stack content (function parameters, local variables, calling conventions, exception handling, buffer security and such), a naive layout of an x86 stack frame is something like –

Memory address:   Stack elements:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 0x104FD8   |  | Parameters                 | \
            |  +----------------------------+  |
 0x104FD4   |  | Return address, routine 2  |  |
            |  +----------------------------+   >  Stack frame 3
 0x104FD0   +--| EBP value for routine 2    |  |
               +----------------------------+  |
 0x104FCC   +->| Local data                 | /  <-- Routine 3's EBP
            |  +----------------------------+
 0x104FC8   |  | Return address, routine 3  | \
            |  +----------------------------+  |
 0x104FC4   +--| EBP value for routine 3    |  |
               +----------------------------+   >  Stack frame 4
 0x104FC0      | Local data                 |  | <-- Current EBP
               +----------------------------+  |
 0x104FBC      | Local data                 | /
               +----------------------------+
 0x104FB8      |                            |    <-- Current ESP
                \/\/\/\/\/\/\/\/\/\/\/\/\/\/

Here each saved-EBP slot stores the previous value of EBP (the Extended Base Pointer), the register that marks the start of a stack frame. So obtaining a stack trace should be as simple as

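  1. Walking the chain of EBPs to form the stack of frames. Each saved-EBP slot leads to the saved-EBP slot of its parent frame. (With the standard push ebp / mov ebp, esp prologue, EBP holds the address of the saved-EBP slot itself; the diagram above draws it one slot lower.)
  2. From just above each saved-EBP slot, take the stored return address (EIP) and use it to decipher the calling module and the calling function name.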
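The two steps above can be sketched as a toy simulation. This is only a rough illustration, not how any real debugger is implemented: the stack is modeled as a Python dict, all addresses and return values are made up, and the layout assumes the standard prologue, so `[EBP]` holds the caller's EBP and `[EBP + 4]` holds the return address.

```python
# Simulated 32-bit stack memory: address -> 4-byte value.
# All addresses and values below are invented for illustration.
memory = {
    0x104FC4: 0x104FD0,    # frame 4: saved EBP (points at frame 3's slot)
    0x104FC8: 0x00401234,  # frame 4: return address into routine 3
    0x104FD0: 0x104FE0,    # frame 3: saved EBP (points at frame 2's slot)
    0x104FD4: 0x00401100,  # frame 3: return address into routine 2
    0x104FE0: 0x00000000,  # frame 2: saved EBP of 0 marks the end of the chain
    0x104FE4: 0x00401010,  # frame 2: return address into routine 1
}


def walk_ebp_chain(memory, ebp, max_frames=64):
    """Follow saved EBPs and collect the return address stored above each one."""
    return_addresses = []
    while ebp != 0 and len(return_addresses) < max_frames:
        if ebp not in memory or ebp + 4 not in memory:
            break  # unreadable memory: give up rather than guess
        return_addresses.append(memory[ebp + 4])
        next_ebp = memory[ebp]
        if next_ebp != 0 and next_ebp <= ebp:
            break  # the stack grows down, so a parent frame must sit higher
        ebp = next_ebp
    return return_addresses


frames = walk_ebp_chain(memory, ebp=0x104FC4)
print([hex(address) for address in frames])
```

Note that nothing in this walk needs symbols: it only reads the stack. Symbols come in when the return addresses are turned into names.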

Phase (2) is indeed best done with symbols, but can be achieved with some success (exported functions only) using only linker map files. Either way, this description cannot account for the fact that the partitioning of the stack into frames itself depends on the presence of symbols! There has to be more to the story.

First Exception – FPO

Ever since the Intel 386, both ESP and EBP can serve as reference points for local stack variables, and given the scarcity of registers on x86 systems it was a tempting optimization to drop the dedicated use of EBP altogether. This is called Frame Pointer Omission (FPO), and it does require dedicated stack-walking information in the PDB, as the traditional EBP chain breaks once a single stack frame uses FPO. However, FPO has rarely been used in practice and has been completely disabled in MS builds since Vista, so it cannot possibly account for all occurrences of bad stack traces.
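To see why a single FPO frame hurts, here is a toy simulation under stated assumptions. The functions `startup`, `main`, `mid` and `leaf`, their address ranges and the `FUNCTIONS` table are all hypothetical; the table merely stands in for the kind of per-function unwind data a PDB can carry. `mid` is the FPO function: it pushes no EBP and leaves the register untouched, which is the mild case where a frame silently goes missing. Had it reused EBP as a scratch register, the naive walk could derail entirely.

```python
# Simulated stack (address -> value); every number here is made up.
memory = {
    0x1004: 0x400010,  # return address into startup
    0x1000: 0x0,       # main's saved EBP (0 ends the chain)  <-- main's EBP
    0x0FF0: 0x401100,  # return address into main
    0x0FEC: 0xDEAD,    # mid local (mid is FPO: no saved EBP in its frame)
    0x0FE8: 0xBEEF,    # mid local
    0x0FE4: 0x401200,  # return address into mid
    0x0FE0: 0x1000,    # leaf's saved EBP, still main's   <-- leaf's EBP
}

# Hypothetical unwind records:
# (start, end, name, uses_fpo, bytes of locals below the return address)
FUNCTIONS = [
    (0x400000, 0x401000, "startup", False, 0),
    (0x401000, 0x401180, "main", False, 0),
    (0x401180, 0x401280, "mid", True, 8),
]


def find_function(address):
    for start, end, name, uses_fpo, locals_size in FUNCTIONS:
        if start <= address < end:
            return name, uses_fpo, locals_size
    raise LookupError(hex(address))


def naive_walk(ebp):
    """Pure EBP-chain walk: names the function each return address lands in."""
    callers = []
    while ebp != 0:
        callers.append(find_function(memory[ebp + 4])[0])
        ebp = memory[ebp]
    return callers


def walk_with_unwind_info(ebp):
    """Same walk, but FPO frames are unwound from ESP using the records."""
    callers = []
    # The innermost function (leaf) is assumed to use a frame pointer.
    ret, esp, ebp = memory[ebp + 4], ebp + 8, memory[ebp]
    while True:
        name, uses_fpo, locals_size = find_function(ret)
        callers.append(name)
        if uses_fpo:
            # No saved EBP: the return address sits just above the locals.
            ret_slot = esp + locals_size
            ret, esp = memory[ret_slot], ret_slot + 4
        elif ebp == 0:
            break  # end of the EBP chain
        else:
            ret, esp, ebp = memory[ebp + 4], ebp + 8, memory[ebp]
    return callers


naive = naive_walk(0x0FE0)
informed = walk_with_unwind_info(0x0FE0)
print(naive, informed)
```

In this sketch the naive walk jumps straight from `mid` to `startup` and never reports `main`, while the walk that consults the unwind records recovers all three callers.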

Other Exceptions?

Well, probably – but not documented ones. I have several reasons to believe so:

(1) StackFrameTypeEnum (used by IDiaStackFrame) indeed includes FrameTypeStandard and FrameTypeFPO, but also other frame types (notably FrameTypeFrameData).

(2) The VC team blog did hint that a PDB includes, quote,

the unwind program to execute to walk to the next frame

(3) I’ve witnessed these symptoms on dumps taken from Win7 machines – and AFAIK Win7 binaries were compiled without FPO.

(4) The venerable John Robbins answered an email of mine, saying:

Yes, unless you have all symbols loaded for a native application, you can never truly trust the stacks. You’re right that MSFT no longer uses FPO, but if the symbols for a DLL in the stack are not loaded, the StackWalk64 API, which all debuggers use, goes through heuristics to walk the stack. Heuristics is a fancy name for guessing. :) For example, if you don’t have the PDB files loaded for a module, the stack walking code will show you the closest symbol to an address. That symbol could actually come from another DLL. Once you load the PDB file, the correct symbol is available so the address symbol will change in the call stack window.
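The "closest symbol" behaviour described in this reply can be illustrated with a small sketch. It is my own guess at the general idea, not the actual StackWalk64 or DbgHelp logic, and the symbol names and addresses are invented. With only exported names available, an address inside an internal function is attributed to the nearest preceding export; once the full symbol table is loaded, the same address resolves to a different name.

```python
import bisect


def nearest_symbol(symbols, address):
    """symbols: list of (start_address, name) sorted by address.

    Returns (name, offset) for the closest symbol at or below address,
    or None if the address precedes every known symbol.
    """
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return name, address - start


# Hypothetical module contents: two exports and one internal function.
exports_only = [(0x77001000, "ExportedA"), (0x77005000, "ExportedB")]
full_symbols = sorted(exports_only + [(0x77002400, "InternalHelper")])

address = 0x77002412
without_pdb = nearest_symbol(exports_only, address)
with_pdb = nearest_symbol(full_symbols, address)
print(without_pdb, with_pdb)
```

A suspiciously large offset from the reported name (here 0x1412 bytes past `ExportedA`) is a hint that the name is only the nearest known symbol rather than the real function.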

Can’t say I really got to the root of this dependence on PDBs, though. If anyone out there cares to shed more light on this, I’d love to hear!


5 Responses to ‘Frames below may be incorrect’, or: Stack Walking Requires Symbols

  1. Barry Kelly says:

    * Standard calling conventions aren’t always used. One of the odder ones is used by Delphi’s RTL in its LStrCat routine, to concatenate multiple strings. It takes a variable number of arguments, but unlike the C calling convention, the callee pops the stack, not the caller. In order to do that, it needs to pop the return EIP off the stack, then pop off all the arguments, before jumping to the return EIP.

    * Odd callbacks can do “clever” things. For example, WNDPROC is just a function pointer, rather than a closure with associated data. One way of getting extra data (so the window procedure knows which window instance it’s dealing with) is by using a thunk: a runtime-generated blob of code which loads extra data before forwarding to the real procedure, which then has the associated data. Your stack trace went wrong with an MFC message dispatch routine; that’s suspect.

    FWIW, John Robbins is wrong in asserting that all debuggers use StackWalk64 (though the case for using it is stronger for 64-bit than StackWalk for 32-bit, as x64 calling convention is much more rigidly standardized); I wouldn’t *necessarily* expect it to understand dynamically generated code in .NET, GCC and LLVM produced code, Delphi-compiled code, LuaJIT, etc.

  2. Ofek Shilon says:

    Barry – thanks! In general I wouldn’t hold it against MS-debuggers if they wouldn’t respect non-MS custom calling conventions.
    Also, while I can’t state anything about ‘all debuggers’, the x86 architecture poses some very hard constraints on stack structure (most importantly, the operation of the ret instruction) – and hence on x86 stack-walking mechanisms in any software platform that is compiled or JITted. In particular, the platforms you noted. I actually find John Robbins’ assertion about all debuggers quite plausible.

  3. dave says:

    I have this exact problem but unfortunately I can’t understand a word you’re saying :(

  4. Anonymous says:

    Hi, I have a thread whose stack is (after a first-chance exception) :

    DLLXX.dll!CDeviceReceiveThreadFunc(void * param) Line 2141 + 0x33 bytes C++
    kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes
    ntdll.dll!___RtlUserThreadStart@8() + 0x27 bytes
    ntdll.dll!__RtlUserThreadStart@8() + 0x1b bytes

    The application is a native Windows service and DLLXX.dll symbols are loaded. I have the DLLXX source code, but the PDBs are from VS6 and I’m debugging in VS2010. After seeing your post I wondered if you could tell me why the DLLXX stack does not seem to be shown correctly: there are calls to functions made before the place where the exception happens that do not appear on it.

    • Ofek Shilon says:

      @Anonymous: My guess is that this is actually a correct stack.
      (1) There’s no problem that I’m aware of in debugging with VS6 symbols,
      (2) There’s no ‘Frames below may be incorrect’ warning,
      (3) The top function name indicates that it is a thread function – i.e., created by a CreateThread (or similar), **on a different stack**.

      You should be able to inspect all the process stacks, e.g. by double clicking different threads in the ‘threads’ window. You might find there your ‘previous’ functions – but the creating thread could have moved on since, or could be waiting on the DeviceReceived thread.
