Presenting at Windows Platform Developers Israel User Group

I’ll be giving a talk at a local Windows Developers user group meeting, titled ‘Undocumented Native Debugging Tricks’. In essence I’ll be surveying as much of my collection of lesser-known VS tricks as time permits, many of which were published on this blog. Abstract quote:

Visual Studio provides great debugging facilities as is – but it also contains a tremendous wealth of useful debugging features that just never matured enough to be documented and supported. Such hidden features range from enhanced interaction with the debuggee, improved debugging productivity, better state diagnostics, better exploration of code flow, and more. This talk will survey as many of those goodies as time permits.

The meeting will be held on Wednesday Sep 7th at 19:00, at the Microsoft offices at Raanana-south junction.

Come and say hello!

Posted in VC++ | Leave a comment

VC++ Version Boundaries

Using a binary built in VC verXXX from a binary built in VC verYYY is very dangerous.

This is very obvious in retrospect, but real life recently forced us to try just that: we migrated to VS2010, and a few 3rd party components still hadn’t. Below are two of the more general lessons we learnt along the way – hope that they might save someone some trouble.

Lesson 1: You can’t pass STL containers as arguments in a call from one version to another.

The memory layout of STL classes is subject to change between major versions. This is true in principle for all internal structures (vftable / vbtable layout, RTTI tables, _com_XXX helpers, whatever), but seems true in practice only for STL containers.

As an example take std::vector, and examine its /d1reportclasslayout dumps. In VS2005:

class ?$vector@HV?$allocator@H@std@@ size(20):
 +---
 | +--- (base class ?$_Vector_val@HV?$allocator@H@std@@)
 | | +--- (base class _Container_base)
 0 | | | _Myfirstiter
 | | +---
 4 | | ?$allocator@H _Alval
 | | <alignment member> (size=3)
 | +---
 8 | _Myfirst
12 | _Mylast
16 | _Myend
 +---

while in VS2010:

class ?$_Vector_val@HV?$allocator@H@std@@ size(20):
 +---
 | +--- (base class _Container_base12)
 0 | | _Myproxy
 | +---
 4 | _Myfirst
 8 | _Mylast
12 | _Myend
16 | ?$allocator@H _Alval
 | <alignment member> (size=3)
 +---

Quite a few changes have taken place – the implementation moved from within vector to the parent _Vector_val, the parent’s member _Container_base::_Myfirstiter was replaced by _Container_base12::_Myproxy, and what have you. The change of interest in this context, however, is the seemingly benign move of the allocator member _Alval, from the start of the object to its end – thereby rendering VS2005-generated vectors completely unreadable to VS2010-generated binaries.
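To make the offset shift concrete, here is a minimal sketch that mimics the two dumps with plain structs. The struct and member names are hypothetical stand-ins (not the real STL internals), and 32-bit pointers are modelled as uint32_t so the numbers match an x86 build:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the VS2005 layout: the (empty) allocator sits
// right after the container-base member and is padded to 4 bytes.
struct Vector2005Layout {
    std::uint32_t myfirstiter;  // models _Container_base::_Myfirstiter
    std::uint8_t  alval;        // models _Alval, followed by 3 padding bytes
    std::uint32_t myfirst;      // offset 8
    std::uint32_t mylast;       // offset 12
    std::uint32_t myend;        // offset 16
};

// Hypothetical stand-in for the VS2010 layout: the allocator moved to the end.
struct Vector2010Layout {
    std::uint32_t myproxy;      // models _Container_base12::_Myproxy
    std::uint32_t myfirst;      // offset 4
    std::uint32_t mylast;       // offset 8
    std::uint32_t myend;        // offset 12
    std::uint8_t  alval;        // offset 16, followed by 3 padding bytes
};

// Same total size, but every data pointer lives at a different offset.
static_assert(sizeof(Vector2005Layout) == sizeof(Vector2010Layout),
              "both layouts occupy 20 bytes");
static_assert(offsetof(Vector2005Layout, myfirst) !=
              offsetof(Vector2010Layout, myfirst),
              "the 'first element' pointer moved");
```

Code compiled against one layout that receives an object built with the other would read the begin pointer from the wrong slot, even though sizeof agrees.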

Lesson 2: You can’t allocate memory in one version and free it in another.

– because the DLLs were linked against different CRT versions, and thus use separate CRT-heaps.

At process startup, every dependent DLL entry-point is called – ‘DllMain’ in user DLLs (by default), ‘CRTDLL_INIT’ for MSVCRXXX.DLL. That makes two such calls, in two CRT DLLs. Each of these calls _heap_init, which includes:

heapinit

Where _crtheap is a static handle, accessible via _get_heap_handle.

Each CRT DLL maintains bookkeeping structures referring to its own heap (essentially linked lists with block usage info). When you delete (or free) a pointer the CRT tries to update its bookkeeping. If the pointer was allocated on another heap – say, one created by a different CRT DLL – the bookkeeping update fails, and all hell breaks loose.
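The failure mode can be illustrated with a toy model. This is only a sketch: ToyCrtHeap is a hypothetical class that stands in for the private bookkeeping of one CRT DLL, and it reports the mismatch politely where a real CRT would corrupt its structures or crash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <set>

// Toy model: each instance plays the role of the heap owned by one CRT DLL.
class ToyCrtHeap {
public:
    // Allocates a block and records it in this heap's bookkeeping.
    void* allocate(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (p != nullptr) blocks_.insert(p);
        return p;
    }

    // Frees a block only if this heap allocated it.
    // Returns false when the pointer is unknown to this heap.
    bool release(void* p) {
        if (blocks_.erase(p) == 0) return false;
        std::free(p);
        return true;
    }

    std::size_t live_blocks() const { return blocks_.size(); }

private:
    std::set<void*> blocks_;  // bookkeeping for blocks owned by this heap
};
```

With two instances, say heap2005 and heap2010, a block obtained from heap2005.allocate() is rejected by heap2010.release(). The usual way around this is to make sure the module that allocates a block is also the one that frees it.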

This is all more than legit.

For one, MS openly declares in various Channel 9 interviews and podcasts that they break binary compatibility in every major VS version. Second, MS did deliver COM, which is an exceptionally stable ABI (I’m not sure there’s even a concept of COM-version). Third, and most importantly, C++ does not have a standard ABI. Moreover, MS – unlike others – goes out of its way to respect backward compatibility. Certainly every framework must be allowed room to evolve, etc. etc. etc – from every angle I can think of, no contract was broken here.

And still.

Still, the scenario of gradually upgrading a multi-vendor app is fairly basic, and I shouldn’t be made to jump through unnecessary hoops to achieve that. Certainly swapping around the layout of std::vector is such an unnecessary migration barrier.

Posted in VC++ | 1 Comment

Breaking on Data Read

Edit: As of Windows 10 the details and code below do not work. A working alternative is detailed at a newer post.


 

You’re probably familiar with Data Breakpoints, and rightfully so: It’s extremely useful to know where a value changes. But did you know that with a little help VS can break when a value is used?

Usage

By ‘little help’ I mean some code. Plenty of free implementations are available: 1, 2, 3, 4. All are very similar (one notable difference discussed below), but I’m used to the first and will use it below.

First, a toy example:

#include "Breakpoint.h"
...
CBreakpoint g_bp;
...
void Whateva()
{
    int a = 3, b;
    g_bp.Set(&a, 4, CBreakpoint::Read);
    b = a; // g_bp breaks!
    g_bp.Clear();
}

The link does mention that you can call CBreakpoint::Clear() from a QuickWatch window (== from anywhere the Expression Evaluator lives, for that matter). What’s even more useful – you can call CBreakpoint::Set() from the debugger with a minor additional cast. While debugging the code above, evaluate the following in any watch window:

g_bp.Set(&a, 4, (CBreakpoint::Condition)3)

image

Internals (well, some, anyway)

Both read and write breakpoints are implemented via debug registers: special registers on an x86 CPU which trigger an ‘int 1’ interrupt (‘debug step’) whenever a pre-specified virtual address is accessed. Dr0-Dr3 hold the addresses to watch, and the debug control register Dr7 both activates each breakpoint and determines its type (11b means read/write).
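As a rough sketch of the Dr7 encoding (following the bit layout in the Intel manuals, not the code of any of the implementations linked above), the value for one breakpoint slot can be composed as follows. The function and enum names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// R/W field values as defined for Dr7 (10b is I/O access, omitted here).
enum class BpType : std::uint32_t { Execute = 0, Write = 1, ReadWrite = 3 };

// LEN field values as defined for Dr7.
enum class BpLen : std::uint32_t { OneByte = 0, TwoBytes = 1, FourBytes = 3 };

// Returns the Dr7 bits that enable breakpoint `slot` (0..3) locally.
// The watched address itself goes into the matching Dr0..Dr3 register.
inline std::uint32_t dr7_bits(unsigned slot, BpType type, BpLen len) {
    assert(slot < 4);
    const std::uint32_t local_enable = 1u << (slot * 2);  // L0..L3
    const std::uint32_t rw  = static_cast<std::uint32_t>(type) << (16 + slot * 4);
    const std::uint32_t ln  = static_cast<std::uint32_t>(len)  << (18 + slot * 4);
    return local_enable | rw | ln;
}
```

For example, a 4-byte read/write breakpoint in slot 0 yields 0x000F0001. An implementation would OR this into the Dr7 member of a CONTEXT structure before calling SetThreadContext.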

All implementations linked above modify the debug registers via SetThreadContext. The documentation includes a grave warning, that only implementations 3 & 4 seem to respect:

Do not try to set the context for a running thread; the results are unpredictable. Use the SuspendThread function to suspend the thread before calling SetThreadContext.

However I’ve never had issues with implementation 1, so I assume in practice usage of SetThreadContext with this particular mask (CONTEXT_DEBUG_REGISTERS) is safe.

This usage makes one wonder – are debug registers indeed part of a thread context? Are they reset on every context switch?

The Intel manuals, vol 3A, section 16.4.2 details the contents of DR7:

The debug control register (DR7) enables or disables breakpoints and sets breakpoint conditions … The flags and fields in this register control the following things:

• L0 through L3 (local breakpoint enable) flags (bits 0, 2, 4, and 6) — Enables (when set) the breakpoint condition for the associated breakpoint for the current task. When a breakpoint condition is detected and its associated Ln flag is set, a debug exception is generated. The processor automatically clears these flags on every task switch to avoid unwanted breakpoint conditions in the new task.

Oh dear. Are hardware breakpoints indeed that useless? Are they indeed blind to reads/writes by other threads?

Well obviously, no. It’s a 1 minute test to set a HW-Bp, write to its address from a different thread and watch it trigger.

It all boils down to a nuance in x86 terminology: tasks are not threads. Windows does not use the task context switching hardware apparatus that x86 offers, so it really is an OS decision whether to store debug registers per thread – and the obvious choice seems to be to store them per process. That is probably the reason calling SetThreadContext with the CONTEXT_DEBUG_REGISTERS mask is safe for non-suspended threads too.

Posted in Debugging, VC++, Win32 | 6 Comments

_VC80_UPGRADE and Warning RC4005 (IDR_MANIFEST Redefinition)

The _VC80_UPGRADE macro seems to cause some confusion, as does the warning

'warning RC4005: 'IDR_MANIFEST' : redefinition.'

While it was tempting to spread these issues over two posts, the fact is they are one. A disk search finds a single reference to both, in afxres.h:

#if defined(_VC80_UPGRADE) && (_VC80_UPGRADE >= 0x0700) && (_VC80_UPGRADE < 0x0800) && defined(IDR_MANIFEST)
  // Handle project upgrade from VC7/VC7.1 for projects with manifest
  #define IDR_MANIFEST    1024
#endif

Which still doesn’t remotely pass as documentation. The closest I could find is this Connect reply:

We left this warning deliberately. Our upgrade support manages to make sure that your old project’s manifest doesn’t clash with the manifest that our tools now insert automatically. Our upgrade conversion warns you of this in its conversion log. However, this compile-time warning serves as a reminder that you should remove the manifest entry from your .rc file. When that’s done, you can undefine the _VC80_UPGRADE symbol.

The relevant conversion log snippet, quoted in the connect entry, is somewhere between obscure and misleading:

Due to the requirement that Visual C++ projects produce an embedded (by default) Windows SxS manifest, manifest files in the project are automatically excluded from building with the Manifest Tool. It is recommended that the dependency information contained in any manifest files be converted to “#pragma comment(linker,”<insert dependency here>”)” in a header file that is included from your source code. If your project already embeds a manifest in the RT_MANIFEST resource section through a resource (.rc) file, the line will need to be commented out before the project will build correctly.

Here’s what I make of it all – and this isn’t anywhere near official, so tread with caution.

Conjecture 1: In VC versions from 0x700 (inclusive) to 0x800 (non inclusive), embedding external manifests via resource scripts was either automatic or encouraged (certainly supported) as a way to declare dependencies upon specific assembly versions. From VS2005 (VC 0x800) onwards, the manifest tool automatically generates and embeds such manifests into the built binaries.

Conjecture 2: Judging by the conversion log (‘If your project already embeds a manifest in the … rc file, the line will need to be commented out before the project will build correctly’) , the project upgrade wizard didn’t handle well the conversion of the RT_MANIFEST resource section. Judging by the Connect response (‘Our upgrade …manages to make sure that your old project’s manifest doesn’t clash…’) the upgrade wizard probably did a good job, but for whatever reasons preferred not to delete the RT_MANIFEST section itself.

If you put your heart into it you might reconcile these two phrasings (maybe by ‘build correctly’ they mean with no warnings? Maybe they just made sure the manifests ‘didn’t clash’, but couldn’t decide which was correct?). Doesn’t matter – one way or another MSFT didn’t feel comfortable with auto-upgrading the RC-manifest and sought ways to encourage you to look into it in person.

Conjecture 3: Setting _VC80_UPGRADE to 0x700 and duplicating the definition of IDR_MANIFEST is just MSFT nudging you to peek into your old .RC script, confirm the choices the auto-update made and delete the entire RT_MANIFEST section when you’re done.

If any of this is true (which I can’t verify), it is a weird, weird hack by the VC team. At the very least I would expect some guidance near the (deliberate) warning origin. Even an informative comment – anything more than ‘Handle VC7.0 upgrade’ would be nice – but why not customize the warning?

#if defined(_VC80_UPGRADE) && (_VC80_UPGRADE >= 0x0700) && (_VC80_UPGRADE < 0x0800)
  #ifdef IDR_MANIFEST
#pragma message("We have reason to believe you are using a custom manifest as an embedded resource. Are you? Please don't. Really. The build process now generates and embeds a manifest automatically and there are better ways to guide it if you must. (Although try not to. It tickles.)")
  #endif
  #define IDR_MANIFEST    1024
#endif

Some Concrete Suggestions

If you did see ‘warning RC4005: ‘IDR_MANIFEST’ redefinition’, don’t ignore it. Inspect your project’s RC file – and specifically the RT_MANIFEST section. If it doesn’t contain any meaningful dependencies – erase it and be done with it. If it does hold some assembly references that you decide you want to keep – migrate them to some modern means: either #pragma comment(linker) as the conversion log suggests (I can’t think of a scenario where that’s the better way, but who knows), or use the /MANIFESTDEPENDENCY linker switch – either at the command line or via project properties, at Linker\Manifest File\Additional Manifest Dependencies.

Whether you saw the warning or not, whether you did migrate manifest settings or not, the _VC80_UPGRADE macro is now just clutter. It arrived at your projects via an added property sheet reference, and the civilized way of clearing it is via View\Property Manager:

The screenshot is from VS2010, and of clearing some embarrassing upgrade clutter from VC6.0 (!). Oh well, better late.

Posted in VC++, Visual Studio | Leave a comment

g_dwLastErrorToBreakOn: Watching Errors on VS Revisited

Raymond Chen posted about SetLastError recently, and an interesting discussion ensued. One comment in particular caught my eye:

The easiest way to catch a specific last error value in debugger is to set ntdll!g_dwLastErrorToBreakOn to that value.

A good while back I needed to break when such a LastError is set, and dug up all sorts of hacks to do so – breaking at SetLastError, setting data breakpoint on the thread-env-block-error, and the like. Beyond being cumbersome and plain ugly such breakpoint tricks can be very slow, and give a lot of false positives.

Seems the Win32 folks had similar needs and I was glad to discover they formed a better (undocumented, but still) solution. The authoritative source seems to be a 2007 post from Microsoft’s Daniel Pearson:

..Hiding inside of kernel32’s address space is a global variable called g_dwLastErrorToBreakOn. It turns out that SetLastError checks the value of this variable and if it’s non-zero, calls DbgBreakPoint if the two [values] match.

It’s a zero-overhead trick, and is very easy to do in Visual Studio: make sure kernel32.dll symbols are loaded, then type in a watch window –

(int*){,,kernel32.dll}_g_dwLastErrorToBreakOn

– and edit the referenced int:

image

Pearson notes two changes introduced in Vista:

(1) Up until XP, only Win32 API implemented in KERNEL32.DLL actually used SetLastError (and so tested g_dwLastErrorToBreakOn) – other dll’s used to set the error value via RtlSetLastWin32Error. Since Vista all Win32 API which set an error do so with SetLastError, so the g_dwLastErrorToBreakOn is much more reliable.

(2) Since Vista, g_dwLastErrorToBreakOn moved to NTDLL.DLL, so the VS usage should be changed to –

(int*){,,ntdll.dll}_g_dwLastErrorToBreakOn

It’s interesting to note that ntdll.dll contains a separate instance of g_dwLastErrorToBreakOn on XP machines as well:

image

But I verified that this value is never read, on calls into both kernel32 and ntdll.
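The check Pearson describes can be modelled roughly as below. This is a toy reconstruction from his description, not the actual kernel32/ntdll code: all names are hypothetical, a counter stands in for the DbgBreakPoint call, and the real last-error value is per-thread rather than a plain global.

```cpp
#include <cassert>

namespace toy {

unsigned long g_lastErrorToBreakOn = 0;  // 0 means "never break"
unsigned long g_lastError = 0;           // stand-in for the per-thread value
int g_breakCount = 0;                    // stand-in for calling DbgBreakPoint

// Models the described behaviour: compare against the global, then store.
void set_last_error(unsigned long code) {
    if (g_lastErrorToBreakOn != 0 && code == g_lastErrorToBreakOn) {
        ++g_breakCount;  // the real implementation would break into the debugger
    }
    g_lastError = code;
}

}  // namespace toy
```

Editing the variable from the watch window corresponds to assigning toy::g_lastErrorToBreakOn here: the only cost on the normal path is a single compare.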

Posted in Debugging, Win32 | 3 Comments

Debugging Reference Count – Part 1

I recently dealt with a large memory leak that turned out to be a delicate reference count issue. It is a common debugging scenario, and I’ll be sharing here some suggestions about it.

First I had to isolate the leaking object. It so happened the leaking allocation was in a consistent place, so setting _crtBreakAlloc was enough. Seeing that it was ref-counted, my first thought was to break on AddRef and Release, with a condition for ‘this’ to equal the object address:

image

This actually works, but is excruciatingly slow since any sane AddRef is templated and instantiated many times. A better method is to set a data breakpoint on the reference count itself:

image

(God knows why this BP is split into children. I guess I was quick to say this is resolved in VS2010).
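For readers without a ref-counted class at hand, here is a minimal sketch of the kind of object involved. RefCounted and RefCountAddress are hypothetical names for illustration, not the class from the leak described here, and the counter is deliberately not thread-safe:

```cpp
#include <cassert>

// Minimal intrusive reference count, COM-style.
class RefCounted {
public:
    long AddRef() { return ++m_refs; }

    long Release() {
        assert(m_refs > 0);
        const long remaining = --m_refs;
        if (remaining == 0) delete this;  // last reference gone
        return remaining;
    }

    // Address of the counter: this is what a data breakpoint would watch
    // (sizeof(long) bytes), instead of breaking inside AddRef/Release.
    const long* RefCountAddress() const { return &m_refs; }

protected:
    virtual ~RefCounted() = default;  // destroyed only through Release()

private:
    long m_refs = 0;
};
```

A data breakpoint on the address returned by RefCountAddress() fires only for this one object, no matter which AddRef or Release instantiation touches it.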

If the debugged scenario is non trivial, you might prefer to not actually break, but rather dump stacks on any ref count change and match the stacks offline. Enter tracepoints:

image

Can we enhance the stack dumps with the current refcount value? You bet. Just add the contents of (‘{ }’) the refcount (0x037a17a4 in this case):

image

In my case the dumps held ~3000 stacks, so manual analysis was out of the question. Now I knew the leak was limited to a certain usage scenario, so I took trace dumps in leaking and non-leaking scenarios, saved both to files, diff’ed them – and the offending stack popped out immediately.

Next time – a similar trick in a different scenario.

Posted in Debugging, VC++ | Leave a comment

The Case of the ‘X’ That Didn’t Kill the App

One of our MFC apps recently had a weird bug: occasionally debug builds would result in a binary where the ‘X’ corner button killed the app window but not the app – it would just sit idle indefinitely until killed from the task manager. I found no similar cases online and figured the investigation was worth sharing.

The immediate cause was quick to isolate – it was located in wincore.cpp, at CWnd::OnNcDestroy() :

// WM_NCDESTROY is the absolute LAST message sent.
void CWnd::OnNcDestroy()
{
  // cleanup main and active windows
  CWinThread* pThread = AfxGetThread();
  if (pThread != NULL)
  {
    if (pThread->m_pMainWnd == this)
    {
      if (!afxContextIsDLL)
      {
        // shut down current thread if possible
        if (pThread != AfxGetApp() || AfxOleCanExitApp())
          AfxPostQuitMessage(0);
      }
      pThread->m_pMainWnd = NULL;
    }
    if (pThread->m_pActiveWnd == this)
      pThread->m_pActiveWnd = NULL;
  }
  ...
}

In the problematic builds, when OnNcDestroy was called for the main window of the main thread, afxContextIsDLL would always evaluate to true, so AfxPostQuitMessage was never called.

afxContextIsDLL is defined in afxwin.h :

#define afxContextIsDLL     AfxGetModuleState()->m_bDLL

The investigation that followed was not so quick, as any reference to MFC module states is a lid of a hefty can of worms.  Most of the MODULE_STATE apparatus is implemented in afxstate.cpp, and specifically:

AFX_MODULE_STATE* AFXAPI AfxGetModuleState()
{
  _AFX_THREAD_STATE* pState = _afxThreadState;
  ENSURE(pState);
  AFX_MODULE_STATE* pResult;
  if (pState->m_pModuleState != NULL)
  {
    // thread state's module state serves as override
    pResult = pState->m_pModuleState;
  }
  else
  {
    // otherwise, use global app state
    pResult = _afxBaseModuleState.GetData();
  }
  ENSURE(pResult != NULL);
  return pResult;
}

_afxThreadState has static access, is a not-so-thin wrapper around some thread-local storage, and implements a non-trivial operator=() – so it is not a simple matter of placing a data breakpoint and seeing who modifies it. Travelling the code gave several direct modification paths (there is in fact an AfxSetModuleState) but most of the action seemed to be in the construction of global THREAD_STATE objects, very early in the app lifecycle. Things got hairy.

On my way home a certain suspicion started to arise.

I arrived at the office the next day, and indeed:

doublemodules

The app linked against both debug and retail versions of the MFC runtime dll.

This entails probably thousands of violations of the One Definition Rule (ODR), and even worse – by the C++ standard such violations do not need to be reported by the linker!

Each of the two MFC DLLs maintains its own MODULE_STATE. During link, two matches for AfxSet/GetModuleState are seen by the linker (one in each MFC version). Which m_bDLL is altered depends on which version of AfxSetModuleState is resolved. Which m_bDLL state is seen by CWnd::OnNcDestroy() depends on which version of AfxGetModuleState() is resolved!

I paused the app and used the context operator to verify at the watch window:

{,,mfc80d.dll}(*(AfxGetModuleState())).m_bDLL                1

{,,mfc80.dll}(*(AfxGetModuleState())).m_bDLL                   0

The two MFC dll versions (debug and release) see different states of the module.

While the investigation so far did not reveal the explicit flow that generated the inconsistency, obviously the erroneous dependency on the release MFC dll (in debug builds) must be removed.

I rebuilt the solution with /VERBOSE. The output showed a seemingly innocent static lib, written over 3 years ago, that linked against mfc80.dll in debug builds for no apparent reason. A look into the lib properties showed that in debug builds only it was marked to ignore all default libs… This erroneous switch was removed and the problem has not appeared since.

Bottom Lines

If you experience the same symptoms, here’s one thing to try:

  1. Check at the Modules window for dependencies on both debug and release versions of MFC or another runtime component.
  2. If any are found, use /VERBOSE to track the unexpected dependency.
  3. This dependency might be explicit, or generated via /NODEFAULTLIB. Either way, remove it!
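Step 1 can be automated once you have a list of loaded module names (copied from the Modules window, say). The helper below is a hypothetical sketch: it relies on the naming heuristic that the debug flavour of a runtime DLL carries a trailing 'd' (mfc80.dll vs. mfc80d.dll), which can produce false positives for unrelated DLLs that happen to follow the same pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Lower-cases a module name so comparisons ignore case, as Windows does.
inline std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns (release, debug) pairs such as ("mfc80.dll", "mfc80d.dll")
// when both flavours appear in the given module list.
inline std::vector<std::pair<std::string, std::string>>
find_mixed_runtimes(const std::vector<std::string>& modules) {
    std::vector<std::string> names;
    for (const auto& m : modules) names.push_back(to_lower(m));

    std::vector<std::pair<std::string, std::string>> pairs;
    const std::string ext = ".dll";
    for (const auto& name : names) {
        const bool is_dll = name.size() > ext.size() &&
            name.compare(name.size() - ext.size(), ext.size(), ext) == 0;
        if (!is_dll) continue;
        // Heuristic: debug flavour = same stem with a trailing 'd'.
        const std::string debug_name =
            name.substr(0, name.size() - ext.size()) + "d" + ext;
        if (std::find(names.begin(), names.end(), debug_name) != names.end())
            pairs.emplace_back(name, debug_name);
    }
    return pairs;
}
```

Any pair it reports is a candidate for step 2, tracking down the unexpected dependency with /VERBOSE.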
Posted in Debugging, VC++ | 1 Comment

AfxIsValidAddress (and Others) Don’t Work as Advertised

MFC exposes some memory debugging facilities such as AfxIsValidAddress, which (for debug builds) supposedly

Tests any memory address to ensure that it is contained entirely within the program’s memory space.

Or does it?   AfxIsValidAddress only delegates the call to the undocumented ATL::AtlIsValidAddress, which reads:

// Verify that a pointer points to valid memory
inline BOOL AtlIsValidAddress(const void* p,
                  size_t nBytes, BOOL bReadWrite = TRUE)
{
      (bReadWrite);
      (nBytes);
      return (p != NULL);
}

The first two lines are just no-ops to silence compiler warnings about unused parameters. The promised verification that p is-contained-entirely-within-the-program’s-memory-space, amounts to the test that p is not null.

AfxIsValidString likewise does not keep its word and tests only if the input string is non-null:

inline BOOL AtlIsValidString(LPCSTR psz,
                          size_t nMaxLength = UINT_MAX)
{
      (nMaxLength);
      return (psz != NULL);
}

AfxAssertValidObject  is undocumented, and runs a bunch of redundant AfxIsValidAddress-es. CObject::AssertValid is documented (‘performs a validity check on this object by checking its internal state’!!), and is just as disappointing:

void CObject::AssertValid() const
{
     ASSERT(this != NULL);
}

The reason is probably that somewhere along the way MS realized there is no sane way to keep such promises.

There are several Win32 APIs similar in functionality: IsBadWritePtr, IsBadHugeWritePtr, IsBadReadPtr, IsBadHugeReadPtr, IsBadCodePtr, IsBadStringPtr. It has been known since at least 2004 that these functions are broken beyond repair and should never be used. The almighty Raymond Chen and Larry Osterman both discuss the reasons in detail, so just a short rehash: IsBad*Ptr all work by accessing the tested address and catching any thrown exceptions. Problem is that a certain few of these access violations (namely, those on stack guard pages) should never be caught – the OS uses them to properly enlarge thread stacks. In the words of Michael Howard, who first realized this (or so I think – both Larry and Raymond attribute the CrashMyApplication nicknames to him):

You should also not catch all exceptions, but only types that you know about. Catching all exceptions is just as bad as using IsBad*Ptr.

I’m guessing that AfxIsValidAddress – and sisters – worked the same way, until someone realized they too were probably generating more debugging effort than they were saving. However, while the Win32 guys decided to leave their API semantics as is and clearly document them as obsolete and dangerous, the MFC guys turned their API into no-ops and did cleanup work that cannot be described as anything but sloppy. Not only is the documentation wrong, source comments are too:

// AfxIsValidAddress() returns TRUE if the passed parameter points
// to at least nBytes of accessible memory. If bReadWrite is TRUE,
// the memory must be writeable; if bReadWrite is FALSE, the memory
// may be const.

BOOL AFXAPI AfxIsValidAddress( ...

– and even the MFC sources themselves still contain many naive uses of these no-op tests (search around, you’ll find plenty).

Bonus: What If You Really Have To Test Memory for Validity?

This is after all not so far fetched. Years ago I had to work around a bug in nVidia GPU drivers, where occasionally some API (namely IDirect3DVolumeTexture9::LockBox, applied on a huuuuuuuge texture) returned success but gave a bogus memory buffer. The workaround was to VirtualQuery the address, and if not accessible – retry the process on a smaller volume texture. It went something like –

...
D3DLOCKED_BOX box;
HRESULT res = pVolumeTexture->LockBox(0, &box, NULL,
                                     D3DLOCK_DISCARD);
_ASSERT(res==D3D_OK  && box.pBits);
// this should have sufficed, but...

MEMORY_BASIC_INFORMATION meminf;
// VirtualQuery returns 0 on failure, in which case meminf is not filled in
SIZE_T queried = VirtualQuery(box.pBits, &meminf, sizeof(meminf));

BOOL  bOk = (queried != 0) && (meminf.State == MEM_COMMIT)  &&
      ( 0 !=  (meminf.Protect &
      ( PAGE_READWRITE | PAGE_WRITECOPY | PAGE_EXECUTE_READWRITE)));

if (!bOk)
 //fallback
...

Of course VirtualQuery is innately sluggish, so this is completely unacceptable as mere parameter validation. Use this only when you know you have to, not just to be on the safe side.

Posted in MFC, Win32 | Leave a comment

Child Breakpoints in Visual Studio

You often see in the breakpoints window that certain breakpoints are expandable:

bpwin1

These are called child breakpoints, and are a strong contender for the title of most poorly documented feature of VS. According to MSDN, child breakpoints occur –

…when you set a breakpoint in a location such as an overloaded function…

…when you attach to more than  one instance of the same program.

And I’m not sure what they mean by either. AFAIK, a single debugger cannot attach to more than one instance of a program, and different overloads of the same functions reside in different code lines – so they cannot possibly count for child breakpoints.

It’s reasonable to assume that child breakpoints are generated whenever a breakpoint is set in a line of code  that is compiled into multiple binary locations. I’m aware of two such scenarios – but there may be more:

  1. Templated function/method instantiated with different arguments,
  2. A function in a static library linked into multiple executables (exe, dll) in a single solution.
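Scenario (1) is easy to see in a small experiment. In the sketch below (function names are made up for illustration), a single source line inside a template ends up at a different code address for each instantiation, which is presumably what the debugger reports as child breakpoints:

```cpp
#include <cassert>
#include <cstdint>

// One source line, but the compiler emits a separate function body
// for every type the template is instantiated with.
template <typename T>
T twice(T value) {
    return value + value;  // a breakpoint here maps to several code locations
}

// Returns the code address of one particular instantiation.
template <typename T>
std::uintptr_t code_address_of_twice() {
    T (*fn)(T) = &twice<T>;
    return reinterpret_cast<std::uintptr_t>(fn);
}
```

Comparing code_address_of_twice<int>() with code_address_of_twice<double>() gives two distinct addresses. In a large code base the same effect is multiplied by every type a smart pointer or container is used with.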

Perhaps by ‘overloaded function’ MSDN means (1) and by ‘more than one instance of the same program’ they mean (2)?  I sure hope not.  Also, the ever generous John Robbins confirmed the 1st case by email.

It’s good to be aware of this feature because it does have debugging performance implications – in a large project such as the one I work on, setting a breakpoint in (e.g.) smart pointer code results in spreading INT 3 instructions over thousands of instantiations – occasionally causing full minutes of IDE stall. There are often better alternatives – maybe more on that in a future post.

To make things even more interesting, in VS2005 this feature was badly bugged: more often than not a breakpoint was split into children for no apparent reason. This seems to be fixed in VS2010.

Posted in Debugging, Visual Studio | 1 Comment

‘Frames below may be incorrect’, or: Stack Walking Requires Symbols

Here’s the symptom – you stop and inspect a stack:

stack_dlg1_cut

Note the message at the second line:

Frames below may be incorrect and/or missing. No symbols loaded for XXXXX.dll.

Chances are you read it once years ago and ignored it ever since.  If so, that’s a shame – because the debugger is dead serious about it.

You load some missing symbols (either from the modules window, or by right clicking a line on the stack window), and the stack changes. Often, the code displayed (topmost frame where code is available) is seen to be misleading – it’s nowhere to be found on the updated stack!

stack_Dlg2_cut

This happens a lot, e.g. when uncaught exceptions are thrown outside your own code, when you pause execution, or when switching to a different thread while stopped at a breakpoint.  Essentially – it happens whenever stack walking has to start from an optimized module without loaded symbols (in particular, MS modules like ntdll above).

First Corollary

Loading MS-symbols is NOT optional!

Hopefully every dev in the civilized world knows what these are and how to get them (MS made it considerably easier since VS2008), so I will not rehash. What is not widely obvious, is that without them there’s a good chance you’re looking at wrong stacks.

Second Corollary

Apparently stack walking is harder than it seems, and depends in some way on debug information.

That was news to me, and called for some research.

Naive Stack Walking

[A full x86-stack-layout tutorial is an undertaking way beyond the extent of my spare time. Try e.g. here for a nice read. The following is a very rough and minimal description]

Ignoring most stack content (function parameters, local variables, calling conventions, exception handling, buffer security and such), a naive layout of an x86 stack frame is something like –

Memory address:   Stack elements:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 0x104FD8   |  | Parameters                 | \
            |  +----------------------------+  |
 0x104FD4   |  | Return address, routine 2  |  |
            |  +----------------------------+   >  Stack frame 3
 0x104FD0   +--| EBP value for routine 2    |  |
               +----------------------------+  |
 0x104FCC   +->| Local data                 | /  <-- Routine 3's EBP
            |  +----------------------------+
 0x104FC8   |  | Return address, routine 3  | \
            |  +----------------------------+  |
 0x104FC4   +--| EBP value for routine 3    |  |
               +----------------------------+   >  Stack frame 4
 0x104FC0      | Local data                 |  | <-- Current EBP
               +----------------------------+  |
 0x104FBC      | Local data                 | /
               +----------------------------+
 0x104FB8      |                            |    <-- Current ESP
                \/\/\/\/\/\/\/\/\/\/\/\/\/\/

Where the Extended Base Pointer slots store the contents of the register that marks the start of a stack frame. So obtaining a stack trace should be as simple as

  1. Walking the chain of EBPs (each EBP slot on a stack frame points to just below the EBP slot of its parent frame) to form the stack of frames.
  2. From above each EBP slot, take the stored EIP and use it to decipher the calling module and the calling function name.
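The two steps can be sketched over a mock stack. This is only an illustration, not what StackWalk64 does: the memory is a map from address to 32-bit value, and it assumes the conventional prologue (push ebp / mov ebp, esp), where EBP points at the saved-EBP slot and the return address sits 4 bytes above it, so the slot positions are drawn slightly differently from the diagram above. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Addr = std::uint32_t;
using MockStack = std::map<Addr, Addr>;  // address -> 32-bit value stored there

// Follows the chain of saved EBP values and collects the return addresses.
std::vector<Addr> walk_ebp_chain(const MockStack& mem, Addr ebp,
                                 std::size_t max_frames = 64) {
    std::vector<Addr> return_addresses;
    while (ebp != 0 && return_addresses.size() < max_frames) {
        const auto saved_ebp = mem.find(ebp);      // [ebp]   = caller's EBP
        const auto ret_addr  = mem.find(ebp + 4);  // [ebp+4] = return address
        if (saved_ebp == mem.end() || ret_addr == mem.end()) break;
        return_addresses.push_back(ret_addr->second);
        // The stack grows down, so a parent frame must live at a higher address.
        if (saved_ebp->second <= ebp) break;
        ebp = saved_ebp->second;
    }
    return return_addresses;
}
```

Mapping each collected return address to a module and function name is step (2), and that is where symbols or map files come in. Note that the walk itself needs nothing but the EBP chain, which is exactly why a dependency on symbols is surprising.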

Phase (2) is indeed best done with symbols, but  can be achieved with some success (exported functions only) with only linker map files. Either way, this description cannot account for the dependency of the stack-frames partition itself upon the presence of symbols! There has to be more to the story.

First Exception – FPO

Ever since the Intel 386, both ESP and EBP can serve as reference points for local stack variables, and faced with the scarcity of registers on x86 systems it was a tempting optimization to drop EBP dedicated usage altogether. This is called Frame Pointer Omission, and indeed mandates dedicated stack-walking assistance info in the PDB – as the traditional EBP chain breaks completely once a single stack frame uses FPO. However, FPO has been rarely used in practice and completely disabled in MS builds since Vista, so it cannot possibly account for all occurrences of bad stack traces.

Other Exceptions?

Well, probably – but not documented ones. I have several reasons to believe so:

(1) StackFrameTypeEnum (used by IDiaStackFrame) indeed includes FrameTypeStandard and FrameTypeFPO, but also other frame types (notably FrameTypeFrameData).

(2) The VC team blog did hint that a PDB includes, quote,

the unwind program to execute to walk to the next frame

(3) I’ve witnessed these symptoms on dumps taken from Win7 machines – and AFAIK Win7 binaries were compiled without FPO.

(4) The venerable John Robbins answered an email of mine, saying:

Yes, unless you have all symbols loaded for a native application, you can never truly trust the stacks. You’re right that MSFT no longer uses FPO, but if the symbols for a DLL in the stack are not loaded, the StackWalk64 API, which all debuggers use, goes through heuristics to walk the stack. Heuristics is a fancy name for guessing. :) For example, if you don’t have the PDB files loaded for a module, the stack walking code will show you the closest symbol to an address. That symbol could actually come from another DLL. Once you load the PDB file, the correct symbol is available so the address symbol will change in the call stack window.

Can’t say I really got to the root of this dependence on PDBs, though. If anyone out there cares to shed more light over this, I’d love to hear!

Posted in Debugging, Visual Studio | 5 Comments