Accelerating Debug Runs, Part 2: _ITERATOR_DEBUG_LEVEL

A previous post discussed the Windows Debug Heap – with the main motivation being how to avoid it, as it is just empty, expensive overhead, and it isn’t clear why it is on by default in the first place. Remarkably, 3 weeks after posting here the VC team announced that from VC "14" (now in CTP) onwards the WDH will be opt-in and not opt-out, as it should be. So hopefully in ~2 years the recommendations in the last post will be largely obsolete. Anyway, I’m often facing unworkably slow run times in debug even after disabling the WDH, and further measures are in order.

In debug builds the VC++ standard-library implementation runs a hefty load of iterator-validation tests on all STL containers – the simplest example being an assertion raised when an std::vector subscript is out of range. This leads to the unfortunate reality wherein someone who writes C++ code ‘by the book’, using standard containers for everything, often ends up with code that is utterly unusable in debug. In one of our projects, image classes were coded with std::vector as the container for the image bits. The product code iterates intensively over the pixels of many images, and as a result debug builds completed a typical job in ~4 hours, whereas a release build completed it in ~4 minutes. For a long while debugging that project was reduced to logging and stepping through disassembly, as debug builds were completely useless.

Now for some good news and some bad news. The good news is that this behavior is customizable via the _ITERATOR_DEBUG_LEVEL macro: #define it to 0 (or 1, if you have a particular need for it) early in the compilation – say in the project properties or at the top of a precompiled header – and this disproportional computational overhead is gone.
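For reference, a minimal sketch of the intended usage – the define has to be seen before any standard header, and has to be identical across every translation unit and static library you link:

// At the very top of the precompiled header (or passed project-wide as /D_ITERATOR_DEBUG_LEVEL=0),
// before any standard header is included:
#define _ITERATOR_DEBUG_LEVEL 0   // 0 = no checks, 1 = basic checks, 2 = full iterator debugging (the debug default)

#include <vector>
#include <algorithm>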

The bad news is that this doesn’t work.

MSVCMRTD.lib(locale0_implib.obj) : error LNK2022: metadata operation failed (8013118D) : Inconsistent layout information in duplicated types (std.basic_string<char,std::char_traits<char>,std::allocator<char> >): (0x0200004e).

Well, that was a tad dramatic – it doesn’t always work, and /clr builds in particular are where it breaks.

<rant>Now /clr projects will probably forever be second class citizens in the VC universe. Features will forever be coded for mainstream native code and will trickle down to /clr code as time and priorities permit (two notable examples are data breakpoints and debugger visualizers that are still unsupported in mixed debugging, but trust me – there are plenty more). </rant> Anyhow, as far as _ITERATOR_DEBUG_LEVEL goes – this is much more a bug than something resembling a decision. The venerable Stephan T. Lavavej elaborates (3rd reply from the bottom):

…The underlying problem is that _ITERATOR_DEBUG_LEVEL affects the representations of STL containers, and C++ (both native and especially managed) really hates it when code can’t agree on the representation of an object.  When _SECURE_SCL/_HAS_ITERATOR_DEBUGGING were added in VC8, we should have created 5 variants of the CRT/STL binaries (including DLLs).  Unfortunately we didn’t (this was before my time, otherwise I would have spoken up), and having only debug and release DLLs causes headaches.  We suffered from longstanding problems in VC8/9 until we overhauled how this worked in VC10.  During VC10 we untangled the worst of the problems by making std::string header-only.  With invasive surgery we were able to get native code working correctly in every case except one very obscure one that nobody has noticed or complained about yet.  (We now have 5 static libs, which solves the case of static linking absolutely 100% perfectly, but still only 2 DLLs.)  But managed code is structured differently, and the tricks that work in native don’t work for it.  As a result, customizing _ITERATOR_DEBUG_LEVEL basically doesn’t work under /clr[:pure].  Very few customers have encountered this (you’re one of the first) because we changed the release mode default to IDL=0 (which everyone wants), and few people want to modify debug mode’s default of IDL=2.

The thread is from Jan 2011 and this particular issue was resolved in VS2013. Similar issues remain, though, and I’m not sure /clr code will ever make it into routine test matrices at MS – so as the CRT code evolves, these issues will probably keep popping up.

Bottom line: if – like me – you’re debugging C++ code that is both managed and makes extensive use of STL – your mileage may seriously vary when trying to customize iterator debug level. If you do develop purely native code and are trying to accelerate debug runs, I do recommend judiciously setting _ITERATOR_DEBUG_LEVEL to zero – and raising it back only when you’re tracking concrete iterator issues.
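One quick sanity check I’d suggest (a sketch – the macro is defined by every VC++ standard header, so any translation unit can verify it):

#include <vector>   // any VC++ standard header defines _ITERATOR_DEBUG_LEVEL

static_assert(_ITERATOR_DEBUG_LEVEL == 0,
              "_ITERATOR_DEBUG_LEVEL is not 0 - iterator debugging is still active");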

In the same reply Stephan offers an alternative:

Have you considered making your “debug build” compile in release mode with no optimizations?  Release mode versus debug mode affects what CRT/STL/etc. you link to (and whether you can effectively debug into them) and as a side effect affects your IDL default, but it’s not inherently tied to whether your own code is compiled with optimizations or not, and that’s what affects the debuggability of your own code.  The IDE pairs release mode with optimizations and debug mode without optimizations, but there’s no fundamental reason linking the two.

This makes sense, and I briefly experimented with the approach. Sorry to report: still no success. While I can’t pinpoint the root cause, even when compiling release builds with /Od (optimizations disabled) the debugging experience is severely crippled (and yes, of course I enabled the proper PDB-generation switches in both the compiler and the linker). Local-variable watches and single-stepping seemed highly erratic, the ‘this’ pointer for class methods seemed to stick to $rcx throughout each method and thus gave rubbish in member watches – etc., etc.

However, this is a step in a better direction. More on that in the next post.

Accelerating Debug Runs, Part 1: _NO_DEBUG_HEAP

(A more appropriate but even-less-catchy title might have been ‘accelerating runs from the debugger‘. As elaborated below, these two are not strictly equal).

A common notion is that debug builds can and should carry as much debugging overhead as one can possibly cram in – after all, the point of debug builds is exactly that, debugging, and you should never care about their performance. After too many cases of slow-to-the-extent-of-utterly-unworkable builds, I respectfully disagree. In this and the next post, a few techniques to make debug builds run faster are laid out.

Introducing the Windows Debug Heap

As many, many others have already discovered – the WDH is a big deal as far as performance goes, and yet MSDN is unusually terse about it. The HeapSetInformation page says:

When a process is run under any debugger, certain heap debug options are automatically enabled for all heaps in the process. These heap debug options prevent the use of the LFH. To enable the low-fragmentation heap when running under a debugger, set the _NO_DEBUG_HEAP environment variable to 1.

And in some arcane corner of the WinDBG documentation:

Processes that the debugger creates (also known as spawned processes) behave slightly differently than processes that the debugger does not create.

Instead of using the standard heap API, processes that the debugger creates use a special debug heap. You can force a spawned process to use the standard heap instead of the debug heap by using the _NO_DEBUG_HEAP environment variable or the -hd command-line option.

(While the latter was written for WinDbg, everything except the -hd switch holds equally for VS).

What are these ‘certain heap debug options’? What is the price in performance? Can the WDH be avoided altogether? Stay tuned.

Creating and Avoiding the WDH

The debugger itself calls IDebugClient5::CreateProcess2 which creates a debuggee process with WDH by default, and in general – I suspect – has more power over the debuggee than a normal CreateProcess would. The WDH creation can be bypassed by specifying DEBUG_CREATE_PROCESS_NO_DEBUG_HEAP in the options argument, and the MS debuggers do exactly that when the aforementioned environment variable _NO_DEBUG_HEAP exists and is set to 1.
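For illustration only, here’s a rough sketch of what that call might look like through dbgeng.h (untested; "Debuggee.exe" is a hypothetical target, and error handling is omitted):

#include <windows.h>
#include <dbgeng.h>
#pragma comment(lib, "dbgeng.lib")

int main()
{
    IDebugClient5* client = nullptr;
    if (FAILED(DebugCreate(__uuidof(IDebugClient5), (void**)&client)))
        return 1;

    DEBUG_CREATE_PROCESS_OPTIONS opts = {};
    opts.CreateFlags = DEBUG_ONLY_THIS_PROCESS              // debug just the spawned process
                     | DEBUG_CREATE_PROCESS_NO_DEBUG_HEAP;  // and skip the Windows Debug Heap

    char cmdLine[] = "Debuggee.exe";
    HRESULT hr = client->CreateProcess2(0, cmdLine, &opts, sizeof(opts), nullptr, nullptr);

    client->Release();
    return SUCCEEDED(hr) ? 0 : 1;
}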

(Note: somehow I suspect CreateProcess with the DEBUG_PROCESS flag is not enough for the created debuggee to run under the WDH, but I didn’t check).

You can set this environment variable either globally for the machine (as I do) or for a specific debug session, via the project properties’ Debugging page (the Environment field accepts entries like _NO_DEBUG_HEAP=1):

What the WDH Does

  1. The only documented effect is disabling the LFH – which makes sense, as these are different heap layouts and thus are mutually exclusive. You do lose some speedups by dropping the LFH but by and large this is a negligible factor compared to the others.
  2. On every allocation the memory manager initializes every allocated DWORD to 0xbaadf00d, and on every deallocation sets the memory to 0xfeeefeee – in addition to some bookkeeping just after the allocated chunk. Here’s the normal view:

And here’s the view with _NO_DEBUG_HEAP=1:

These magic numbers can help in some debugging scenarios – use of uninitialized heap memory, and usage after free – but truth be told, they rarely do. Here are some more details. Most of the extra time, however, is not spent there.
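To see the allocation fill pattern for yourself, here’s a small sketch – it allocates straight from the Win32 heap (so the CRT debug heap’s own fills are not involved) and prints the first DWORD; run it under the debugger and then with _NO_DEBUG_HEAP=1 to compare. The freed-memory fill (0xfeeefeee) is easiest to observe in the debugger’s memory window right after the HeapFree call.

#include <windows.h>
#include <stdio.h>

int main()
{
    // Raw Win32 allocation - the CRT debug heap (and its 0xcdcdcdcd fill) is bypassed
    DWORD* p = (DWORD*)HeapAlloc(GetProcessHeap(), 0, 64);
    printf("fresh allocation starts with: 0x%08lx\n", p[0]);  // expect 0xbaadf00d under the WDH,
                                                              // arbitrary garbage without it
    HeapFree(GetProcessHeap(), 0, p);                         // inspect *p in the memory window here
    return 0;
}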

  3. On every memory operation, the WDH walks the heap and checks for integrity! To observe, add some corruption:
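If you’d like a concrete repro, something along these lines should do (a sketch – raw Win32 allocation, so only the OS-level heap checks are in play):

#include <windows.h>

int main()
{
    BYTE* p = (BYTE*)HeapAlloc(GetProcessHeap(), 0, 16);
    p[16] = 0xAA;                      // write one byte past the end of the allocation
    HeapFree(GetProcessHeap(), 0, p);  // with the WDH on, the integrity check should fire here
    return 0;
}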

And run:

Now run again with _NO_DEBUG_HEAP set to 1 – and watch the assertion vanish.

Err, this stuff actually sounds useful. Are you sure I should disable it?

For regular C++ applications – beyond a doubt, yes.

The CRT delivers identical functionality, on top of the Windows debug heap, with different magic numbers: 0xcdcdcdcd for fresh allocations and 0xdddddddd for freed memory. If you leave the WDH on you’re initializing memory chunks – and worse, checking heap integrity – twice for each allocation. In regular development scenarios the WDH is just empty, very expensive overhead.

By ‘regular’ C++ programs I mean those that don’t do anything fancy with the heap and just stick to the built-in CRT heap. You can overload new/delete, as long as your overloads eventually call the shipped new/delete/malloc/free, or their _dbg/_aligned siblings.
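To make this concrete, here’s a minimal sketch of a global overload that still counts as ‘regular’ in this sense:

#include <cstdlib>
#include <new>

// Custom operators that ultimately land on the CRT heap - all the CRT debug-heap
// machinery (and the advice above) still applies.
void* operator new(std::size_t size)
{
    if (void* p = std::malloc(size))
        return p;
    throw std::bad_alloc();
}

void operator delete(void* p)
{
    std::free(p);
}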

One potential argument in favour of leaving the WDH on is that, unlike the CRT debug heap, the WDH is operational in release builds also. But (1) it is disabled for any launch outside a debugger anyway, and (2) in the extremely unlikely case that you require memory-integrity checks but don’t want to run a debug build, I would suggest just editing your debug configuration to include optimizations (add /O2).

Oh, and in our applications setting _NO_DEBUG_HEAP=1 accelerated some runs by a factor of 10. Nough said.


Edit (Oct 6 2014):

Remarkably, 3 weeks after initially publishing this post it seems the VC team themselves agree. Beginning with VS "14", the WDH will be opt-in, not opt-out – as it ought to be.

Debugging Memory Corruption II

Some years ago I shared a trick that lets you call _CrtCheckMemory from the debugger anywhere, without re-compilation. The updated (as of VS2013) string to type in a watch window is:

{,,msvcr120d.dll}_CrtCheckMemory()

Let’s expand on that today, in two steps.

Checking memory on every allocation

The CRT heap accepts a neat little flag called _CRTDBG_CHECK_ALWAYS_DF. Here’s how it’s used:

#include <crtdbg.h>

int main()
{
    // Get current flag
    int tmpFlag = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);

    // Turn on corruption-checking bit
    tmpFlag |= _CRTDBG_CHECK_ALWAYS_DF;

    // Set flag to the new value
    _CrtSetDbgFlag(tmpFlag);

    int* p = new int[100];  // allocate,
    p[101] = 1;             // corrupt,    and...

    int* q = new int[100];  // BOOM! alarm fires here
}

Testing for corruption on every allocation can tangibly slow down your program, which is why the CRT allows testing only every N allocations, N being 16, 128 or 1024.  Usage adds half a line of code – pasted from MSDN:

// Get the current bits
tmp = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);

// Clear the upper 16 bits and OR in the desired frequency
tmp = (tmp & 0x0000FFFF) | _CRTDBG_CHECK_EVERY_16_DF;

// Set the new bits
_CrtSetDbgFlag(tmp);

Note that testing for corruption on every memory allocation is nothing like testing on every memory write – the alarm would not fire at the exact time of the felony, but since your software allocates memory (even indirectly) very often – this will hopefully help narrow down the crime scene quickly.
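When the every-N granularity is still too coarse you can also bisect manually; a sketch, with hypothetical stage functions standing in for your own code:

#include <crtdbg.h>
#include <cassert>

void ProcessStage1();   // hypothetical stages of your pipeline
void ProcessStage2();

void Run()
{
    ProcessStage1();
    assert(_CrtCheckMemory());   // heap still intact here...
    ProcessStage2();
    assert(_CrtCheckMemory());   // ...or did stage 2 trash it?
}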

Checking memory on every allocation – from the debugger

You might reasonably want to enable/disable these lavish tests at runtime.

The debug flags are stored in {,,msvcr120d}_crtDbgFlag, and the numeric value of _CRTDBG_CHECK_ALWAYS_DF is 4, so one might hope that these lines would enable and disable these intensive memory tests:

[screenshot: watch-window lines writing to {,,msvcr120d}_crtDbgFlag]

Alas, this doesn’t work – _CrtSetDbgFlag contains further logic that routes the input flags to internal variables. The easiest solution is to just call it:

[screenshot: watch-window calls to {,,msvcr120d.dll}_CrtSetDbgFlag]

The first two lines enable, the last two disable. If you’re running with non-default flags, the actual values you’d see might be different.
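For reference, assuming the debug-CRT default of _CRTDBG_ALLOC_MEM_DF (0x01) only, the calls would presumably look like this (your values may differ, per the caveat above):

{,,msvcr120d.dll}_CrtSetDbgFlag(0x05)   // 0x01 | 0x04: turn _CRTDBG_CHECK_ALWAYS_DF on
{,,msvcr120d.dll}_CrtSetDbgFlag(0x01)   // back to the default, checks off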

Debugging Handle Leaks

This is all well-documented stuff and I won’t go into details – it’s here mostly for self-reference (3rd time I had to chase this down on Google).

Steps are:

(1) Install WDK to integrate the WinDbg engine with VS (not strictly necessary, but very convenient).

(2) Attach to the debuggee via the ‘User Mode’ transport:

[screenshot: the Attach to Process dialog, with the user-mode WinDbg transport selected]

(3) Continue execution, and break at the spot where the handle count is at ‘reference’ value.

(4) At the ‘Debugger Immediate Window’ type ‘!htrace -enable’

(5) Continue execution and break at a point where the handle count is supposed to be at reference value but isn’t.

(6) At the ‘Debugger Immediate Window’ type ‘!htrace -diff’.

 

The offending stack[s] should be visible in the Debugger Immediate Window. If you get garbage, there’s a good chance you’re debugging a 32-bit process on a 64-bit machine.
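If you want to dry-run the procedure first, a deliberate leak like the following sketch gives ‘!htrace -diff’ something to point at (CreateEventW here is just an arbitrary handle source):

#include <windows.h>

void LeakyFunction()
{
    // Created and never closed - '!htrace -diff' should name this call stack
    HANDLE h = CreateEventW(nullptr, FALSE, FALSE, nullptr);
    (void)h;   // the missing CloseHandle(h) is the bug
}

int main()
{
    Sleep(20000);   // time to attach and '!htrace -enable'
    for (int i = 0; i < 100; ++i)
        LeakyFunction();
    Sleep(20000);   // break here and '!htrace -diff'
    return 0;
}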

Setting a Watch on Wide Registers in VS

General-purpose registers can be watched from the watch window pretty much as regular variables:

[screenshot: general-purpose registers in the watch window]

(the ‘$’ prefix is optional, but is recommended by MS – probably as a means to minimize clashes with code variables.)

It is less known that you can set similar watches on SSE registers.  The direct approach doesn’t work:

[screenshot: a watch on xmm0 fails to evaluate]

- probably since the expression evaluator doesn’t have built-in 128-bit types. You can, however, set watches on specific portions of a wide register. First, set a watch on a single float, with ‘xmm00’-like syntax (the 2nd number indicates the 32-bit slot to watch):

[screenshot: watches on xmm00-style names]

And next, you can watch 64-bit portions as doubles, with ‘xmm0dh’-like syntax: ‘d’ stands for double, and l/h selects the low/high half to watch.

[screenshot: watches on xmm0dl/xmm0dh-style names]
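If you want something concrete to try the syntax on, here’s a trivial sketch – which xmm register the value actually lands in depends on the compiler, so check the Registers window first:

#include <emmintrin.h>

int main()
{
    __m128d v = _mm_set_pd(2.0, 3.0);   // two doubles packed into one xmm register
    v = _mm_mul_pd(v, v);               // break here and watch e.g. xmm0dl / xmm0dh
    return (int)_mm_cvtsd_f64(v);
}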

This syntax went non-official after VS2003 (!). Up until VS2012 you could also watch 32-bit portions of SSE registers as ints (some instructions use that) with ‘xmm0il’ syntax. This was mentioned in a Connect answer from 2009 – but broke in VS2013. From the VS2003 link it seems that sometime around 2003 you could set similar watches on MMX registers, with ‘mm00’-like syntax. I’ve never seen it work on any VS version I used. Maybe there’s a similar syntax that enables watching AVX registers, hiding somewhere? I don’t even have an AVX-enabled computer to guess on.


Update: the int watch (‘xmm0il’) syntax seems to be broken only for x64 builds.

 

VS2012 Migration #3: autoexp and NoStepInto Replacements

In the past I blogged quite a few times about two immensely useful albeit mostly-unofficial debugger features: watch modification via autoexp.dat, and step-into modification via the NoStepInto registry key. A long while ago I raised two suggestions at MS UserVoice, to invest in making these two semi-hacks into documented, supported features. The first suggestion got some traction and is officially implemented in VS2012. The 2nd suggestion went mostly ignored – but nevertheless, there’s a new and better – though still undocumented – way to skip functions while stepping.

NatVis files

The Natvis (native-visualizers) file format is the shiny new replacement for autoexp.dat. It is well documented, and although still quite rough around the edges – bugs are accepted and treated, which means that for the first time it is actually supported. The new apparatus comes with several design advantages:

  1. It seems to be better isolated and does not crash the IDE so much,
  2. New visualizer-debugging facilities are built in,
  3. Separate customized visualizers can be kept in separate files, allowing easier sharing (e.g., library writers can now distribute .natvis files with their libraries),
  4. Natvis files can be placed at per-user locations.

It isn’t that much fun rehashing the syntax – being official and all – but I will include here a custom MFC-containers natvis, similar to the autoexp section I shared a while back:

<?xml version="1.0" encoding="utf-8"?>
<AutoVisualizer xmlns="http://schemas.microsoft.com/vstudio/debugger/natvis/2010">
  <!--from afxwin.h -->
  <Type Name="CArray&lt;*,*&gt;">
    <AlternativeType Name="CObArray"></AlternativeType>
    <AlternativeType Name="CByteArray"></AlternativeType>
    <AlternativeType Name="CDWordArray"></AlternativeType>
    <AlternativeType Name="CPtrArray"></AlternativeType>
    <AlternativeType Name="CStringArray"></AlternativeType>
    <AlternativeType Name="CWordArray"></AlternativeType>
    <AlternativeType Name="CUIntArray"></AlternativeType>
    <AlternativeType Name="CTypedPtrArray&lt;*,*&gt;"></AlternativeType>
    <DisplayString>{{size = {m_nSize}}}</DisplayString>
    <Expand>
      <Item Name="[size]">m_nSize</Item>
      <Item Name="[capacity]">m_nMaxSize</Item>
      <ArrayItems>
        <Size>m_nSize</Size>
        <ValuePointer>m_pData</ValuePointer>
      </ArrayItems>
    </Expand>
  </Type>

  <Type Name="CList&lt;*,*&gt;">
    <AlternativeType Name="CObList"></AlternativeType>
    <AlternativeType Name="CPtrList"></AlternativeType>
    <AlternativeType Name="CStringList"></AlternativeType>
    <AlternativeType Name="CTypedPtrList&lt;*,*&gt;"></AlternativeType>
    <DisplayString>{{Count = {m_nCount}}}</DisplayString>
    <Expand>
      <Item Name="Count">m_nCount</Item>
      <LinkedListItems>
        <Size>m_nCount</Size>
        <HeadPointer>m_pNodeHead</HeadPointer>
        <NextPointer>pNext</NextPointer>
        <ValueNode>data</ValueNode>
      </LinkedListItems>
    </Expand>
  </Type>
  
  <Type Name="CMap&lt;*,*,*,*&gt;::CAssoc">
    <AlternativeType Name="CMapPtrToWord::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapPtrToPtr::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapStringToOb::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapStringToPtr::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapStringToString::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapWordToOb::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapWordToPtr::CAssoc"></AlternativeType>
    <AlternativeType Name="CTypedPtrMap&lt;*,*,*&gt;::CAssoc"></AlternativeType>
    <DisplayString>{{key={key}, value={value}}}</DisplayString>
  </Type>

  <Type Name="CMap&lt;*,*,*,*&gt;">
    <AlternativeType Name="CMapPtrToWord"></AlternativeType>
    <AlternativeType Name="CMapPtrToPtr"></AlternativeType>
    <AlternativeType Name="CMapStringToOb"></AlternativeType>
    <AlternativeType Name="CMapStringToPtr"></AlternativeType>
    <AlternativeType Name="CMapStringToString"></AlternativeType>
    <AlternativeType Name="CMapWordToOb"></AlternativeType>
    <AlternativeType Name="CMapWordToPtr"></AlternativeType>
    <AlternativeType Name="CTypedPtrMap&lt;*,*,*&gt;"></AlternativeType>
    <DisplayString Condition="m_nHashTableSize &gt;= 0 &amp;&amp; m_nHashTableSize &lt;= 65535">{{size={m_nHashTableSize}}}</DisplayString>
    <Expand>
      <Item Name="num bins">m_nHashTableSize</Item>
      <ArrayItems>
        <Size>m_nHashTableSize</Size>
        <ValuePointer>m_pHashTable</ValuePointer>
      </ArrayItems>
    </Expand>
  </Type>

  <Type Name="CMap&lt;*,*,*,*&gt;">
    <AlternativeType Name="CMapPtrToWord"></AlternativeType>
    <AlternativeType Name="CMapPtrToPtr"></AlternativeType>
    <AlternativeType Name="CMapStringToOb"></AlternativeType>
    <AlternativeType Name="CMapStringToPtr"></AlternativeType>
    <AlternativeType Name="CMapStringToString"></AlternativeType>
    <AlternativeType Name="CMapWordToOb"></AlternativeType>
    <AlternativeType Name="CMapWordToPtr"></AlternativeType>
    <AlternativeType Name="CTypedPtrMap&lt;*,*,*&gt;"></AlternativeType>
    <DisplayString>{Hash table too large!}</DisplayString>
  </Type>
  

  <Type Name="ATL::CAtlMap&lt;*,*,*,*&gt;">
    <AlternativeType Name="ATL::CMapToInterface&lt;*,*,*&gt;"/>
    <AlternativeType Name="ATL::CMapToAutoPtr&lt;*,*,*&gt;"/>
    <DisplayString>{{Count = {m_nElements}}}</DisplayString>
    <Expand>
      <Item Name="Count">m_nElements</Item>
      <ArrayItems>
        <Size>m_nBins</Size>
        <ValuePointer>m_ppBins</ValuePointer>
      </ArrayItems>
    </Expand>
  </Type>
  <Type Name="ATL::CAtlMap&lt;*,*,*,*&gt;::CNode">
    <DisplayString Condition="this==0">Empty bucket</DisplayString>
    <DisplayString Condition="this!=0">Hash table bucket</DisplayString>
  </Type>
</AutoVisualizer>

Visualizing CMap is a bit tricky, and I haven’t yet taken the time to look deep into it – but the file is hopefully useful as it is. To use it, just save the text as, say, MfcContainers.natvis, either under %VSINSTALLDIR%\Common7\Packages\Debugger\Visualizers (requires admin access) or under %USERPROFILE%\My Documents\Visual Studio 2012\Visualizers\ .

NatStepFilter Files

- are the new and improved substitute for the NoStepInto registry key. While there are some online hints and traces, the natstepfilter spec is yet to be introduced into MSDN – or even the VC++ team blog. For now you can view the format specification, along with some good comments, in the %VSINSTALLDIR%\Xml\Schemas\natstepfilter.xsd near you, or even better – inspect a small sample at %VSINSTALLDIR%\Common7\Packages\Debugger\Visualizers\default.natstepfilter.

The default.natstepfilter was implemented by Stephan T. Lavavej, and is very far from complete – both because of regex limitations and because of a decision not to impose non-overridable limitations on users:

“Adding something to the default natstepfilter is a very aggressive move, because I don’t believe there’s an easy way for users to undo it (hacking the file requires admin access), and it may be surprising when the debugger just decides to skip stuff.”

I can think of several ways for users to override .natstepfilter directives (never mind stepping in via assembly – how about setting a plain breakpoint in the function you wish to step into?) – and so I don’t agree with that decision. Still, I hope the default rules will improve alongside the documentation. We mostly avoid STL, so I haven’t had a need to customize .natstepfilter files yet – I’ll be sure to share such customizations if I do go there.
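For completeness, here’s roughly what a custom file should look like, judging by the .xsd and the shipped default – the names and regexes below are made up, so treat it as an unverified sketch:

<?xml version="1.0" encoding="utf-8"?>
<StepFilter xmlns="http://schemas.microsoft.com/vstudio/debugger/natstepfilter/2010">
  <!-- 'Name' takes a regex over the full function name -->
  <Function>
    <Name>MyLib::.*::Get.*</Name>
    <Action>NoStepInto</Action>
  </Function>
  <Function>
    <Name>std::.*</Name>
    <Action>NoStepInto</Action>
  </Function>
</StepFilter>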

Caveat

Both improvements, natvis and natstepfilter files, do not work when debugging native/managed mixed code, which sadly renders them unusable for most of our code. While this behavior is documented, I would hardly say it is ‘by design’. It does seem to irritate many others, so there is hope – as Brad Sullivan writes, MS are –

“… working on making everything just work in a future release of Visual Studio.”

Entry Point Not Found, and other DLL Loading Problems

Occasionally I come across DLL load problems:

The verbosity of the error messages varies greatly. In their raw form these include at least the DLL name, but as various frameworks come into play (for the error message above, it’s .NET), native exceptions are caught and re-thrown, and more often than not helpful information is lost on the way.

Turns out there’s a built-in way to get verbose Windows-loader output: the Show Loader Snaps flag. The easiest way to set it is with the gflags utility, bundled with the Debugging Tools for Windows:

Under the hood, it merely adds the FLG_SHOW_LDR_SNAPS flag (0x00000002) to a DWORD value in the relevant IFEO registry key. This in turn causes the Windows loader to set the _ShowSnaps variable in the ntdll copy specific to the named process.
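(For reference, the command-line equivalent should be something like gflags /i YourApp.exe +sls – ‘sls’ being the gflags abbreviation for Show Loader Snaps, and YourApp.exe of course a placeholder for your executable’s name.)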

And now, behold the new and shiny loader trace (dumped to the debugger output window):

…    

2724:245c @ 11813487 - LdrpFindOrMapDll - RETURN: Status: 0x00000000

2724:245c @ 11813487 - LdrpLoadImportModule - RETURN: Status: 0x00000000

2724:245c @ 11813487 - LdrpLoadImportModule - RETURN: Status: 0x00000000

2724:245c @ 11813487 - LdrpLoadImportModule - RETURN: Status: 0x00000000

2724:245c @ 11813487 - LdrpSnapThunk - WARNING: Hint index 0x70a for procedure "?Revert@CStreamMemory@@UAGJXZ" in DLL "YaddaYadda.dll" is invalid

2724:245c @ 11813487 - LdrpSnapThunk - ERROR: Procedure "?Revert@CStreamMemory@@UAGJXZ" could not be located in DLL "YaddaYadda.dll"

First-chance exception at 0x77321d32 (ntdll.dll) in Strategist.exe: 0xC0000139: Entry Point Not Found.

Bam! There’s the offending DLL and the offending imported function, right there in the debugger.

Like many other useful features – it is documented, but very low on discoverability. Which is a fancy way of saying you can find it only if you already know exactly what you are looking for. I personally got around to it after digging around in ntdll assembly (just like Matt Pietrek, 14 years ago), trying to get to a string containing the name of an offending DLL.

The windows-copycat open-source ReactOS source gives a nice view of the internal usage of this flag – called ShowSnaps in their source. The ‘snapping’ verb in this context refers to one of the actions performed by the loader: after rebasing the loaded DLL in the loading process’s memory space, the DLL’s exported function addresses are updated and must be copied to the Import Address Table of the importing process (or other DLL). This – in this context – is called snapping, and that’s where the extra tracing is hooked.