x86/x64 Numerical differences – Correction

In a previous post, a truncation scheme was suggested to circumvent x86/x64 differences in math library implementations:

 
#pragma once 
#include <math.h>

inline double Truncate(double arg) 
{ 
   __int64 &ResInt = reinterpret_cast<__int64&> (arg); 
   ResInt &= 0xFFFFFFFFFFFFFFF8; // set the final 3 bits to zero 
   ResInt |= 4; // estimate the middle of the truncated range 
   double &roundedRes = reinterpret_cast<double&> (ResInt); 
   return roundedRes; 
} 

inline double MyCos(const double ang) 
{ 
   return Truncate(cos(ang)); 
}

#define cos MyCos
...
 

Since then I have accumulated some mileage with the scheme, and have come to understand that this line:

 
...
   ResInt |= 4; // estimate the middle of the truncated range 
...

– is flawed.

Since we drop the last 3 bits of accuracy (a range of 8 ulps), it seemed like a good idea to fill the missing bits with the mid-value of the truncation range – thereby lowering the max truncation error from 7 ulps (truncating b111 to b000) to 4 ulps (truncating b000 to b100).

However, this lowers the average error only if you assume that inputs to these functions are uniformly distributed.

In other words, in real code you are far, far more likely to take the cosine of 0 than the cosine of 0.00000000000000003, so you are better off holding back the |=4 sophistication and just sticking to the 000 suffix.

Even worse, in the wise-ass |=4 version, taking cosine of 0 would give a value slightly larger than 1, thereby potentially causing more subtle numerical difficulties than those saved.
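To see this concretely, here is a minimal sketch (using the same reinterpret_cast idiom) of what the |=4 version does to cos(0):

#include <cstdio>

int main()
{
   double x = 1.0;             // cos(0); bit pattern 0x3FF0000000000000
   __int64 &bits = reinterpret_cast<__int64&>(x);
   bits &= 0xFFFFFFFFFFFFFFF8; // the final 3 bits are already zero
   bits |= 4;                  // now 0x3FF0000000000004 -
   printf("%.17g\n", x);       // - prints 1.0000000000000009
}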

All in all, currently my code uses the simple version:

 
#pragma once 
#include <math.h>

inline double Truncate(double arg) 
{ 
   __int64 &ResInt = reinterpret_cast<__int64&> (arg); 
   ResInt &= 0xFFFFFFFFFFFFFFF8; // set the final 3 bits to zero 
   double &roundedRes = reinterpret_cast<double&> (ResInt); 
   return roundedRes; 
}

inline double MyCos(const double ang) 
{ 
   return Truncate(cos(ang)); 
}

#define cos MyCos
...
 

Linker Weak Symbols

C++’s One-Definition-Rule roughly states that

In the entire program, an object or non-inline function cannot have more than one definition; if an object or function is used, it must have exactly one definition.

Which sounds like a good idea – until reality kicks in with all its hairy details.

How, for example, is it possible to replace global operator new(), or many other overridable CRT functions? If a function is decorated as inline but the optimizer decides not to inline it (a very common scenario) – its body is emitted in multiple translation units. Can a linker possibly handle that without breaking the ODR?
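As a quick illustration of the inline case – a sketch of a header included from many .cpp files:

// Utils.h - included from many translation units.
// If the optimizer decides not to inline Twice, each including TU emits
// its own copy of the body, and the linker must somehow keep just one.
inline int Twice(int x)
{
   return x + x;
}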

Enter weak symbols. In a nutshell:

During linking, a strong symbol can override a weak symbol of the same name. In contrast, two strong symbols that share a name yield a link error.

A symbol, of course, can be either a function or an extern variable. Unlike (most?) other compilers, VC++ does not expose an explicit way of declaring symbols as weak – but there are two alternatives that come close (both sketched below):

  1. __declspec(selectany), which directs the linker to select just one (any one) of multiple definitions for the symbol and discard the rest. MS explicitly state this as a quasi-answer for not exposing weak references to the programmer, but as a commenter notes this is not satisfying – one could hope to be able to declare a single implementation as *strong*, thus enforcing its selection at build time.
  2. The undocumented /alternatename linker directive (issued via #pragma comment), found in CRT sources and mentioned in this StackOverflow answer.  This one helps mimic a different weak-symbol functionality: initializing the symbol to zero if no definition is found.  This too hardly suffices as a replacement.
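Here is a minimal sketch of both alternatives – the symbol names are hypothetical, and the /alternatename spelling assumes x86-style underscore decoration:

// (1) __declspec(selectany): every TU that includes this line emits a
//     definition; the linker silently picks one and discards the rest.
__declspec(selectany) int g_Verbosity = 1;

// (2) /alternatename: if no definition of _CustomInit is found at link
//     time, resolve references to it against _DefaultInit instead.
extern "C" void DefaultInit();
extern "C" void CustomInit();   // may or may not be defined in some TU
#pragma comment(linker, "/alternatename:_CustomInit=_DefaultInit")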

The VC++ toolchain does use weak symbols internally (i.e., the compiler generates them and the linker consumes them). You can inspect which symbols were treated as weak by running dumpbin /SYMBOLS on an obj file.   Typical output would be -

Section length   8C, #relocs    E, #linenums    0, checksum 9CA493CF, selection    5 (pick associative Section 0xA6)
Relocation CRC 4EF609B6
2B8 00000000 SECTAA notype       Static       | __ehfuncinfo$??0MyClass@@QAE@XZ
2B9 00000024 SECTAA notype       Static       | __unwindtable$??0MyClass@@QAE@XZ
2BA 00000000 UNDEF  notype ()    External     | __purecall
2BB 00000000 UNDEF  notype ()    External     | ??_GMyClass@@UAEPAXI@Z (public: virtual void * __thiscall MyClass::`scalar deleting destructor'(unsigned int))
2BC 00000000 UNDEF  notype ()    WeakExternal | ??_EMyClass@@UAEPAXI@Z (public: virtual void * __thiscall MyClass::`vector deleting destructor'(unsigned int))

Note the WeakExternal tag in the last line.
This snippet isn’t entirely random – it demonstrates another problem with choosing not to expose weak linkage to users: what do you do with compiler-generated functions? Stay tuned.

x86/x64 Library Numerical differences

There are many online collections of 64-bit migration pitfalls, but I recently came across two that appear not to be mentioned elsewhere.

First, downright compiler bugs.  We still have those, and some raise their heads only in 64 bit.  (btw – my sincere apologies to Eric Brumer for venting out over him like that. He is not to blame for MS’s infuriating support policy).

Second and more importantly, different implementations of library math functions!

Here are two quick example results from VC++ 2013:

cos(0.37034934158424915),   on 32 gives 0.93220096141031161, on 64 gives 0.93220096141031150.

cos(0.81476855148534799),   on 32 gives 0.68603680609366247, on 64 gives 0.68603680609366235.

(in both cases, 32 bit was actually closer to the accurate result – but that’s probably a coincidence).
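If you want to check your own toolset, here is a minimal repro sketch – build once as Win32 and once as x64, and compare printouts:

#include <cstdio>
#include <cmath>

int main()
{
   // 17 significant digits - enough to round-trip a double exactly
   printf("%.17g\n", cos(0.37034934158424915));
   printf("%.17g\n", cos(0.81476855148534799));
   return 0;
}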

This is not the same as the compiler making different decisions on different platforms: the implementations of trigonometric functions were hand-crafted in assembly (at least in 32 bit), and each CRT version knowingly takes different code paths based on the exact platform and architecture (sometimes based on run-time processor inspection).

These two examples are the bottom line of a tedious, several-day debugging session.  This seemingly negligible difference manifested itself as a ~0.5% difference between the results of a numerical optimization routine in 32 and 64 bit VC++.

While not strictly a bug, this behaviour does make me uncomfortable in several respects.

(1) Judging by some traces I compared during debugging, in ~95% of cases the transcendental functions coincide exactly (to the last digit) on 32 and 64. This makes one assume they were aiming for binary compatibility, and wonder whether the 5% difference is intentional.

(2) Stepping through the x64 implementation, it makes use of only vanilla SSE instructions, fully accessible to x86. There’s no technical reason preventing the implementations from coinciding.

(3) IEEE-754 underwent a major overhaul in 2008, and the new version includes a much-needed clause – still phrased as a recommendation. Quoting Wikipedia:

…it recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.

I was hoping /fp:precise would have such an effect, but apparently it doesn’t.  As far as I can tell, today the only way of achieving such reproducibility is by hand-crafting your own function implementations.

Or if, like me, you can live without the last digits of precision, you can just make do without them.  I now include the following code in every file that uses trig/inverse-trig functions.
[Edit: see a fix in a newer post.]

// TruncatedFuncs.h

#pragma once
#include <math.h>

inline double Truncate(double arg)
{
	__int64 &ResInt = reinterpret_cast<__int64&> (arg);
	ResInt &= 0xFFFFFFFFFFFFFFF8;  // set the final 3 bits to zero
	ResInt |= 4;   // estimate the middle of the truncated range
	double	&roundedRes = reinterpret_cast<double&> (ResInt);

	return roundedRes;
}

inline double MyCos(const double ang)
{
	return Truncate(cos(ang));
}

inline double MySin(const double ang)
{
	return Truncate(sin(ang));
}

inline double MyTan(const double ang)
{
	return Truncate(tan(ang));
}

inline double MyAcos(const double ang)
{
	return Truncate(acos(ang));
}

inline double MyAsin(const double ang)
{
	return Truncate(asin(ang));
}

inline double MyAtan2(const double y, const double x)
{
	return Truncate(atan2(y, x));
}

#define cos MyCos
#define sin MySin
#define tan MyTan
#define acos MyAcos
#define asin MyAsin
#define atan2 MyAtan2
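Usage is transparent: since the macros rewrite call sites, any file that includes the header gets the truncated versions automatically. A hypothetical example:

#include "TruncatedFuncs.h"

double Heading(const double dx, const double dy)
{
	// expands to MyAtan2(dy, dx), which truncates the library result
	return atan2(dy, dx);
}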

Solving rxinput.dll Crashes

Recently I started getting hit by access violations with weird stacks:

It seems an exception is thrown early in the life of some thread, during the initialization routine of rxinput.dll – a dll I had never heard of. Naïve googling taught me pretty much nothing (except that this dll had already caused at least one other headache).

The dll’s location on disk was Program Files\NVIDIA corporation\NvStreamSvr, which gave a lead for some finer-grained googling: it turns out this is part of a 2013 nVidia project called SHIELD, intended to stream games from a PC to other screens within Wi-Fi range. Sounds interesting – never heard of that either.

The nVidia streamer is managed by a system service:

…and so can be shut down. That didn’t help – rxinput.dll somehow kept getting injected into my executable and crashing it (I saw detoured.dll in the same folder as rxinput.dll – most probably some win32 API was already hooked. In hindsight, perhaps a restart would have helped). After disabling the service I was able to rename or delete rxinput and the crashing stopped – but I had no way of telling what component was destabilized in the process.

What eventually did the trick was uninstalling nVidia’s GeForce Experience – which is not only unsolicited (it was probably bundled with some GPU driver), but also still in beta. The entire NvStreamSvr folder is now gone, along with the NVIDIA Streamer and Update Daemon services – and with them, my executable’s crashes.

Hope this saves someone out there some trouble.

Another Look at the VS2012 Auto Vectorizer

A while ago I did some experimenting with the (then beta) VS2012. After these experiments our team migrated to the 2012 IDE but kept to the 2010 toolset. Since then much has happened: an official VS2012 launch + 4 updates, rather thorough documentation, and quite a few online talks by the compiler team. It was high time to take another look at the 2012 toolset.

What I care about

Our scenario is somewhat unpleasant, but I suspect very typical in the enterprise world: we have a large (~800K LOC) C++ code base, with some legacy niches that date back 15+ years. The code is computationally intensive and sensitive to performance and yet extremely rich in branch logic. While C++ language advances are nice, what I care about are backend improvements – and specifically, auto vectorization.

Why you should care too

In the last decade or so, virtually 100% of the progress in the x86/x64 ISA was made in the processor’s vector units*. The software side, however, is veeeery slow to catch up: to this day, making any use of SSE/AVX processor treats requires non-standard, non-portable, hard, low-level tweaks – which make economic sense only in specific market niches (say, video editing & 3D). Moreover, as the years go by, even in these niches such costly tweaks make less and less sense, as the effort is probably better invested in moving execution to the GPU.

If you think about this for a second – our industry is in a somewhat ridiculous state. For over 10 years, the leading general-purpose processor architecture has been evolving in a direction that just isn’t useful to most of the software it runs!

This is exactly where auto-vectorization ought to come in.
 
To make use of all these virgin, luscious silicon fields, C++ compilers need to be extra clever. They must be able to reason about code – without any help from the code itself (well, not yet) – and automatically translate execution to SIMD units where possible. This is a tough job, and AFAIK until recently only Intel’s compiler was somewhat up to the task. (I passionately hate Intel’s compiler for different reasons, but that’s beside the point now). Only towards VS2012 did MS decide to try to catch up and add decent vectorization support, and if this effort succeeds – I truly believe it can revolutionize the SW industry, no less.

So, I dedicated two afternoons to building one of our products with VS2012’s latest public bits and /Qvec-report:2, to try to understand how much of the vectorization potential is fulfilled.
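(For the unfamiliar: /Qvec-report:2 makes the compiler report on every loop it analyzes, whether vectorized or not. A toy example, with a hypothetical file name:)

// Vec.cpp - compile with:  cl /O2 /Qvec-report:2 /c Vec.cpp
// A loop this simple does vectorize, and the build output includes
// something like "info C5001: loop vectorized".
void Add(float* a, const float* b, const int n)
{
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}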

Well, are we there yet?

No.

Only a negligible percentage of the loops was vectorized successfully. A random inspection of the reported vectorization failures shows that almost all of them are not due to real syntactic reasons. I created Connect reports for some of the bogus failures – here are the technical details.

1. Copy operations are pretty much never vectorized. Even MS internal STL code -

template<class _OutIt, class _Diff, class _Ty> inline
_OutIt _Fill_n(_OutIt _Dest, _Diff _Count, const _Ty& _Val)
{    // copy _Val _Count times through [_Dest, ...)
   for (; 0 < _Count; --_Count, ++_Dest)
       *_Dest = _Val;
   return (_Dest);
}

- fails to vectorize, the reported reason being the generic 500. Even worse, the following snippet from MSDN itself:

void code_1300(int *A, int *B)
{
    // Code 1300 is emitted when the compiler detects that there is
    // no computation in the loop body.

    for (int i=0; i<1000; ++i)
    {
        A[i] = B[i]; // Do not vectorize, instead emit memcpy
    }
}

- fails to emit memcpy, contrary to what its own comment promises. Eric Brumer responds that they are indeed working on this very problem.

2. Vectorization decisions by the compiler are extremely sensitive to unrelated code changes. Eric Brumer’s investigation – in the connect link – shows (IIUC) that vectorization decisions depend on previously made inlining decisions, which is what makes them highly fragile and undependable. The reported reasons for failures in these cases seem outright random. Again, they are working on it.

3. new-operator declaration syntax hides the fact that allocated buffers can be safely considered non-aliased. This might be for valid legal reasons, but comes at a formidable price: the entire alias-analysis done by the compiler is severely crippled, and major optimization opportunities (vectorization being just one) are thereby missed. Stephan Lavavej reports, quote, ‘the compiler back-end team has an active work item to investigate this’.

4. This one is just a quibble, really: it turns out the report message ‘1106 Inner loop is already vectorized. Cannot also vectorize the outer loop’ is misleading. Outer loops are never vectorized, regardless of inner-loop vectorization.

5. About vectorizer report ‘1303 Too few loop iterations for vectorization to provide value’: this pretty much rules out any optimization of 2D/3D loops (see the sketch below). Now, for one, Intel made a significant investment in a short-vector math library (intrinsic to the compiler) to harvest potential speedups in just such cases. Second, there are quite a few hand-vectorized 3D libraries out there, so it seems others have also reached the conclusion that this is a worthy optimization. So while I don’t have decisive quantitative data – I find this reported reason highly suspicious.
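A hypothetical sketch of the kind of loop that gets rejected this way:

// a typical 3D-math inner loop: only 3 iterations, so the vectorizer
// decides the vector setup cost outweighs the benefit and reports 1303
void Scale3(double v[3], const double s)
{
    for (int i = 0; i < 3; ++i)
        v[i] *= s;
}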

I came across more bumps and weird behaviour, but decided not to investigate any deeper than this.

Bottom Line

People infinitely smarter than me are investing tremendous effort in vectorization technology, and I’m sure it will grow to be impressive. That being said, marketing divisions inevitably work much faster than R&D – and it seems VS2012 is just the start of a ramp-up stage.

For us, a VS upgrade is a major hassle. It involves cross-team coordination, chasing third-party vendors for new builds, and the unavoidable ironing-out of the various wrinkles that come with any migration. I just can’t justify this hassle with any tangible added value for us**. We’ll most likely ‘go Vista’ on VS2012, and just skip it quietly.

I’m anxious to return and test the vectorizer again – but only after some service-packs*** for VS2013 are out****.
__________________________________________________

* That’s not to say no other significant progress was made in the processor space itself – transistors got tiny, caches got enormous, memory controllers and graphics processors were integrated in, etc. etc. I am saying that practically all the architectural innovations in the last decade were SIMD extensions.

** Well, there is this undocumented goodie that holds some very tangible added value for us, but probably not enough. More on that some day.

*** I know! updates, updates.

**** Footnotes are fun. Just saying.

Find Where Types are Passed by Value

Say you’re working on a large code base, and you’ve come across several instances where some type of non-negligible size is passed as an argument by value – where it would be more efficient to pass by const reference. Fixing a few occurrences is easy, but how can you efficiently list all places throughout the code where this happens?

Here’s a trick: define the type as aligned, and rebuild. The compiler will now shout exactly where the type is passed by value, since aligned types cannot be passed as such.
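A minimal sketch, assuming the type of interest is a hypothetical Matrix:

// temporarily decorate the type with an alignment specification...
__declspec(align(16)) struct Matrix
{
    double m[16];
};

// ...and every by-value pass now fails to compile, with something like
// error C2719: 'mat': formal parameter with __declspec(align('16')) won't be aligned
void Transform(Matrix mat);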

Double clicking every such error would get you immediately to where a fix is probably in order.

Discovering Which Projects Depend on a Project – II

In a previous post I shared a hack that enables detection of all projects that depend on a given one, either directly or indirectly. @Matt asks by mail if I can suggest a quick way to isolate only the direct dependencies.

Well, as a matter of fact, I can – but it is even uglier than the original hack. First, delete the project of interest from the solution:

Then note which of the other projects have changed. This can be as easy as noting a small ‘v’ beside them in the solution explorer, indicating a check-out:

It turns out that deleting a project (unlike, say, unloading it) chases it down in the references of all its sibling projects in the solution, and removes it from their references where present. This in turn modifies those project files, which is easy to spot visually. Sibling projects which refer to the deleted project only indirectly still hold references to the intermediate projects – and so are left unchanged. Therefore this hack isolates only direct references.

Of course, don’t forget to undo these changes immediately afterwards.

Altogether, these hacks are mighty hackish. If you find yourself caring about dependency management more than once or twice, just go get some tool.

Discovering Which Projects Depend on One

I am working with several large-ish (100+ project) solutions – and at this scale, dependency management is a very real issue. While you can easily view (and set) the dependencies of a project by viewing its references, there is no obvious tool to answer the reverse question: which projects depend on a given one?

Obviously a hack is in order. Enter project dependencies – the predecessor of project references. Both appear in the project context menu (in the solution explorer):

In a nutshell, dependencies are stored per solution while references are stored per project, as they should be. But that’s beside the point here. The point is: the dependencies display is smart enough to keep you from forming cyclic dependencies. When you click ‘Project Dependencies…’ you’d see something like this:

The checked boxes indicate projects that the current one (selected in the top combo) includes either in its references or in its dependencies. The greyed-out boxes (marked in a red rectangle here) indicate projects that include the current one in a similar manner. Indeed, if you try to check a greyed-out box – thereby adding it to the current project’s dependencies – you get:

So there you have it: the list of greyed out boxes is a poor man’s answer to the question – which projects depend on the current one.

Note two limitations:

  1. These dependencies are both direct and indirect. Distinguishing these still requires some manual extra work.
  2. This hack captures only linker dependencies among projects, and is blind to dependency by header-file inclusion. Generally speaking, the latter amounts to dependency upon interfaces rather than implementations (neglecting templates and other inlines), and so is a weaker form of dependency – but still one that might be of interest.

A few months ago I decided such hacks are no replacement for a proper tool, and started using CppDepend. It is not perfect, but I’m growing to like it. Maybe more on that in a future post – but in the meantime this hack should be useful to anyone working in large solutions like mine.

VS2012 Migration #3: autoexp and NoStepInto Replacements

In the past I blogged quite a few times about two immensely useful, albeit mostly unofficial, debugger features: watch modification via autoexp.dat, and step-into modification via the NoStepInto registry key. A long while ago I raised two suggestions at MS UserVoice, to invest in making these two semi-hacks into documented, supported features. The first suggestion got some traction, and is officially implemented in VS2012. The second went mostly ignored – but nevertheless, there’s a new and better – though still undocumented – way to skip functions while stepping.

NatVis files

The Natvis (native-visualizers) file format is the shiny new replacement for autoexp.dat. It is well documented, and although still quite rough around the edges – bugs are accepted and treated, which means that for the first time it is actually supported. The new apparatus comes with several design advantages:

  1. It seems to be better isolated, and doesn’t crash the IDE so much;
  2. New visualizer-debugging facilities are built in;
  3. Separate customized visualizers can be kept in separate files, allowing easier sharing (e.g., library writers can now distribute .natvis files with their libraries);
  4. Natvis files can be placed in per-user locations.

It isn’t that much fun rehashing the syntax – being official and all – but I will include here a custom MFC-containers natvis, similar to the autoexp section I shared a while back:

<?xml version="1.0" encoding="utf-8"?>
<AutoVisualizer xmlns="http://schemas.microsoft.com/vstudio/debugger/natvis/2010">
  <!--from afxwin.h -->
  <Type Name="CArray&lt;*,*&gt;">
    <AlternativeType Name="CObArray"></AlternativeType>
    <AlternativeType Name="CByteArray"></AlternativeType>
    <AlternativeType Name="CDWordArray"></AlternativeType>
    <AlternativeType Name="CPtrArray"></AlternativeType>
    <AlternativeType Name="CStringArray"></AlternativeType>
    <AlternativeType Name="CWordArray"></AlternativeType>
    <AlternativeType Name="CUIntArray"></AlternativeType>
    <AlternativeType Name="CTypedPtrArray&lt;*,*&gt;"></AlternativeType>
    <DisplayString>{{size = {m_nSize}}}</DisplayString>
    <Expand>
      <Item Name="[size]">m_nSize</Item>
      <Item Name="[capacity]">m_nMaxSize</Item>
      <ArrayItems>
        <Size>m_nSize</Size>
        <ValuePointer>m_pData</ValuePointer>
      </ArrayItems>
    </Expand>
  </Type>

  <Type Name="CList&lt;*,*&gt;">
    <AlternativeType Name="CObList"></AlternativeType>
    <AlternativeType Name="CPtrList"></AlternativeType>
    <AlternativeType Name="CStringList"></AlternativeType>
    <AlternativeType Name="CTypedPtrList&lt;*,*&gt;"></AlternativeType>
    <DisplayString>{{Count = {m_nCount}}}</DisplayString>
    <Expand>
      <Item Name="Count">m_nCount</Item>
      <LinkedListItems>
        <Size>m_nCount</Size>
        <HeadPointer>m_pNodeHead</HeadPointer>
        <NextPointer>pNext</NextPointer>
        <ValueNode>data</ValueNode>
      </LinkedListItems>
    </Expand>
  </Type>
  
  <Type Name="CMap&lt;*,*,*,*&gt;::CAssoc">
    <AlternativeType Name="CMapPtrToWord::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapPtrToPtr::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapStringToOb::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapStringToPtr::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapStringToString::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapWordToOb::CAssoc"></AlternativeType>
    <AlternativeType Name="CMapWordToPtr::CAssoc"></AlternativeType>
    <AlternativeType Name="CTypedPtrMap&lt;*,*,*&gt;::CAssoc"></AlternativeType>
    <DisplayString>{{key={key}, value={value}}}</DisplayString>
  </Type>

  <Type Name="CMap&lt;*,*,*,*&gt;">
    <AlternativeType Name="CMapPtrToWord"></AlternativeType>
    <AlternativeType Name="CMapPtrToPtr"></AlternativeType>
    <AlternativeType Name="CMapStringToOb"></AlternativeType>
    <AlternativeType Name="CMapStringToPtr"></AlternativeType>
    <AlternativeType Name="CMapStringToString"></AlternativeType>
    <AlternativeType Name="CMapWordToOb"></AlternativeType>
    <AlternativeType Name="CMapWordToPtr"></AlternativeType>
    <AlternativeType Name="CTypedPtrMap&lt;*,*,*&gt;"></AlternativeType>
    <DisplayString Condition="(m_nHashTableSize &gt;= 0 &amp;&amp; m_nHashTableSize &lt;= 65535">{{size={m_nHashTableSize}}}</DisplayString>
    <Expand>
      <Item Name="num bins">m_nHashTableSize</Item>
      <ArrayItems>
        <Size>m_nHashTableSize</Size>
        <ValuePointer>m_pHashTable</ValuePointer>
      </ArrayItems>
    </Expand>
  </Type>

  <Type Name="CMap&lt;*,*,*,*&gt;">
    <AlternativeType Name="CMapPtrToWord"></AlternativeType>
    <AlternativeType Name="CMapPtrToPtr"></AlternativeType>
    <AlternativeType Name="CMapStringToOb"></AlternativeType>
    <AlternativeType Name="CMapStringToPtr"></AlternativeType>
    <AlternativeType Name="CMapStringToString"></AlternativeType>
    <AlternativeType Name="CMapWordToOb"></AlternativeType>
    <AlternativeType Name="CMapWordToPtr"></AlternativeType>
    <AlternativeType Name="CTypedPtrMap&lt;*,*,*&gt;"></AlternativeType>
    <DisplayString>{Hash table too large!}</DisplayString>
  </Type>
  

  <Type Name="ATL::CAtlMap&lt;*,*,*,*&gt;">
    <AlternativeType Name="ATL::CMapToInterface&lt;*,*,*&gt;"/>
    <AlternativeType Name="ATL::CMapToAutoPtr&lt;*,*,*&gt;"/>
    <DisplayString>{{Count = {m_nElements}}}</DisplayString>
    <Expand>
      <Item Name="Count">m_nElements</Item>
      <ArrayItems>
        <Size>m_nBins</Size>
        <ValuePointer>m_ppBins</ValuePointer>
      </ArrayItems>
    </Expand>
  </Type>
  <Type Name="ATL::CAtlMap&lt;*,*,*,*&gt;::CNode">
    <DisplayString Condition="this==0">Empty bucket</DisplayString>
    <DisplayString Condition="this!=0">Hash table bucket</DisplayString>
  </Type>
</AutoVisualizer>

Visualizing CMap is a bit tricky, and I haven’t yet taken the time to look deeply into it – but the file is hopefully useful as it is. To use it, just save the text as, say, MfcContainers.natvis, either under %VSINSTALLDIR%\Common7\Packages\Debugger\Visualizers (requires admin access), or under %USERPROFILE%\My Documents\Visual Studio 2012\Visualizers\.

NatStepFilter Files

- are the new and improved substitute for the NoStepInto registry key. While there are some online hints and traces, the natstepfilter spec is yet to be introduced into MSDN – or even the VC++ team blog. For now you can view the format specification, along with some good comments, in the %VSINSTALLDIR%\Xml\Schemas\natstepfilter.xsd near you – or better yet, inspect a small sample at %VSINSTALLDIR%\Common7\Packages\Debugger\Visualizers\default.natstepfilter.

The default.natstepfilter was written by Stephan T. Lavavej, and is very far from complete – both because of regex limitations and because of a decision not to impose non-overridable limitations on users:

“Adding something to the default natstepfilter is a very aggressive move, because I don’t believe there’s an easy way for users to undo it (hacking the file requires admin access), and it may be surprising when the debugger just decides to skip stuff.”

I can think of several ways for users to override .natstepfilter directives (never mind stepping in via assembly – how about setting a plain breakpoint in the function you wish to step into?), and so I don’t agree with that decision. Still, I hope the default rules will improve alongside the documentation. We mostly avoid STL, so I haven’t needed to customize .natstepfilter files yet – I’ll be sure to share such customizations if I do go there.
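For reference, here is a minimal .natstepfilter sketch – the namespace regex is hypothetical, but the structure follows the schema and the default file:

<?xml version="1.0" encoding="utf-8"?>
<StepFilter xmlns="http://schemas.microsoft.com/vstudio/debugger/natstepfilter/2010">
  <!-- never step into anything under MyLib::Logging -->
  <Function>
    <Name>MyLib::Logging::.*</Name>
    <Action>NoStepInto</Action>
  </Function>
</StepFilter>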

Caveat

Both improvements, natvis and natstepfilter files, do not work when debugging mixed native/managed code, which sadly renders them unusable for most of our code. While this behavior is documented, I would hardly say it is ‘by design’. It does seem to irritate many others, so there is hope – as Brad Sullivan writes, MS are –

“… working on making everything just work in a future release of Visual Studio.”