Tinkering with VS2015 (CTP 6)

Today I downloaded the latest VS bits and played around with the native debugger. It was a brief session, and so these recorded impressions will be brief as well.

🙂 Universal CRT is here!

And seems like a great idea.

A lot of cheese was moved around in the process, and it will probably take me a while to know my way around again. As a prominent example, dbgint.h is now replaced by debug_heap.cpp – which is a borderline-breaking change: dbgint.h was kind-of-documented (although wrapped in disclaimers and admittedly an internal implementation detail), and real code came tumbling down. What’s worse, the type declarations that were available in dbgint.h are now hidden in debug_heap.cpp – which includes many unpublished internal headers – and tool writers will probably have no choice but to cut and paste the type declarations and hope for the best.
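
For instance, here is the central declaration tool writers would now have to paste (this one is from memory, from a VS2013-era dbgint.h – and being an internal detail it may well change between CRT versions, which is exactly the problem):

typedef struct _CrtMemBlockHeader
{
    struct _CrtMemBlockHeader * pBlockHeaderNext;
    struct _CrtMemBlockHeader * pBlockHeaderPrev;
    char *                      szFileName;
    int                         nLine;
#ifdef _WIN64
    // reversed on Win64, to keep the struct 16-byte aligned
    int                         nBlockUse;
    size_t                      nDataSize;
#else
    size_t                      nDataSize;
    int                         nBlockUse;
#endif
    long                        lRequest;
    unsigned char               gap[4];     // == nNoMansLandSize
} _CrtMemBlockHeader;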

I’m not entirely sure this breaking change (and others like it) is by design. One can still hope the final bits will see this fixed. What’s much worse –

☹ Published MS symbols are all stripped

Which means you can’t step through CRT/MFC sources. This is a major setback in productivity, and I hope only a temporary one.

🙂 Context operator replaced

The context operator (‘{,,dll}symbol’), while mighty useful at debug time, was broken beyond repair – and, as I hoped back in 2009, it is now replaced by the windbg-like ‘!’ operator:
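
For reference, the two syntaxes side by side (my own sketch, not the screenshot – the module and symbol here are just examples):

{,,msvcr120d.dll}_CrtCheckMemory    // old context-operator syntax
msvcr120d.dll!_CrtCheckMemory       // new windbg-like syntax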

However, as apparent in the screenshot:

☹ Context operator no longer deduces type

… and explicit casts are in order where they previously weren’t. That might seem like a quibble, but it in fact prevents some very useful hacks previously available – notably checking memory integrity from the debugger:
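
A reminder of the hack in question (covered in an earlier post and revisited below) – this watch-window line, which no longer evaluates under the new operator since the function’s type is not deduced:

{,,msvcr120d.dll}_CrtCheckMemory()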

The closest I currently have to a workaround is to capture the function to a variable in code, and invoke it from the watch window:
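
In code, something like this (a sketch – the typedef and variable names are mine):

#include <crtdbg.h>

// Anywhere in the debugged binary – a typed variable the debugger can call through:
typedef int (__cdecl *PfnCrtCheckMemory)(void);
PfnCrtCheckMemory g_pfnCrtCheckMemory = &_CrtCheckMemory;

// Then, in the watch window:
//     g_pfnCrtCheckMemory()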

🙂 Micro Profiling!

After you step past a code line in the debugger, a neat little tooltip appears:

Even in disassembly!

🙂 Wide-register watch

‘xmm0il’ now works in x64 also.

BTW, the default platform is now ‘x86’, and ‘Win32’ can be added as a separate configuration from the configuration manager. I’m not sure why, or what the difference is.

☹ Auto-vectorization

It seems little progress was made in auto-vectorization – AFAICT all previous reports still hold.

Accelerating Debug Runs, Part 2: _ITERATOR_DEBUG_LEVEL

A previous post discussed the Windows Debug Heap – the main motivation being how to avoid it, as it is just empty, expensive overhead, and it isn’t clear why it is on by default in the first place. Remarkably, 3 weeks after that post the VC team announced that from VC “14” (now in CTP) onwards the WDH will be opt-in rather than opt-out, as it should be. So hopefully in ~2 years the recommendations in that post will be largely obsolete. Anyway, I often face unworkably slow debug run times even after disabling the WDH, and further measures are in order.

In debug builds the VC++ CRT runs a hefty load of iterator-validation tests on all STL containers – the simplest example being an assertion raised when an std::vector subscript is out of range. This leads to the unfortunate reality wherein someone who writes C++ code ‘by the book’, using standard containers for everything, often ends up with code that is utterly unusable in debug. In one of our projects, image classes were coded with std::vector as the container for the image bits. The product code iterates intensively over the pixels of many images, and as a result debug builds completed a typical job in ~4 hours, whereas a release build completed it in ~4 minutes. For a long while, debugging that project was reduced to logging and stepping through disassembly, as debug builds were completely useless.
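
To get a feel for the numbers, here is the kind of inner loop such code runs (a hypothetical stand-in for our image classes): in a debug build, with the default iterator debug level, every subscript below goes through the checked-iterator machinery.

#include <vector>

int SumPixels(const std::vector<unsigned char>& pixels)
{
  int sum = 0;
  for (size_t i = 0; i < pixels.size(); ++i)
    sum += pixels[i];   // range-checked on every access in debug builds
  return sum;
}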

Now for some good news and some bad news. The good news is that this behavior is customizable via the _ITERATOR_DEBUG_LEVEL macro: #define it to 0 (or 1, if you have a particular need for it) early in the compilation – say, in the project properties or at the top of a precompiled header – and this disproportionate computational overhead is gone.
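
For example (a minimal sketch – the define must come before any standard header is included):

// Top of the precompiled header:
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>   // ...and the rest of the standard headers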

The bad news is that this doesn’t work.

MSVCMRTD.lib(locale0_implib.obj) : error LNK2022: metadata operation failed (8013118D) : Inconsistent layout information in duplicated types (std.basic_string<char,std::char_traits<char>,std::allocator<char> >): (0x0200004e).

Well, that was a tad dramatic – it doesn’t always work, and in /clr builds in particular.

<rant>Now /clr projects will probably forever be second-class citizens in the VC universe. Features will forever be coded for mainstream native code first, and will trickle down to /clr code as time and priorities permit (two notable examples are data breakpoints and debugger visualizers, both still unsupported in mixed debugging – but trust me, there are plenty more).</rant> Anyhow, as far as _ITERATOR_DEBUG_LEVEL goes, this is much more a bug than something resembling a decision. The venerable Stephan T. Lavavej elaborates (3rd reply from the bottom):

…The underlying problem is that _ITERATOR_DEBUG_LEVEL affects the representations of STL containers, and C++ (both native and especially managed) really hates it when code can’t agree on the representation of an object.  When _SECURE_SCL/_HAS_ITERATOR_DEBUGGING were added in VC8, we should have created 5 variants of the CRT/STL binaries (including DLLs).  Unfortunately we didn’t (this was before my time, otherwise I would have spoken up), and having only debug and release DLLs causes headaches.  We suffered from longstanding problems in VC8/9 until we overhauled how this worked in VC10.  During VC10 we untangled the worst of the problems by making std::string header-only.  With invasive surgery we were able to get native code working correctly in every case except one very obscure one that nobody has noticed or complained about yet.  (We now have 5 static libs, which solves the case of static linking absolutely 100% perfectly, but still only 2 DLLs.)  But managed code is structured differently, and the tricks that work in native don’t work for it.  As a result, customizing _ITERATOR_DEBUG_LEVEL basically doesn’t work under /clr[:pure].  Very few customers have encountered this (you’re one of the first) because we changed the release mode default to IDL=0 (which everyone wants), and few people want to modify debug mode’s default of IDL=2.

The thread is from Jan 2011, and this particular issue was resolved in VS2013. Similar issues remain, though, and I’m not sure /clr code will ever make it into routine test matrices at MS – so as the CRT code evolves, such issues will probably keep popping up.

Bottom line: if – like me – you’re debugging C++ code that is both managed and makes extensive use of STL, your mileage may seriously vary when trying to customize the iterator debug level. If you develop purely native code and are trying to accelerate debug runs, I do recommend judiciously setting _ITERATOR_DEBUG_LEVEL to zero – and raising it back only when you’re tracking concrete iterator issues.

In the same reply Stephan offers an alternative:

Have you considered making your “debug build” compile in release mode with no optimizations?  Release mode versus debug mode affects what CRT/STL/etc. you link to (and whether you can effectively debug into them) and as a side effect affects your IDL default, but it’s not inherently tied to whether your own code is compiled with optimizations or not, and that’s what affects the debuggability of your own code.  The IDE pairs release mode with optimizations and debug mode without optimizations, but there’s no fundamental reason linking the two.

This makes sense, and I briefly experimented with the approach. Sorry to report – still no success. While I can’t pinpoint the root cause yet, even when compiling release builds with /Od (optimizations disabled) the debugging experience is severely crippled (and yes, of course I set the proper PDB-generation switches in both the compiler and the linker). Local-variable watches and single-steps seemed highly erratic, the ‘this’ pointer for class methods seemed to stick to RCX throughout the method and thus gave rubbish on member watches – etc., etc.
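
For concreteness, the configuration I tried was roughly this (a sketch, starting from a vanilla Release configuration):

C/C++  → Optimization:             /Od     (disable optimizations)
C/C++  → Debug Information Format: /Zi     (generate PDB info)
Linker → Debugging:                /DEBUG  (generate PDB)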

However, this is a step in a better direction. More on that in the next post.

Debugging Memory Corruption II

Some years ago I shared a trick that lets you call _CrtCheckMemory from the debugger anywhere, without re-compilation. The updated (as of VS2013) string to type in a watch window is:

{,,msvcr120d.dll}_CrtCheckMemory()

Let’s expand on that today, in two steps.

Checking memory on every allocation

The CRT debug heap accepts a neat little flag called _CRTDBG_CHECK_ALWAYS_DF. Here’s how it is used:

#include <crtdbg.h>

int main()
{
    // Get the current flag
    int tmpFlag = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);

    // Turn on the corruption-checking bit
    tmpFlag |= _CRTDBG_CHECK_ALWAYS_DF;

    // Set the flag to the new value
    _CrtSetDbgFlag(tmpFlag);

    int* p = new int[100];  // allocate,
    p[100] = 1;             // corrupt (write just past the end),   and…

    int* q = new int[100];  // BOOM! the alarm fires here
}

Testing for corruption on every allocation can tangibly slow down your program, which is why the CRT allows testing only every N allocations, N being 16, 128 or 1024.  Usage adds half a line of code – pasted from MSDN:

// Get the current bits
int tmp = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);

// Clear the upper 16 bits and OR in the desired frequency
tmp = (tmp & 0x0000FFFF) | _CRTDBG_CHECK_EVERY_16_DF;

// Set the new bits
_CrtSetDbgFlag(tmp);

Note that testing for corruption on every memory allocation is nothing like testing on every memory write – the alarm will not fire at the exact time of the felony. But since your software allocates memory (even if indirectly) very often, this will hopefully narrow down the crime scene quickly.

Checking memory on every allocation – from the debugger

You might reasonably want to enable/disable these lavish tests at runtime.

The debug flags are stored in {,,msvcr120d.dll}_crtDbgFlag, and the numeric value of _CRTDBG_CHECK_ALWAYS_DF is 4, so one might hope that lines like these would enable and disable the intensive memory tests:

[screenshot: watch window, poking _crtDbgFlag directly]
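
Roughly like this (my reconstruction from the description above – hypothetical):

{,,msvcr120d.dll}_crtDbgFlag |= 4     // hoped to enable the tests
{,,msvcr120d.dll}_crtDbgFlag &= ~4    // hoped to disable them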

Alas, this doesn’t work – _CrtSetDbgFlag contains further logic that routes the input flags to internal variables. The easiest solution is to just call it:

[screenshot: watch window, calling _CrtSetDbgFlag]
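
My reconstruction of the gist (hedged – _CRTDBG_REPORT_FLAG is -1, and the values assume the default flag state of _CRTDBG_ALLOC_MEM_DF = 1):

{,,msvcr120d.dll}_CrtSetDbgFlag(-1)   // read the current flags: returns 1
{,,msvcr120d.dll}_CrtSetDbgFlag(5)    // enable:  1 | _CRTDBG_CHECK_ALWAYS_DF (4)
{,,msvcr120d.dll}_CrtSetDbgFlag(-1)   // read again: returns 5
{,,msvcr120d.dll}_CrtSetDbgFlag(1)    // disable: back to _CRTDBG_ALLOC_MEM_DF only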

The first two lines enable, the last two disable. If you’re running with non-default flags, the actual values you’d see might differ.

UseDebugLibraries and Wrong Defaults for VC++ Project Properties

Many of the projects I’m working on seem to have wrong default properties in the Debug configuration. For example, ‘Runtime Library’ is explicitly set to /MDd but defaults to /MD, ‘Basic Runtime Checks’ is explicitly set to /RTC1 but defaults to none, ‘Optimization’ is explicitly set to /Od but defaults to /O2, and so on:

[screenshots: property pages showing explicit values vs. the wrong defaults]

This recently caused us some trouble, and the investigation results are dumped below.

The direct reason is that these vcxproj’s are missing the ‘UseDebugLibraries’ element under the ‘Configuration’ PropertyGroup: it should be set to true in Debug and false in Release. A correct vcxproj should include elements like these –

<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
    <ConfigurationType>StaticLibrary</ConfigurationType>
    <UseDebugLibraries>true</UseDebugLibraries>
    <PlatformToolset>v120</PlatformToolset>
    <CharacterSet>Unicode</CharacterSet>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
    <ConfigurationType>StaticLibrary</ConfigurationType>
    <UseDebugLibraries>false</UseDebugLibraries>
    <PlatformToolset>v120</PlatformToolset>
    <CharacterSet>Unicode</CharacterSet>
</PropertyGroup>

Most ‘Configuration’ sub-elements (CharacterSet, ConfigurationType, etc.) directly control the import of custom property sheets, but UseDebugLibraries doesn’t. Instead, it is inspected at various hooks within the regular property sheets. For example, Microsoft.Cpp.mfcDynamic.props includes the following –

<ClCompile>
  <RuntimeLibrary Condition="'$(UseDebugLibraries)' != 'true'">MultiThreadedDLL</RuntimeLibrary>
  <RuntimeLibrary Condition="'$(UseDebugLibraries)' == 'true'">MultiThreadedDebugDLL</RuntimeLibrary>
</ClCompile>

Why UseDebugLibraries was missing from some libraries and present in others remained a mystery, until I noticed that the younger libraries tended to have this element. Indeed, the real culprit is the migration from VS2008 and older (vcproj format) to VS2010 and newer (vcxproj/MSBuild format): MS’s migration code just did not add this element. The generated projects are functional – they explicitly set every individual compilation switch affected by UseDebugLibraries – but that makes them overly verbose and a bit fragile, especially in the presence of junior devs who tend to stick to defaults…

So every library you have that is 4+ years old is susceptible to this migration bug, and I suggest you manually add UseDebugLibraries. If you have a central property sheet from which you can control multiple projects – add it there.
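
If you do have such a central property sheet, the addition could look like this (a minimal sketch – a hypothetical file imported by all projects):

<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup Label="Configuration">
    <UseDebugLibraries Condition="'$(Configuration)'=='Debug'">true</UseDebugLibraries>
    <UseDebugLibraries Condition="'$(Configuration)'!='Debug'">false</UseDebugLibraries>
  </PropertyGroup>
</Project>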

Not much point in reporting this to MS, is there? The chances of a fix are practically zero, and the issue would get equal web-presence here.

Vector Deleting Destructor and Weak Linkage

Now that the discussions on weak linker symbols and vector deleting destructors are in place, it is time to discuss a fact that might seem esoteric but has far-reaching implications. After that, it is time to ask for your help.

In VC++, vector deleting destructors are defined with weak linkage in the translation unit that defines the class, and with strong linkage in any translation unit that calls new[] on the class.

Say what?

The first part of this statement (v-d-dtors have weak linkage) was already demonstrated in the post on weak linkage – given any cpp file that defines a non-trivial class, you can dumpbin its obj file and see for yourself.

Now some code to demonstrate the full statement:

//C.h
struct C
{
  virtual ~C();
};

//C.cpp
#include "C.h"
C::~C() {}

//D.h
struct D
{
  void Func();
};

//D.cpp
#include "D.h"
#include "C.h"
void D::Func()
{
  C* p = new C[42];
}

A dumpbin of C.obj shows:

017 00000000 UNDEF  notype ()    External     | ??3@YAXPAX@Z (void __cdecl operator delete(void *))
018 00000000 SECT4  notype ()    External     | ??1C@@UAE@XZ (public: virtual __thiscall C::~C(void))
019 00000000 SECT6  notype ()    External     | ??_GC@@UAEPAXI@Z (public: virtual void * __thiscall C::`scalar deleting destructor'(unsigned int))
01A 00000000 UNDEF  notype ()    WeakExternal | ??_EC@@UAEPAXI@Z (public: virtual void * __thiscall C::`vector deleting destructor'(unsigned int))

While a dumpbin of D.obj shows:

01D 00000000 UNDEF  notype ()    External     | ??_L@YGXPAXIHP6EX0@Z1@Z (void __stdcall `eh vector constructor iterator'(void *,unsigned int,int,void (__thiscall*)(void *),void (__thiscall*)(void *)))
01E 00000000 UNDEF  notype ()    External     | ??_M@YGXPAXIHP6EX0@Z@Z (void __stdcall `eh vector destructor iterator'(void *,unsigned int,int,void (__thiscall*)(void *)))
01F 00000000 UNDEF  notype ()    External     | ??2@YAPAXI@Z (void * __cdecl operator new(unsigned int))
020 00000000 UNDEF  notype ()    External     | ??3@YAXPAX@Z (void __cdecl operator delete(void *))
021 00000000 SECT8  notype ()    External     | ?Func@D@@QAEXXZ (public: void __thiscall D::Func(void))
022 00000000 UNDEF  notype ()    External     | ??1C@@UAE@XZ (public: virtual __thiscall C::~C(void))
023 00000000 SECT4  notype ()    External     | ??0C@@QAE@XZ (public: __thiscall C::C(void))
024 00000000 SECT6  notype ()    External     | ??_EC@@UAEPAXI@Z (public: virtual void * __thiscall C::`vector deleting destructor'(unsigned int))

What this means is that to successfully complete the linkage of C.obj, the linker must now load D.obj: both refer to the same function, but C.obj carries only a weak external implementation, while D.obj carries a strong external one (of a C method!).

Ok, that’s kinda weird, but why should I care?

Here’s why:

What happens when C.cpp and D.cpp are part of a static library?

Unlike with executables (.exe or .dll), when processing a static lib the linker loads only obj files that are referenced – i.e., whose contents are needed for successful linkage. Once loaded, an obj file must have its contents link successfully (unless you’re building with /GL, but let’s ignore that here). Let’s expand the previous example a bit:

//main.cpp
#include "StaticLib\C.h"

int main()
{
  C c;
  return 0;
}

//StaticLib\C.h
struct C
{
  virtual ~C();
};

//StaticLib\C.cpp
#include "C.h"
C::~C() {}

//StaticLib\D.h
struct D
{
  void Func();
};

//StaticLib\D.cpp
#include "D.h"
#include "C.h"

extern void SomeJunkImplementedElsewhere();
void D::Func()
{
  C* p = new C[42];
  SomeJunkImplementedElsewhere();
}

Can you already see what happens now?

Now, for the program to build, main.obj drags in C.obj (it needs C’s destructor); C’s vtable references the vector deleting destructor, whose strong definition lives in D.obj – so D.obj is loaded too, and you must satisfy D.cpp’s linkage. That means dragging in yet another library, although you never consumed D’s functionality in the first place.

I wish this were just a theoretical peculiarity. The solutions I’m working on consist of a complicated network of literally hundreds of static libraries, and time and time again we find ourselves forced to drag in weird dependencies that the code we actually run never uses. It seems unbelievable, but almost all of these unexplainable dependencies boil down to this esoteric fact: vector deleting destructors have weak linkage at the point of class definition.

That was nice. Now go and report it.

I did. Over half a year ago.   The report was originally closed as ‘By Design’, and after an explicit request the following explanation from Karl Niu arrived:

To explain the “By Design” resolution, imagine that you have “new A[n]” and “delete[] pA” in different translation units. In such a case, the compiler needs to define the strong external in the translation unit containing the “new A[n]”.

Which I just don’t understand: the weak/strong debate is not over new[] or delete[], but rather over vector deleting destructors, which are not user-overridable in the first place. Wherever delete[] is overloaded, it should be able to fetch the vector deleting destructor from the translation unit that defined it – hopefully, the one that defined the class being deleted. I tried to ask again, twice, and have had no response for 6 months now.
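
To make my difficulty concrete, here is how I read the scenario in the reply (a hypothetical sketch):

//A.cpp – the "new C[n]" side: per the reply, the strong
//vector-deleting-dtor must be emitted here.
#include "C.h"
C* Make() { return new C[10]; }

//B.cpp – the "delete[] pC" side: the call is dispatched through C's
//vtable, so this translation unit needs no definition of its own anyway.
#include "C.h"
void Kill(C* pC) { delete[] pC; }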

Now, I regularly report many bugs at MS Connect, almost all of which never get resolved (which I can live with. I’m doing this mostly in hope of helping fellow devs googling their trouble) – but this one leaves me frustrated. It feels as if despite my best efforts I failed to clearly communicate the issue.    It seems like an esoteric technicality, yet it actively hinders decoupling – thereby damaging large software systems at the architecture level!

Why, golly Ofek, that’s really bad. But what can I do?

You can either –

(1) Dig in and tell me in the comments where I’m wrong.  It was initially resolved as ‘by design’, and even got an explanation (sorta), so I might be missing some valid reason for this sorry state of affairs.

(2) Go to the bug page and upvote it. This one really deserves attention from the VC++ team.

But I urge you to do one or the other. Thanks!

[image: red pill or blue pill]

Executing Code Once Per Thread in an OpenMP Loop

Take this toy loop:

#pragma omp parallel for
  for (int i = 0; i < 1000000; ++i)
    DoStuff();

Now suppose you want to run some preparation code once per thread – say, SetThreadName, SetThreadPriority or whatnot. How would you go about that? If you code it before the loop it executes only once (on the master thread), and if you code it inside the loop it executes 1,000,000 times.

Here’s a useful trick: private loop variables run their default constructor once per thread. Just package the action in some type’s constructor, and declare an object of that type as a private loop variable:

struct ThreadInit
{
  ThreadInit()
  {
    SetThreadName("OMP thread");   // the well-known MSDN SetThreadName helper
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
  }
};

...
  ThreadInit ti;

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();

You can set a breakpoint in ThreadInit::ThreadInit() and watch it execute exactly once in each thread. You can also enjoy the view in the threads window, as all OMP threads now have their name and priority adjusted.

[Edit:] Improvement

The original object, created before entering the parallel region, is just a dummy meant for duplication – but it still runs its own constructor and destructor. This might be benign, as in the example above (a redundant setting of thread properties), but in my real-life situation I used the ThreadInit object to create a thread-specific heap – and an extra heap is not an acceptable overhead.

Here’s another trick: as the spec says, a default constructor is called for the object copies in the parallel region. So create the dummy object with a non-default constructor, and make sure the real action happens only in the default one. Here’s one way to do so (you can also code entirely different ctors):

struct ThreadInit
{
  ThreadInit(bool bDummy = false)
  {
    if (!bDummy)
    {
      SetThreadName("OMP thread");
      SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
    }
  }
};

...
  ThreadInit ti(true);   // the dummy, constructed with the non-default ctor

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();