Executing Code Once Per Thread in an OpenMP Loop

Take this toy loop:

#pragma omp parallel for
  for (int i = 0; i < 1000000; ++i)
    DoStuff();

Now suppose you want to run some preparation code once per thread – say, SetThreadName, SetThreadPriority or whatnot.  How would you go about that? If you place it before the loop it executes only once (on the encountering thread), and if you place it inside the loop it executes 1,000,000 times.

Here’s a useful trick: private loop variables are default-constructed once per thread. Just package the action in some type’s constructor, and declare an object of that type as a private loop variable:

struct ThreadInit
{
  ThreadInit()
  {
    SetThreadName("OMP thread");
    SetThreadPriority(THREAD_PRIORITY_BELOW_NORMAL);
  }
};

...
  ThreadInit ti;

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();

You can set a breakpoint in ThreadInit::ThreadInit() and watch it execute exactly once in each thread. You can also enjoy the view in the Threads window, as all OMP threads now have their name and priority adjusted.

[Edit:] Improvement

The original object, created before entering the parallel region, is just a dummy meant for duplication, but it still runs its own constructor and destructor.  This might be benign, as in the example above (a redundant setting of thread properties), but in my real-life situation I used the ThreadInit object to create a thread-specific heap, and an extra heap is not an acceptable overhead.

Here’s another trick: as the spec says, a default constructor is called for the per-thread copies in the parallel region. Just create the dummy object with a non-default constructor, and make sure the real action happens only in the default one.  Here’s one way to do it (you could also code entirely separate ctors):

struct ThreadInit
{
  ThreadInit(bool bDummy = false)
  {
    if(!bDummy)
    {
      SetThreadName("OMP thread");
      SetThreadPriority(THREAD_PRIORITY_BELOW_NORMAL);
    }
  }
};

...
  ThreadInit ti(true);

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();
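For completeness, here is a rough sketch of the heap-creating variant mentioned above, combining both tricks. This is my own illustration – it assumes the Win32 HeapCreate/HeapDestroy API, and the member name m_hHeap is hypothetical:

#include <windows.h>

struct ThreadInit
{
  HANDLE m_hHeap;  // hypothetical member: a heap private to this thread

  ThreadInit(bool bDummy = false) : m_hHeap(NULL)
  {
    if (!bDummy)
      m_hHeap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0); // real work happens only in the default-constructed per-thread copies
  }
  ~ThreadInit()
  {
    if (m_hHeap)
      HeapDestroy(m_hHeap); // the dummy never created a heap, so its destructor does nothing
  }
};

...
  ThreadInit ti(true);   // dummy - no heap is created here

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();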

 

On Vector Deleting Destructors and some new/delete internals

A word is due on vector deleting destructors – previously mentioned as the only functions that got weakly bound by the linker. The usual disclaimers apply: everything that follows is my own investigation, in code and online. Nothing here is official in any way, and even if I did get something right, it is subject to change at any time.

 

 

While C’s malloc/free deal purely with memory management, C++’s new/delete do more: they construct/destruct the objects being allocated/deallocated.  (Preemptive nitpick: there are other differences, but they are not the subject of this post.) There is a small family of compiler-generated functions that help achieve these additional tasks:  vector constructor, scalar deleting destructor, vector deleting destructor, and vector ctor/dtor iterators.

The following toy code will be used to illustrate:

struct Whatever
{
	Whatever()  {};
	~Whatever() {};
};

int main(int argc, char* argv[])
{
	Whatever* pW = new Whatever;
	delete pW;

	Whatever* arrW = new Whatever[10];
	delete[] arrW;

	return 0;
}

new

When ‘new Whatever’ is executed, two things happen:

1) Memory is allocated by a call to operator new (which, unless overridden, is essentially a wrapper around plain old malloc);

2) Whatever’s constructor is called.

Proof by a glimpse into unoptimized disassembly:

[disassembly screenshot]
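In source-level terms, the statement is roughly equivalent to the following sketch (simplified: the real generated code also checks the returned pointer and adds exception-handling bookkeeping around the construction):

#include <new>   // placement new

// Rough equivalent of:  Whatever* pW = new Whatever;
Whatever* NewWhatever()
{
  void* pMem = operator new(sizeof(Whatever));  // 1) allocate raw memory
  return new (pMem) Whatever;                   // 2) construct the object in place
}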

delete

When a scalar delete is called on a Whatever pointer, the opposite actions happen in reverse order: first Whatever’s destructor is called, then operator delete (which by default is equivalent to free) frees the now-unpopulated memory.  In this case, however, the compiler does not call ~Whatever() and operator delete directly, but rather generates and invokes a helper function that wraps these two calls. This helper is called the scalar deleting destructor – which makes sense, since it destructs and deletes.  Some more disassembly screenshots:

[disassembly screenshots]
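In rough pseudo-code, the helper looks like this (my own sketch – the real function is compiler-generated, has no source form, and as we’ll see below also takes a flags argument):

// Illustration only - roughly what Whatever::`scalar deleting destructor' does
void ScalarDeletingDestructor(Whatever* p)
{
  p->~Whatever();       // 1) destruct
  operator delete(p);   // 2) free the memory
}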

Why is the new+construction inlined while the delete+destruction is wrapped in a helper?  Beats me. I would have thought that the exact same inlining tradeoff (binary size vs. call overhead) applies in both cases.

new[]  / delete[]

When the vector versions, new[] and delete[], are called, an additional layer is added to iterate over the Whatever object slots and construct/destruct them one at a time.

Enter the ‘vector constructor iterator’ and ‘vector destructor iterator’. In detail:

1) A new[] statement translates into a call to operator new with enough room to hold all the Whatever’s, followed by a call to the ‘eh vector constructor iterator’, which is essentially a for-loop of Whatever::Whatever() calls over the designated array slots (a rough source-level sketch follows below).

2) A delete[] statement translates into a single call to the vector deleting destructor, which in turn calls the ‘eh vector destructor iterator’ and then operator delete.
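Here is the promised source-level sketch of the new[] half. This is my own approximation only – the exact layout of the stored element count is an implementation detail and may vary; exception handling is omitted:

#include <cstddef>
#include <new>

// Rough equivalent of:  Whatever* arrW = new Whatever[n];
Whatever* NewWhateverArray(size_t n)
{
  // allocate room for a stored element count plus the elements themselves
  void* pRaw = operator new[](sizeof(size_t) + n * sizeof(Whatever));
  *static_cast<size_t*>(pRaw) = n;        // the count that delete[] will later read back
  Whatever* arr = reinterpret_cast<Whatever*>(static_cast<size_t*>(pRaw) + 1);
  for (size_t i = 0; i < n; ++i)          // this loop is the 'vector constructor iterator', in essence
    new (arr + i) Whatever;
  return arr;
}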

Being merciful, I won’t hurt your eyes with more disassembly. Just believe me or go dig in yourselves.

Other findings, in no particular order

1) The ‘eh’ prefix in the vector ctor/dtor iterators stands for exception handling. If you compile with C++ exceptions disabled, a non-eh version of the iterators is emitted.  (This has nothing to do with std::nothrow, which controls the behaviour of operator new – a different stage of the object creation.)

2) The deleting destructors, both scalar and vector, are generated as hidden methods of the type Whatever.  All other helper functions (vector constructor, ctor/dtor iterators) are not.   Not sure why, but I suspect it has to do with a supposed need for weak linkage – more on that in a future post.

3) The compiler is smart enough to avoid generating and invoking unneeded helper functions. For example, comment out the coded ctor Whatever::Whatever(), and watch as the vector constructor call vanishes.

4) The vector deleting destructor is unique in that it has some built-in flexibility. Raymond Chen spelled out pseudo-code for it, which I shall shamelessly paste now:

void Whatever::vector deleting destructor(int flags)
{
  if (flags & 2) { // if vector destruct
    size_t* a = reinterpret_cast<size_t*>(this) - 1;
    size_t howmany = *a;
    vector destructor iterator(this, sizeof(Whatever),  // 'this' points at the first array element
      howmany, Whatever::~Whatever);
    if (flags & 1) { // if delete too
      operator delete(a);
    }
  } else { // else scalar destruct
    this->~Whatever(); // destruct one
    if (flags & 1) { // if delete too
      operator delete(this);
    }
  }
}

So from the vector deleting dtor’s viewpoint, memory deallocation is optional, and the same function can serve as a scalar deleting dtor (when (flags & 2) == 0). In practice I have never seen a vector deleting destructor called with ‘flags’ different from 3 (i.e., vector destruction plus deletion).  I can come up with somewhat contrived scenarios where this flexibility might be useful – say, a memory manager that wants to destroy objects but keep the memory for faster future reuse. However, deleting dtors are accessible only to the compiler anyway, so the purpose of this flexibility is not clear to me.  Insights are very welcome.

Setting a Watch on Wide Registers in VS

General-purpose registers can be watched from the watch window pretty much as regular variables:

[watch window screenshot]

(The ‘$’ prefix is optional, but is recommended by MS – probably as a means to minimize clashes with code variables.)

It is less well known that you can set similar watches on SSE registers.  The direct approach doesn’t work:

[watch window screenshot]

– probably since the expression evaluator doesn’t have built-in 128-bit types.  You can, however, set watches on specific portions of a wide register. First, you can watch a single float, with ‘xmm00’-like syntax (the second digit indicates the 32-bit slot to watch):

[watch window screenshot]

Next, you can watch 64-bit portions as doubles, with ‘xmm0dh’-like syntax: ‘d’ stands for double, and l/h selects the low or high half to watch.

[watch window screenshot]

This syntax stopped being documented after VS2003 (!). Up until VS2012 you could also watch 32-bit portions of SSE registers as ints (some instructions use them that way), with ‘xmm0il’-like syntax. This was mentioned in a Connect answer from 2009, but broke in VS2013.  From the VS2003 link it seems that around 2003 you could set similar watches on MMX registers, with ‘mm00’-like syntax; I have never seen it work on any VS version I used.  Maybe there’s similar syntax that enables watches on AVX registers, hiding somewhere?  I don’t even have an AVX-enabled computer to guess on.


Update: the int watch (‘xmm0il’) syntax seems to be broken only for x64 builds.
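If you need something to experiment on, a tiny snippet like this (my own, using SSE2 intrinsics) keeps values in xmm registers; break inside it and try the watch syntaxes above:

#include <emmintrin.h>  // SSE2 intrinsics

double XmmWatchPlayground()
{
  __m128d a = _mm_set_pd(1.5, 2.5);
  __m128d b = _mm_set_pd(3.0, 4.0);
  __m128d prod = _mm_mul_pd(a, b);    // break here and try e.g. $xmm0dl, $xmm0dh, $xmm00
  return _mm_cvtsd_f64(prod);
}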

 

x86/x64 Numerical differences – Correction

In a previous post a truncation scheme was suggested, to circumvent x86/x64 differences in math library implementations:

 
#pragma once 
#include <math.h>

inline double Truncate(double arg) 
{ 
   __int64 &ResInt = reinterpret_cast<__int64&>(arg); 
   ResInt &= 0xFFFFFFFFFFFFFFF8; // set the final 3 bits to zero 
   ResInt |= 4; // estimate the middle of the truncated range 
   double &roundedRes = reinterpret_cast<double&>(ResInt); 
   return roundedRes; 
} 

inline double MyCos(const double ang) 
{ 
   return Truncate(cos(ang)); 
}

#define cos MyCos
...
 

Since then I have accumulated some mileage with the scheme, and have come to understand that this line:

 
...
   ResInt |= 4; // estimate the middle of the truncated range 
...

– is flawed.

Since we drop the last 3 bits (8 ulps) of accuracy, it seemed like a good idea to fill the missing bits with the mid-value of the truncation range – thereby lowering the max truncation error from 7 ulps (truncating b111 to b000) to 4 ulps (truncating b000 to b100).

However, this lowers the average error only if you assume that inputs to these functions are uniformly distributed.

In other words, in real code you are far, far more likely to take cosine of 0 than cosine of 0.00000000000000003, so your average error would be better off if you hold back the |=4 sophistication, and just stick to the 000 suffix.

Even worse, in the wise-ass |=4 version, taking cosine of 0 would give a value slightly larger than 1, thereby potentially causing more subtle numerical difficulties than those saved.
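A quick sketch of mine that demonstrates the point – it applies both variants to cos(0) and prints the full 17 significant digits:

#include <stdio.h>
#include <math.h>

int main()
{
  double c = cos(0.0);                           // exactly 1.0
  __int64 &bits = reinterpret_cast<__int64&>(c);
  bits &= 0xFFFFFFFFFFFFFFF8;                    // plain truncation - still exactly 1.0
  printf("%.17g\n", c);
  bits |= 4;                                     // the 'mid-range' variant - now slightly above 1.0
  printf("%.17g\n", c);                          // prints 1.0000000000000009
  return 0;
}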

All in all, currently my code uses the simple version:

 
#pragma once 
#include <math.h>

inline double Truncate(double arg) 
{ 
   __int64 &ResInt = reinterpret_cast<__int64&>(arg); 
   ResInt &= 0xFFFFFFFFFFFFFFF8; // set the final 3 bits to zero 
   double &roundedRes = reinterpret_cast<double&>(ResInt); 
   return roundedRes; 
}

inline double MyCos(const double ang) 
{ 
   return Truncate(cos(ang)); 
}

#define cos MyCos
...
 

Linker Weak Symbols

C++’s One-Definition-Rule roughly states that

In the entire program, an object or non-inline function cannot have more than one definition; if an object or function is used, it must have exactly one definition.

Which sounds like a good idea – until reality kicks in with all its hairy details.

How, for example, is it possible to override global operator new, or many of the other overridable CRT functions?    If a function is decorated as inline but the optimizer decides not to inline it (a very common scenario), its definition is emitted in multiple translation units.  Can a linker possibly handle that without breaking the ODR?
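To make the first question concrete, here is the kind of definition a user can drop into any single translation unit (a sketch only). The CRT already ships its own definition of this very function, and yet linking is expected to succeed and prefer the user’s version:

#include <cstdlib>
#include <new>

// User-provided replacement for the global operator new the CRT already defines
void* operator new(std::size_t size)
{
  if (void* p = std::malloc(size))
    return p;
  throw std::bad_alloc();
}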

Enter weak symbols. In a nutshell:

During linking, a strong symbol can override a weak symbol of the same name. In contrast, two strong symbols that share a name yield a link error.

A symbol, of course, can be either a function or an extern variable. Unlike (most?) other compilers, VC++ does not expose an explicit way of declaring symbols as weak – but there are two alternatives that come close:

  1. __declspec(selectany), which directs the linker to select just one (any one) of multiple definitions for the symbol and discard the rest (see the sketch after this list). MS explicitly states this as a quasi-answer for not exposing weak references to the programmer, but as a commenter notes this is not satisfying – one could hope to be able to declare a single implementation as *strong*, thus enforcing its selection at build time.
  2. The undocumented linker directive /alternatename (injected via #pragma comment(linker, …)), found in CRT sources and mentioned in this StackOverflow answer.  This one helps mimic a different piece of weak-symbol functionality: initializing the symbol to zero if no definition is found.  This also hardly suffices as a replacement.
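A minimal illustration of option 1 (the names Config and g_config are mine and hypothetical): place this in a header included from multiple translation units, and the linker keeps a single copy instead of reporting a multiply-defined symbol:

// common.h
#pragma once

struct Config { int verbosity; };

// Without __declspec(selectany), a definition in a header included by several
// .cpp files would trigger LNK2005 (symbol multiply defined); with it, the
// linker picks one definition (any one) and discards the rest.
__declspec(selectany) Config g_config = { 1 };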

The VC++ toolchain does use weak symbols internally (i.e., the compiler generates them and the linker consumes them). You can inspect which symbols were treated as weak by running dumpbin /SYMBOLS on an obj file.  Typical output looks like this:

Section length   8C, #relocs    E, #linenums    0, checksum 9CA493CF, selection    5 (pick associative Section 0xA6)
Relocation CRC 4EF609B6
2B8 00000000 SECTAA notype       Static       | __ehfuncinfo$??0MyClass@@QAE@XZ
2B9 00000024 SECTAA notype       Static       | __unwindtable$??0MyClass@@QAE@XZ
2BA 00000000 UNDEF  notype ()    External     | __purecall
2BB 00000000 UNDEF  notype ()    External     | ??_GMyClass@@UAEPAXI@Z (public: virtual void * __thiscall MyClass::`scalar deleting destructor'(unsigned int))
2BC 00000000 UNDEF  notype ()    WeakExternal | ??_EMyClass@@UAEPAXI@Z (public: virtual void * __thiscall MyClass::`vector deleting destructor'(unsigned int))

Note the WeakExternal tag in the last line.
This snippet isn’t entirely random – it demonstrates another problem with choosing not to expose weak linkage to users: what do you do with compiler generated functions?   Stay tuned.

x86/x64 Library Numerical differences

There are many online collections of 64-bit migration pitfalls, but I recently came across two that appear not to be mentioned elsewhere.

First, downright compiler bugs.  We still have those, and some raise their heads only in 64-bit builds.  (BTW, my sincere apologies to Eric Brumer for venting at him like that; he is not to blame for MS’s infuriating support policy.)

Second and more importantly, different implementations of library math functions!

Here are two quick example results from VC++ 2013:

cos(0.37034934158424915):   on 32 bit gives 0.93220096141031161, on 64 bit gives 0.93220096141031150.

cos(0.81476855148534799):   on 32 bit gives 0.68603680609366247, on 64 bit gives 0.68603680609366235.

(In both cases 32 bit was actually closer to the accurate result, but that’s probably a coincidence.)
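If you want to reproduce the comparison, a sketch like this, built and run once as Win32 and once as x64, is enough (%.17g prints the full precision of a double):

#include <stdio.h>
#include <math.h>

int main()
{
  printf("%.17g\n", cos(0.37034934158424915));
  printf("%.17g\n", cos(0.81476855148534799));
  return 0;
}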

This is not the same as the compiler making different decisions on different platforms: the implementations of trigonometric functions were hand-crafted in assembly (at least in 32 bit), and each CRT version knowingly takes different code paths based on exact platform and architecture (sometimes based on run-time processor inspection).

These two examples are the bottom line of a several-day tedious debugging session.  This seemingly negligible difference manifested itself as a ~0.5% difference between results of a numerical optimization routine, in 32 and 64 bit VC++.

While not strictly a bug, this behaviour does make me uncomfortable in several respects.

(1) Judging by some traces I compared during debugging, in ~95% of cases the transcendental function results coincide exactly (to the last digit) on 32 and 64 bit, which makes one assume they were aiming for binary compatibility, and wonder whether the 5% difference is intentional.

(2) Stepping through the x64 implementation, it uses only vanilla SSE instructions, fully accessible to x86 as well. There’s no technical reason preventing the implementations from coinciding.

(3) IEEE-754 underwent a major overhaul in 2008, and the new version includes a much-needed clause – still phrased as a recommendation. Quoting Wikipedia:

…it recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.

I was hoping /fp:precise would have such an effect, but apparently it doesn’t.  As far as I can tell, today the only way of achieving such reproducibility is by hand-crafting your own function implementations.

Or if, like me, you can live without the last digits of precision, you can just make do without them.  I now include the following code in every file that uses trig/inverse-trig functions.
[Edit: see a fix in a newer post.]
[Edit 2: Thanks to my colleague Nadav Ben Abir for the conversation that eventually led to this code.]

// TruncatedFuncs.h

#pragma once
#include <math.h>

inline double Truncate(double arg)
{
	__int64 &ResInt = reinterpret_cast<__int64&> (arg);
	ResInt &= 0xFFFFFFFFFFFFFFF8;  // set the final 3 bits to zero
	ResInt |= 4;   // estimate the middle of the truncated range
	double &roundedRes = reinterpret_cast<double&> (ResInt);

	return roundedRes;
}

inline double MyCos(const double ang)
{
	return Truncate(cos(ang));
}

inline double MySin(const double ang)
{
	return Truncate(sin(ang));
}

inline double MyTan(const double ang)
{
	return Truncate(tan(ang));
}

inline double MyAcos(const double ang)
{
	return Truncate(acos(ang));
}

inline double MyAsin(const double ang)
{
	return Truncate(asin(ang));
}

inline double MyAtan2(const double y, const double x)
{
	return Truncate(atan2(y, x));
}

#define cos MyCos
#define sin MySin
#define tan MyTan
#define acos MyAcos
#define asin MyAsin
#define atan2 MyAtan2

Solving rxinput.dll Crashes

Recently I started getting hit by access violations with weird call stacks.

It seems an exception is thrown early in the life of some thread, during the initialization routine of rxinput.dll – a dll I had never heard of. Naïve googling taught me pretty much nothing (except that this dll had already caused at least one more headache).

The dll’s location on disk was Program Files\NVIDIA corporation\NvStreamSvr, which gave a lead for some finer-grained googling: it turns out this is part of a 2013 nVidia project called SHIELD, intended to stream games from a PC to other screens within Wi-Fi range. Sounds interesting – never heard of that either.

The nVidia streamer is managed by a system service:

…and so can be shut down. That didn’t help – rxinput.dll somehow kept getting injected into my executable and crashing it (I saw detoured.dll in the same folder as rxinput.dll, so most probably some Win32 API was already hooked; in hindsight perhaps a restart would have helped). After disabling the service I was able to rename or delete rxinput.dll and the crashing stopped, but I had no way of telling what component was destabilized in the process.

What eventually did the trick was uninstalling nVidia’s GeForce Experience – which is not only unsolicited (it was probably bundled with some GPU driver), but also still in beta. The entire NvStreamSvr folder is now gone, along with the NVIDIA Streamer and Update Daemon services – and with them, my executable’s crashes.

Hope this saves someone out there some trouble.