x86/x64 Numerical differences – Correction

In a previous post a truncation scheme was suggested, to circumvent x86/x64 differences in math library implementations:

 
#pragma once 
#include <math.h>

inline double Truncate(double arg) 
{ 
   __int64 &ResInt = reinterpret_cast<__int64&> (arg); 
   ResInt &= 0xFFFFFFFFFFFFFFF8; // set the final 3 bits to zero 
   ResInt |= 4; // estimate the middle of the truncated range 
   double &roundedRes = reinterpret_cast<double&> (ResInt); 
   return roundedRes; 
} 

inline double MyCos(const double ang) 
{ 
   return Truncate(cos(ang)); 
}

#define cos MyCos
...
 

Since then, I have accumulated some mileage with the scheme, and have come to understand that line 8:

 
...
   ResInt |= 4; // estimate the middle of the truncated range 
...

– is flawed.

Since we drop the last 3 bits of accuracy (a range of 8 ulps), it seemed like a good idea to fill the missing bits with the mid-value of the truncation range – thereby lowering the maximal truncation error from 7 ulps (truncating b111 to b000) to 4 ulps (truncating b000 to b100).

However, this lowers the average error only if you assume that inputs to these functions are uniformly distributed.

In other words, in real code you are far, far more likely to take the cosine of 0 than the cosine of 0.00000000000000003, so your average error is better off if you hold back the |=4 sophistication and just stick to the 000 suffix.

Even worse, in the wise-ass |=4 version, taking the cosine of 0 gives a value slightly larger than 1 – thereby potentially causing more subtle numerical difficulties than those it saves.

All in all, currently my code uses the simple version:

 
#pragma once 
#include <math.h>

inline double Truncate(double arg) 
{ 
   __int64 &ResInt = reinterpret_cast<__int64&> (arg); 
   ResInt &= 0xFFFFFFFFFFFFFFF8; // set the final 3 bits to zero 
   double &roundedRes = reinterpret_cast<double&> (ResInt); 
   return roundedRes; 
}

inline double MyCos(const double ang) 
{ 
   return Truncate(cos(ang)); 
}

#define cos MyCos
...
 

Linker Weak Symbols

C++’s One-Definition-Rule roughly states that

In the entire program, an object or non-inline function cannot have more than one definition; if an object or function is used, it must have exactly one definition.

Which sounds like a good idea – until reality kicks in with all its hairy details.

How, for example, is it possible to overload global operator new, or many other overloadable CRT functions?  If a function is decorated as inline but the optimizer decides not to inline it (a very common scenario), its definition is included in multiple translation units.  Can a linker possibly handle that without breaking the ODR?

Enter weak symbols. In a nutshell:

During linking, a strong symbol can override a weak symbol of the same name. In contrast, two strong symbols that share a name yield a link error.

A symbol, of course, can be either a function or an extern variable. Unlike (most?) other compilers, VC++ does not expose an explicit way of declaring symbols as weak – but there are two alternatives that come close:

  1. __declspec(selectany), which directs the linker to select just one (any one) of multiple definitions for the symbol and discard the rest. MS explicitly state this as a quasi-answer for not exposing weak references to the programmer, but as a commenter notes this is not satisfying – one could hope to be able to declare a single implementation as *strong*, thus enforcing its selection at build time.
  2. The undocumented /alternatename linker directive (applied via #pragma comment(linker, …)), found in CRT sources and mentioned in this StackOverflow answer.  It helps mimic a different piece of weak-symbol functionality: resolving the symbol to zero if no definition is found.  This too hardly suffices as a replacement.

The VC++ toolchain does use weak symbols internally (i.e., the compiler generates them and the linker consumes them). You can inspect which symbols were treated as weak by running dumpbin /SYMBOLS on an obj file.   Typical output would be -

Section length   8C, #relocs    E, #linenums    0, checksum 9CA493CF, selection    5 (pick associative Section 0xA6)
Relocation CRC 4EF609B6
2B8 00000000 SECTAA notype       Static       | __ehfuncinfo$??0MyClass@@QAE@XZ
2B9 00000024 SECTAA notype       Static       | __unwindtable$??0MyClass@@QAE@XZ
2BA 00000000 UNDEF  notype ()    External     | __purecall
2BB 00000000 UNDEF  notype ()    External     | ??_GMyClass@@UAEPAXI@Z (public: virtual void * __thiscall MyClass::`scalar deleting destructor'(unsigned int))
2BC 00000000 UNDEF  notype ()    WeakExternal | ??_EMyClass@@UAEPAXI@Z (public: virtual void * __thiscall MyClass::`vector deleting destructor'(unsigned int))

Note the WeakExternal tag in the last line.
This snippet isn’t entirely random – it demonstrates another problem with choosing not to expose weak linkage to users: what do you do with compiler-generated functions?   Stay tuned.

x86/x64 Library Numerical differences

There are many online lists of 64-bit migration pitfalls, but I recently came across two that appear not to be mentioned elsewhere.

First, downright compiler bugs.  We still have those, and some raise their heads only in 64 bit.  (btw – my sincere apologies to Eric Brumer for venting out over him like that. He is not to blame for MS’s infuriating support policy).

Second and more importantly, different implementations of library math functions!

Here are two quick example results from VC++ 2013:

cos(0.37034934158424915),   on 32 gives 0.93220096141031161, on 64 gives 0.93220096141031150.

cos(0.81476855148534799),   on 32 gives 0.68603680609366247, on 64 gives 0.68603680609366235.

(In both cases 32 was actually closer to the accurate result – but that’s probably a coincidence.)

This is not the same as the compiler making different decisions on different platforms: the implementations of the trigonometric functions were hand-crafted in assembly (at least in 32 bit), and each CRT version knowingly takes different code paths based on the exact platform and architecture (sometimes based on run-time processor inspection).

These two examples are the bottom line of several days of tedious debugging.  This seemingly negligible difference manifested itself as a ~0.5% difference between the results of a numerical optimization routine in 32- and 64-bit VC++.

While not strictly a bug, this behaviour does make me uncomfortable in several aspects.

(1) Judging by some traces I compared during debugging, in ~95% of cases the transcendental functions coincide exactly (to the last digit) on 32 and 64 bit. This makes one assume binary compatibility was the aim, and wonder whether the remaining ~5% of differences are intentional.

(2) Stepping through the x64 implementation, it makes use of only vanilla SSE instructions, fully accessible to x86. There’s no technical reason preventing the implementations from coinciding.

(3) IEEE-754 underwent a major overhaul in 2008, and the new version includes a much-needed clause – still phrased as a recommendation. Quoting Wikipedia:

…it recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.

I was hoping /fp:precise would have such an effect, but apparently it doesn’t.  As far as I can tell, today the only way of achieving such reproducibility is by hand-crafting your own function implementations.

Or if, like me, you can live without the last digits of precision, you can just make do without them.  I now include the following code in every file that uses trig/inverse-trig functions.
[Edit: see a fix in a newer post.]

// TruncatedFuncs.h

#pragma once
#include <math.h>

inline double Truncate(double arg)
{
	__int64 &ResInt = reinterpret_cast<__int64&> (arg);
	ResInt &= 0xFFFFFFFFFFFFFFF8;  // set the final 3 bits to zero
	ResInt |= 4;   // estimate the middle of the truncated range
	double	&roundedRes = reinterpret_cast<double&> (ResInt);

	return roundedRes;
}

inline double MyCos(const double ang)
{
	return Truncate(cos(ang));
}

inline double MySin(const double ang)
{
	return Truncate(sin(ang));
}

inline double MyTan(const double ang)
{
	return Truncate(tan(ang));
}

inline double MyAcos(const double ang)
{
	return Truncate(acos(ang));
}

inline double MyAsin(const double ang)
{
	return Truncate(asin(ang));
}

inline double MyAtan2(const double y, const double x)
{
	return Truncate(atan2(y, x));
}

#define cos MyCos
#define sin MySin
#define tan MyTan
#define acos MyAcos
#define asin MyAsin
#define atan2 MyAtan2

Solving rxinput.dll Crashes

Recently I started getting hit by access violations with weird stacks:

It seems an exception is thrown early in the life of some thread, during the initialization routine of rxinput.dll – a dll I had never heard of. Naïve googling taught me pretty much nothing (except that this dll had already caused at least one more headache).

The dll’s location on disk was Program Files\NVIDIA corporation\NvStreamSvr, which gave a lead for some finer-grained googling: it turns out this is part of a 2013 nVidia project called SHIELD, intended to stream games from a PC to other screens within Wi-Fi range. Sounds interesting – never heard of that either.

The nVidia streamer is managed by a system service:

…and so can be shut down. That didn’t help – rxinput.dll somehow kept getting injected into my executable and crashing it (I saw detoured.dll in the same folder as rxinput.dll – most probably some win32 API was already hooked; in hindsight, perhaps a restart would have helped). After disabling the service I was able to rename or delete rxinput and the crashing stopped – but I had no way of telling which component was destabilized in the process.

What eventually did the trick was uninstalling nVidia’s GeForce Experience – which is not only unsolicited (it was probably bundled with some GPU driver), but is also still in beta. The entire NvStreamSvr folder is now gone, along with the NVIDIA Streamer and Update Daemon services – and with them, my executable crashes.

Hope this saves someone out there some trouble.

Another Look at the VS2012 Auto Vectorizer

A while ago I did some experimenting with the (then beta) VS2012. After these experiments our team migrated to the 2012 IDE but kept to the 2010 toolset. Since then much has happened: an official VS2012 launch + 4 updates, rather thorough documentation and quite a few online talks by the compiler team. It was high time to take another look at the 2012 toolset.

What I care about

Our scenario is somewhat unpleasant, but I suspect very typical in the enterprise world: we have a large (~800K LOC) C++ code base, with some legacy niches that date back 15+ years. The code is computationally intensive and sensitive to performance and yet extremely rich in branch logic. While C++ language advances are nice, what I care about are backend improvements – and specifically, auto vectorization.

Why you should care too

In the last decade or so, virtually 100% of the progress in the x86/x64 ISA was made in vector units on the processor*. The software side, however, is veeeery slow to catch up: to this day, making any use of SSE/AVX processor treats requires non-standard, non-portable, hard, low-level tweaks – which make economic sense only in specific market niches (say, video editing & 3D). Moreover, as the years go by, even in these niches such costly tweaks make less and less sense, as the effort is probably better invested in moving execution to the GPU.

If you think about this for a second – our industry is in a somewhat ridiculous state. For over 10 years, the leading general-purpose processor architecture is evolving in a direction that just isn’t useful to most software it runs!

This is exactly where auto-vectorization ought to come in.
 
To make use of all these virgin, luscious silicon fields, C++ compilers need to be extra clever. They must be able to reason about code – without any help from the code itself (well, not yet) – and automatically translate execution to SIMD units where possible. This is a tough job, and AFAIK until recently only Intel’s compiler was somewhat up to the task (I passionately hate Intel’s compiler for different reasons, but that’s beside the point now). Only towards VS2012 did MS decide to try to catch up and add decent vectorization support, and if this effort succeeds – I truly believe it can revolutionize the SW industry, no less.

So, I dedicated two afternoons to building one of our products with VS2012’s latest public bits, with /Qvec-report:2, and tried to understand how much of the vectorization potential is fulfilled.

Well, are we there yet?

No.

Only a negligible percentage of loops were vectorized successfully. A random inspection of the reported vectorization failures shows that almost all of them are not due to real syntactic reasons. I created Connect reports for some of the bogus failures – here are the technical details.

1. Copy operations are pretty much never vectorized. Even MS internal STL code -

template<class _OutIt, class _Diff, class _Ty> inline
_OutIt _Fill_n(_OutIt _Dest, _Diff _Count, const _Ty& _Val)
{   // copy _Val _Count times through [_Dest, ...)
    for (; 0 < _Count; --_Count, ++_Dest)
        *_Dest = _Val;
    return (_Dest);
}

- fails to vectorize, with the reported reason being the generic 500. Even worse, the following snippet from MSDN itself:

void code_1300(int *A, int *B)
{
    // Code 1300 is emitted when the compiler detects that there is
    // no computation in the loop body.

    for (int i=0; i<1000; ++i)
    {
        A[i] = B[i]; // Do not vectorize, instead emit memcpy
    }
}

- fails to emit memcpy, as the comment says it should. Eric Brumer responds that they are indeed working on this very problem.

2. Vectorization decisions by the compiler are extremely sensitive to unrelated code changes. Eric Brumer’s investigation – in the connect link – shows (IIUC) that vectorization decisions depend on previously made inlining decisions, which is what makes them highly fragile and undependable. The reported reasons for failures in these cases seem outright random. Again, they are working on it.

3. new-operator declaration syntax hides the fact that allocated buffers can be safely considered non-aliased. This might be for valid legal reasons, but comes at a formidable price: the entire alias-analysis done by the compiler is severely crippled, and major optimization opportunities (vectorization being just one) are thereby missed. Stephan Lavavej reports, quote, ‘the compiler back-end team has an active work item to investigate this’.

4. This one is just a quibble, really: it turns out the report message ’1106 Inner loop is already vectorized. Cannot also vectorize the outer loop’ is misleading. Outer loops are never vectorized, regardless of inner-loop vectorization.

5. About vectorizer report ’1303 Too few loop iterations for vectorization to provide value’: this pretty much rules out any optimization of 2D/3D loops. Now, for one, Intel made a significant investment in a short-vector math library (intrinsic to the compiler) to harvest potential speedups in just such cases. Second, there are quite a few hand-vectorized 3D libraries out there, so it seems others have also concluded this is a worthy optimization. So while I don’t have decisive quantitative data – I find this reported reason highly suspicious.

I came by more bumps and weird behaviour, but decided not to investigate any deeper than this.

Bottom Line

People infinitely smarter than me are investing tremendous effort in vectorization technology, and I’m sure it will grow to be impressive. That being said, marketing divisions inevitably work much faster than R&D – and it seems VS2012 is just the start of a ramp-up stage.

For us, a VS upgrade is a major hassle. It involves cross-team coordination, chasing third-party vendors for new builds, and the unavoidable ironing out of the various wrinkles that come with any migration. I just can’t justify this hassle with any tangible added value for us**. We’ll most likely ‘go Vista’ on VS2012, and just skip it quietly.

I’m anxious to return and test the vectorizer again – but only after some service packs*** for VS2013 are out****.
__________________________________________________

* That’s not to say no other significant progress was made in the processor space itself – transistors got tiny, caches got enormous, memory controllers and graphics processors were integrated in, etc. etc. I am saying that practically all the architectural innovations of the last decade were SIMD extensions.

** Well, there is this undocumented goodie that holds some very tangible added value for us, but probably not enough. More on that some day.

*** I know! updates, updates.

**** Footnotes are fun. Just saying.

Find Where Types are Passed by Value

Say you’re working on a large code base, and you come across several instances where some type of non-negligible size is passed by value as an argument – where it would be more efficient to pass by const reference. Fixing a few occurrences is easy, but how can you efficiently list all the places throughout the code where this happens?

Here’s a trick: define the type as aligned, and rebuild. The compiler will now shout exactly where the type is passed by value, since aligned types cannot be passed that way.

Double-clicking every such error takes you immediately to where a fix is probably in order.

Discovering Which Projects Depend on a Project – II

In a previous post I shared a hack that enables detecting all projects that depend on a given one, either directly or indirectly. @Matt asks by mail whether I can suggest a quick way to isolate only the direct dependencies.

Well, as a matter of fact I can, but it is even uglier than the original hack. First, delete the project of interest from the solution:

Then note which of the other projects have changed. This can be as easy as noting a small ‘v’ beside them in the solution explorer, indicating a check-out:

Turns out that deleting a project (unlike, say, unloading it) chases it down in the references of all other sibling projects in the solution, and removes it from their references where present. This in turn changes those project files, which is easy to spot visually. Sibling projects which refer to the project only indirectly still hold references to the intermediate projects – and so are left unchanged. Therefore this hack isolates only direct references.

Of course, don’t forget to undo these changes immediately afterwards.

Altogether, these hacks are mighty hackish. If you find yourself caring about dependency management more than once or twice, just go get some tool.