Debug/Release Numerical Differences

We recently had trouble reproducing, in debug builds, the exact numerical behavior of release builds. This happens occasionally, and we’re used to accepting it as some compiler black magic. This time we dug a bit deeper, and I’m pretty confident we now understand all the root causes.

1.    Different code paths

Here’s a very real code snippet:

VECTOR& VECTOR::Normalize()
{
#if defined( _DEBUG )
   D3DXVec3Normalize( (D3DXVECTOR3*)this, (D3DXVECTOR3*)this );
#else
   float norm2 = x * x + y * y + z * z;
   __m128 val = _mm_load_ss( &norm2 );
   // Keep the rsqrt estimate only where EPSILON < norm2, else 0 (mask via _mm_and_ps)
   _mm_store_ss( &norm2,
                 _mm_and_ps( _mm_cmplt_ss( _mm_set_ss( EPSILON ), val ),
                             _mm_rsqrt_ss( val ) ) );

   x *= norm2;
   y *= norm2;
   z *= norm2;
#endif

   return *this;
}

Not only are the execution results numerically different, the _DEBUG path consistently runs 30% faster.
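One reason two such branches can disagree: _mm_rsqrt_ss is a hardware estimate, not an exact reciprocal square root (Intel documents a maximum relative error of 1.5 * 2^-12). Here is a minimal sketch for measuring that gap. The function names are mine, not from our code base, and I have not checked what D3DXVec3Normalize computes internally. On a non-SSE target the estimate simply falls back to the reference computation.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAS_SSE_RSQRT 1
#else
#define HAS_SSE_RSQRT 0
#endif

// Reference value: full-precision sqrt followed by a divide.
inline float InvSqrtExact(float v)
{
    return 1.0f / std::sqrt(v);
}

// Hardware estimate of 1/sqrt(v); only approximate by design.
inline float InvSqrtEstimate(float v)
{
#if HAS_SSE_RSQRT
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(v)));
#else
    return InvSqrtExact(v);  // no SSE here: nothing to compare against
#endif
}

// Relative error of the estimate against the reference value.
inline float RelativeError(float v)
{
    const float exact = InvSqrtExact(v);
    return std::fabs(InvSqrtEstimate(v) - exact) / exact;
}
```

Calling RelativeError on a few squared lengths gives a feel for how many digits the estimate really delivers. The exact estimate is hardware dependent, so the same binary may print different digits on different CPUs.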

2.    /arch

Release builds were compiled with /arch:SSE2 (Project properties -> Configuration properties -> C/C++ -> Code Generation -> Enable Enhanced Instruction Set), while debug builds weren’t.

This means release builds were aware of SSE instructions, and mostly favored them over x87 FPU instructions. Even for scalar computations, SSE is usually faster, but by default it differs in computational precision: roughly, SSE rounds every float operation to 32 bits, while the FPU can carry intermediate results in registers up to 80 bits wide.
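To see how the width of an intermediate result changes the answer, here is a small sketch that computes (a + b) - a twice: once with every step rounded to 32-bit float, and once with a wider intermediate. I use double as a stand-in for the wide FPU register, which is an assumption made for portability, and the function names are hypothetical.

```cpp
// (a + b) - a with every intermediate rounded to a 32-bit float.
inline float NarrowIntermediate(float a, float b)
{
    volatile float sum = a + b;    // volatile forces a real 32-bit store
    volatile float diff = sum - a;
    return diff;
}

// The same expression with the intermediate kept in a wider type.
inline float WideIntermediate(float a, float b)
{
    const double sum = static_cast<double>(a) + b;  // no rounding to float yet
    return static_cast<float>(sum - a);
}
```

With a = 1e8f and b = 1.0f the narrow version returns 0, because 1e8 + 1 is not representable as a float and rounds back to 1e8, while the wide version returns 1.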

3.    Floating point model

Release builds were compiled with /fp:fast, and debug builds with /fp:precise (Project properties -> Configuration properties -> C/C++ -> Code Generation -> Floating Point Model).

That’s a heavier issue, and in some sense the root cause. A thorough survey is here.

In a nutshell: the C++ standard is strict about the order of operations, and about the points of intermediate rounding (from register precision to final stack precision). For example, an expression like ‘a+b+c’ must be evaluated as ‘(a+b)+c’. The result of a+(b+c) can be different, since we’re working with finite-precision floats: to see why, try a = 1.0, b = c = 2^(-24) in single precision, where (a+b)+c gives exactly 1.0 but a+(b+c) gives 1.0 + 2^(-23) (other examples at the link).
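The a = 1.0, b = c = 2^(-24) case can be checked directly. This is a minimal sketch; the struct and function names are made up for illustration, and the volatile intermediates are there only to force rounding to 32-bit float at each step regardless of compiler settings.

```cpp
#include <cmath>

struct SumPair
{
    float leftToRight;  // (a + b) + c
    float regrouped;    // a + (b + c)
};

inline SumPair AssociativityDemo()
{
    const float a = 1.0f;
    const float b = std::ldexp(1.0f, -24);  // 2^-24, half an ulp of 1.0f
    const float c = b;

    volatile float ab = a + b;      // tie, rounds to even: back to 1.0f
    volatile float left = ab + c;   // same tie again: still 1.0f

    volatile float bc = b + c;      // exactly 2^-23, one ulp of 1.0f
    volatile float right = a + bc;  // exactly representable: 1.0f + 2^-23

    return { left, right };
}
```

The two fields differ by one unit in the last place, which is exactly the kind of difference that shows up when the compiler is allowed to regroup.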

When you compile with /fp:fast, you explicitly say “dear compiler, I don’t care about this crap. When you see stuff like ‘a+b+c’, assume I don’t mind how you evaluate it; if I did, I’d add parentheses”. The compiler is then free to make many optimization choices; a comprehensive list is at the link.
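One typical use of that freedom is splitting a sum into several independent partial sums so it can be vectorized. The sketch below does the regrouping by hand, to show what such a transformation does to the result. It illustrates the idea only and is not a dump of what the compiler actually emits. The names are hypothetical, and it assumes float arithmetic is evaluated in 32-bit precision (as with SSE code generation).

```cpp
#include <cstddef>

// Strict left-to-right summation, the order the source code spells out.
inline float SumInOrder(const float* data, std::size_t n)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Four interleaved partial sums, combined at the end: the shape a
// 4-wide vectorized loop would have.
inline float SumFourLanes(const float* data, std::size_t n)
{
    float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; ++k)
            lane[k] += data[i + k];

    float total = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i)  // leftover elements when n is not a multiple of 4
        total += data[i];
    return total;
}
```

Feed both functions the array { 1, e, e, e, e, e, e, e } with e = 2^(-24): the in-order sum stays at 1.0 because each tiny term is rounded away on its own, while the four-lane sum lets the small terms accumulate first and ends up three units in the last place higher.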

When we set release builds to /fp:precise (after the previous two fixes), all remaining numerical differences vanished (well, at least all those I tested). However, in some scenarios I tested, the performance penalty was tangible.

For now we set /fp:fast for all configurations (including debug) in all projects. Sadly, some numerical differences remain (the compiler does make different choices in optimized and non-optimized builds), but they seem to be two orders of magnitude smaller.
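To put a number on how far apart a debug result and a release result are, one option is to count the representable floats between them (the distance in ulps). This is only a sketch of one common technique, not what we actually used for the measurements above. The names are hypothetical, and NaN and infinity inputs are not handled.

```cpp
#include <cstdint>
#include <cstring>

// Map a float to an integer that increases monotonically with its value,
// so that neighbouring floats map to neighbouring integers.
inline std::int64_t OrderedBits(float f)
{
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined bit copy
    return (bits < 0) ? -static_cast<std::int64_t>(bits & 0x7FFFFFFF)
                      : static_cast<std::int64_t>(bits);
}

// Number of representable floats between a and b (0 if they are equal).
inline std::int64_t UlpDistance(float a, float b)
{
    const std::int64_t d = OrderedBits(a) - OrderedBits(b);
    return d < 0 ? -d : d;
}
```

Logging UlpDistance(debugValue, releaseValue) for a handful of key quantities makes it easy to tell harmless last-bit noise from a real divergence.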

Bottom Line

After correcting for these factors, there is a much, much better chance of reproducing release behaviour in debug builds. If there’s a behaviour you’re still unable to reproduce, try compiling both release and debug with /fp:precise and reproducing again. This is the only official way of producing consistent results.

This entry was posted in Debugging, VC++.

4 Responses to Debug/Release Numerical Differences

  1. jia103 says:

    Immediately looking at the different code paths above, I noticed that you have a #else with #ifdef _DEBUG. This is generally not a good coding practice and reminds me of something I once read:

    “… [Debug] code is _extra_ code, not different code. Unless there is a compelling reason not to, you should always execute the ship code… (Maguire, 58).”

    From _Writing Solid Code_ by Steve Maguire. Redmond: Microsoft Press, 1993.

    (Don’t let “Microsoft” throw you off; it’s actually a really good book, and that’s coming from a Linux and OSS user.)

    • Ofek Shilon says:

      Our app is game-like, meaning it is both real time and makes extensive use of 3D graphics. So on one hand we’re cramming inside tons of diagnostic rendering, and on the other hand we fight to sculpt away millisecond fractions from each rendered frame. Thus, in our context we cannot generally concur with this otherwise reasonable advice.
      However, all this has nothing to do with the quoted bug. There is indeed no justification to branching on configuration *in a mathematical library*! It’s most probably some relic of an experiment that someone did and forgot to clean up. I’ve already scanned the neighbouring code to kill any such remaining branches.
      Thanks for the comment!

