We recently had trouble reproducing, in debug builds, the exact numerical behavior of release builds. This happens occasionally, and we’re used to accepting it as compiler black magic. This time we dug a bit deeper, and I’m pretty confident we understood all the root causes.
1. Different code paths
Here’s a very real code snippet:
VECTOR& VECTOR::Normalize()
{
#if defined( _DEBUG )
    D3DXVec3Normalize((D3DXVECTOR3*)this, (D3DXVECTOR3*)this);
#else
    float norm2 = x * x + y * y + z * z;
    __m128 val = _mm_load_ss( &norm2 );
    _mm_store_ss( &norm2,
        _mm_and_ps(
            _mm_cmplt_ss( _mm_set_ss( EPSILON ), val ), _mm_rsqrt_ss( val )
        ) );
    x *= norm2;
    y *= norm2;
    z *= norm2;
#endif
    return *this;
}
Not only are the execution results numerically different; the _DEBUG path also consistently runs 30% faster.
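To get a feel for how far the two branches can drift apart, here is a minimal sketch that compares the approximate reciprocal square root used in the release branch with a plain 1/sqrt. It assumes the debug path computes something close to the exact value, which the snippet above does not confirm. The function names and the HAVE_SSE_RSQRT macro are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical feature switch for this sketch: use the intrinsic only
// where the compiler tells us SSE is available.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1
#else
#define HAVE_SSE_RSQRT 0
#endif

// Reference path: square root followed by a division.
float inv_sqrt_exact(float v) {
    return 1.0f / std::sqrt(v);
}

// Approximate path: the same instruction the release branch relies on.
float inv_sqrt_approx(float v) {
#if HAVE_SSE_RSQRT
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(v)));
#else
    return inv_sqrt_exact(v);  // no SSE here: fall back to the reference path
#endif
}

// Relative difference between the two paths for one positive input.
float inv_sqrt_relative_gap(float v) {
    assert(v > 0.0f);
    const float exact = inv_sqrt_exact(v);
    return std::fabs(inv_sqrt_approx(v) - exact) / exact;
}
```

Intel documents the relative error of the rsqrt instruction as at most 1.5 * 2^-12, so the gap is small, but it is many orders of magnitude larger than a single rounding error and easily visible when comparing builds.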
2. /arch
Release builds were compiled with /arch:SSE2 (Project properties -> Configuration properties -> C/C++ -> Code Generation -> Enable enhanced instruction set), while debug builds weren’t.
This means release builds were aware of SSE instructions and mostly favored them over x87 FPU instructions. Even for scalar computations SSE is usually faster, but by default it differs in computational precision: roughly, SSE rounds every single-precision result to 32 bits, while the FPU keeps intermediate values in registers up to 80 bits wide.
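The effect of intermediate precision can be imitated without touching compiler flags. The sketch below is only an analogy: it uses a double accumulator as a stand-in for the wider FPU registers, and the function names are invented for the example.

```cpp
#include <cassert>

// Narrow accumulation: every partial sum is rounded to a 32-bit float.
// The volatile forces that rounding even on targets that would otherwise
// keep the running sum in a wider register.
float sum_narrow(float value, int count) {
    assert(count >= 0);
    volatile float acc = 0.0f;
    for (int i = 0; i < count; ++i) {
        acc = acc + value;
    }
    return acc;
}

// Wide accumulation: partial sums are kept in double and rounded to
// float only once, at the very end.
float sum_wide(float value, int count) {
    assert(count >= 0);
    double acc = 0.0;
    for (int i = 0; i < count; ++i) {
        acc += value;
    }
    return static_cast<float>(acc);
}
```

Calling both with, say, value = 0.1f and count = 1000000 gives visibly different results, even though the source-level arithmetic is "the same". That is the kind of gap you get when one build rounds at every step and the other doesn't.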
3. Floating point model
Release builds were compiled with /fp:fast, and debug builds with /fp:precise (Project properties -> Configuration properties -> C/C++ -> Code Generation -> Floating point model).
That’s a heavier issue, and in some sense the root cause. A thorough survey is here.
In a nutshell: the C++ standard is strict about the order of operations and about the points of intermediate rounding (from register precision to final stack precision). For example, an expression like ‘a+b+c’ must be evaluated as ‘(a+b)+c’. The result of ‘a+(b+c)’ is usually different, since we’re working with finite-precision floats. To see why, try a = 1.0, b = c = 2^(-24) in single precision: (a+b)+c rounds back to 1.0 at each step, while a+(b+c) gives 1.0 + 2^(-23) (other examples at the link).
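The a = 1.0, b = c = 2^(-24) case is easy to try out. Here is a small sketch, assuming IEEE single-precision floats with the default round-to-nearest mode and a build without fast-math style flags; the helper names are invented for the example.

```cpp
#include <cmath>

// Left-to-right grouping, as the source order dictates: (a + b) + c.
float sum_left(float a, float b, float c) {
    volatile float t = a + b;  // force the intermediate to be rounded to float
    return t + c;
}

// The regrouped form a compiler may choose under /fp:fast: a + (b + c).
float sum_right(float a, float b, float c) {
    volatile float t = b + c;  // force the intermediate to be rounded to float
    return a + t;
}

// The example from the text: returns true if the grouping changes the result.
bool grouping_changes_result() {
    const float tiny = std::ldexp(1.0f, -24);  // 2^-24
    return sum_left(1.0f, tiny, tiny) != sum_right(1.0f, tiny, tiny);
}
```

With the left grouping each tiny term is rounded away on its own; with the right grouping the two tiny terms first add up to 2^(-23), which is large enough to survive being added to 1.0.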
When you compile with /fp:fast, you explicitly say “dear compiler, I don’t care about this crap. When you see stuff like ‘a+b+c’, assume I don’t mind how you evaluate it. If I did, I’d add parentheses”. The compiler is then free to make many optimization choices; a comprehensive list is at the link.
When we set /fp:precise for release builds (after the previous two fixes), all remaining numerical differences vanished (well, at least all those I tested). However, in some scenarios I tested, the performance penalty was tangible.
For now we set /fp:fast for all configurations (including debug) in all projects. Sadly, some numerical differences remain (the compiler does make different choices in optimized and non-optimized builds), but they seem to be two orders of magnitude smaller.
Bottom Line
After correcting for these factors, there is a much, much better chance of reproducing release behaviour in debug builds. If there’s a behaviour you’re still unable to reproduce, try compiling both release and debug with /fp:precise and reproducing it again. This is the only official way of producing consistent results.
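When chasing a mismatch like this, it helps to confirm that both configurations were really built with the settings you think they were. A possible sketch, relying on the predefined macros MSVC documents for the floating point model (_M_FP_FAST, _M_FP_PRECISE, _M_FP_STRICT) and for the x86 /arch level (_M_IX86_FP); the function name and the output format are invented for the example, and on other compilers it just reports "unknown".

```cpp
#include <string>

// Describe the floating-point related settings this translation unit
// was compiled with, so debug and release logs can be compared.
std::string fp_build_config() {
    std::string out;

    // Floating point model (/fp:...).
#if defined(_M_FP_FAST)
    out += "fp:fast";
#elif defined(_M_FP_STRICT)
    out += "fp:strict";
#elif defined(_M_FP_PRECISE)
    out += "fp:precise";
#else
    out += "fp:unknown";
#endif

    // Instruction set level (/arch:...), only defined for x86 targets:
    // 0 = no SSE, 1 = SSE, 2 = SSE2 or above.
#if defined(_M_IX86_FP)
    out += " arch-level:" + std::to_string(_M_IX86_FP);
#else
    out += " arch-level:n/a";
#endif

    // Which side of the #if defined( _DEBUG ) code paths we are on.
#if defined(_DEBUG)
    out += " debug";
#else
    out += " non-debug";
#endif

    return out;
}
```

Logging this string once at startup in every configuration makes it obvious when a project quietly falls back to a different /fp or /arch setting.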
