x86/x64 Numerical differences – Correction

In a previous post a truncation scheme was suggested, to circumvent x86/x64 differences in math library implementations:

 
#pragma once 
#include <math.h>

inline double Truncate(double arg) 
{ 
   __int64 &ResInt = reinterpret_cast<__int64&> (arg); 
   ResInt &= 0xFFFFFFFFFFFFFFF8, // set the final 3 bits to zero 
   ResInt |= 4; // estimate the middle of the truncated range 
   double &roundedRes = reinterpret_cast (ResInt); 
   return roundedRes; 
} 

inline double MyCos(const double ang) 
{ 
   return Truncate(cos(ang)); 
}

#define cos MyCos
...
 

Since then, I accumulated some mileage with the scheme and have come to understand that line 8:

 
...
   ResInt |= 4; // estimate the middle of the truncated range 
...

-is flawed.

Since we drop the last 3 bits (8 ulps) of accuracy, it seemed like a good to fill the missing bits with the mid-value of the truncation range – thereby lowering the max truncation error from 7 ulps (truncating b111 to b000) to 4 ulps (truncating b000 to b100).

However, this lowers the average error only if you assume that inputs to these functions are uniformly distributed.

In other words, in real code you are far, far more likely to take cosine of 0 than cosine of 0.00000000000000003, so your average error would be better off if you hold back the |=4 sophistication, and just stick to the 000 suffix.

Even worse, in the wise-ass |=4 version, taking cosine of 0 would give a value slightly larger than 1, thereby potentially causing more subtle numerical difficulties than those saved.

All in all, currently my code uses the simple version:

 
#pragma once 
#include <math.h>

inline double Truncate(double arg) 
{ 
   __int64 &ResInt = reinterpret_cast<__int64&> (arg); 
   ResInt &= 0xFFFFFFFFFFFFFFF8; // set the final 3 bits to zero 
   double &roundedRes = reinterpret_cast (ResInt); 
   return roundedRes; 
}

inline double MyCos(const double ang) 
{ 
   return Truncate(cos(ang)); 
}

#define cos MyCos
...
 
Advertisements
This entry was posted in Algorithms, C++, VC++. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s