Take this toy loop:
#pragma omp parallel for
for (int i = 0; i < 1000000; ++i)
    DoStuff();
Now suppose you want to run some preparation code once per thread, say SetThreadName or SetThreadPriority or whatnot. How would you go about that? Code placed before the loop executes only once, on the calling thread, and code placed inside the loop executes 1000000 times.
Here’s a useful trick: private loop variables run their default constructor once per thread. Each thread gets its own copy of a variable listed in a private clause, and that copy is default-constructed when the thread enters the parallel region. Just package the action in some type’s constructor, and declare an object of that type as a private loop variable:
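struct ThreadInit
{
    ThreadInit()
    {
        // SetThreadName is assumed to be a user-defined helper, not a Win32 API.
        SetThreadName("OMP thread");
        // The Win32 SetThreadPriority takes the thread handle as its first argument.
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
    }
};
...
ThreadInit ti;
#pragma omp parallel for private(ti)
for (int i = 0; i < 1000000; ++i)
    DoStuff();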
You can set a breakpoint in ThreadInit::ThreadInit() and watch it execute exactly once in each thread. You can also enjoy the view in the Threads window, where all OMP threads now have their name and priority adjusted.
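If you would rather check this without a debugger, here is a minimal portable sketch. It swaps the Windows calls for a counter; g_ctorCalls, g_iterations and CountPrivateCopies are names made up for this illustration, not part of the original code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counters, used only to observe what the runtime does.
static std::atomic<int> g_ctorCalls{0};
static std::atomic<int> g_iterations{0};

struct ThreadInit {
    // Stand-in for the real per-thread setup (thread name, priority, ...).
    ThreadInit() { ++g_ctorCalls; }
};

void DoStuff() { ++g_iterations; }

// Runs the loop and returns how many private copies were default-constructed.
int CountPrivateCopies(int iterations) {
    assert(iterations >= 0);
    ThreadInit ti;  // the original object, constructed once on the calling thread
    const int before = g_ctorCalls.load();
    #pragma omp parallel for private(ti)
    for (int i = 0; i < iterations; ++i)
        DoStuff();
    return g_ctorCalls.load() - before;  // constructions that happened in the region
}
```

In a build with OpenMP enabled (for example -fopenmp or /openmp), CountPrivateCopies(1000000) should return the number of threads in the team, far fewer than the iteration count. In a build without OpenMP the pragma is ignored, no copy is made, and the function returns 0.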
[Edit:] Improvement
The original object, created before entering the parallel region, is just a dummy meant for duplication, but it still runs its own constructor and destructor. This might be benign, as in the example above (the thread properties are redundantly set once more), but in my real-life situation I used the ThreadInit object to create a thread-specific heap, and an extra heap is not an acceptable overhead.
Here’s another trick: as the spec says, a default constructor is called for the object copies in the parallel region. Just create the dummy object with a non-default constructor, and make sure the real action happens only in the default one. Here’s one way to do so (you can also code different ctors altogether):
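struct ThreadInit
{
    ThreadInit(bool bDummy = false)
    {
        if (!bDummy)
        {
            // SetThreadName is assumed to be a user-defined helper, not a Win32 API.
            SetThreadName("OMP thread");
            // The Win32 SetThreadPriority takes the thread handle as its first argument.
            SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
        }
    }
};
...
ThreadInit ti(true);
#pragma omp parallel for private(ti)
for (int i = 0; i < 1000000; ++i)
    DoStuff();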
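To see that the dummy really skips the work, here is a small portable sketch along the same lines. The counter g_realInits, the InitCounts struct and RunWithDummy are hypothetical names for this illustration; the counter increment stands in for the expensive setup (the heap, in my case).

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> g_realInits{0};  // hypothetical: counts real setups
static std::atomic<int> g_iterations{0};

struct ThreadInit {
    // Still a default constructor: it can be called with no arguments.
    ThreadInit(bool bDummy = false) {
        if (!bDummy)
            ++g_realInits;  // stand-in for the expensive per-thread setup
    }
};

void DoStuff() { ++g_iterations; }

struct InitCounts {
    int afterDummy;  // real setups triggered by the dummy object
    int afterLoop;   // real setups performed once the loop has finished
};

InitCounts RunWithDummy(int iterations) {
    assert(iterations >= 0);
    const int start = g_realInits.load();
    ThreadInit ti(true);  // dummy: bDummy is true, so no setup runs here
    InitCounts counts;
    counts.afterDummy = g_realInits.load() - start;
    #pragma omp parallel for private(ti)
    for (int i = 0; i < iterations; ++i)
        DoStuff();
    counts.afterLoop = g_realInits.load() - start;
    return counts;
}
```

afterDummy stays at 0 in every build. With OpenMP enabled, afterLoop should equal the number of threads in the team, since each private copy goes through the default-argument path.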
Hey Ofek
Have u tested that vs. boost::thread_specific_ptr ?
I experimented with direct TLS storage (“__declspec(thread)”), which I believe thread_specific_ptr wraps. It is more cumbersome and does not directly execute code once per thread – you’d need to do something like atomically test the value of the TLS slot, act on it and update it *in every loop iteration* – which is exactly what I’m trying to avoid. Do you see another usage for boost::thread_specific_ptr here?
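For comparison, this is roughly what the TLS route looks like. It is only a sketch: it uses the standard C++11 thread_local keyword as a stand-in for __declspec(thread), and t_ready, g_setups, EnsureThreadInit and RunWithTlsFlag are hypothetical names. Since the flag is per thread, a plain bool is enough here, but the check still runs in every iteration.

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> g_setups{0};  // hypothetical: counts setups performed

// One flag per thread, initialized to false for every new thread.
static thread_local bool t_ready = false;

// Has to be called at the top of every iteration; only the first call on
// each thread does the actual work.
inline void EnsureThreadInit() {
    if (!t_ready) {
        t_ready = true;
        ++g_setups;  // stand-in for the real per-thread setup
    }
}

int RunWithTlsFlag(int iterations) {
    assert(iterations > 0);
    const int before = g_setups.load();
    #pragma omp parallel for
    for (int i = 0; i < iterations; ++i) {
        EnsureThreadInit();  // tested again on each of the iterations
        // DoStuff() would go here
    }
    return g_setups.load() - before;
}
```

One difference worth keeping in mind: the flag lives as long as the thread does, so if the runtime reuses its worker threads for a later parallel region, the setup is not repeated there, whereas a private object is constructed again for every region.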
nope, I see what u mean. Thanks!
Post updated
this is much simpler + code with local thread variables on stack + join
int sum = 0;
#pragma omp parallel
{
    printf("in thread %d", omp_get_thread_num());
    int localSum = 0; // local sum, avoid cross core caching issues
    #pragma omp for
    for (int i = 0; i < 1000; i++)
    {
        localSum += i * i;
    }
    // combine results
    #pragma omp critical
    sum += localSum;
}
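The same split of "parallel" and "for" can carry the per-thread preparation from the post: anything placed in the parallel block before the "omp for" line runs once per thread. Below is a compilable sketch of that idea, with a hypothetical PerThreadSetup standing in for SetThreadName and friends; it uses long long for the sums only to leave headroom for larger n.

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> g_setupCalls{0};  // hypothetical: counts setup calls

// Hypothetical stand-in for SetThreadName / SetThreadPriority and the like.
void PerThreadSetup() { ++g_setupCalls; }

long long SumOfSquares(int n) {
    assert(n >= 0);
    long long sum = 0;
    #pragma omp parallel
    {
        PerThreadSetup();        // executed once by every thread in the team
        long long localSum = 0;  // lives on this thread's stack
        #pragma omp for
        for (int i = 0; i < n; ++i)
            localSum += static_cast<long long>(i) * i;
        // combine the per-thread results one thread at a time
        #pragma omp critical
        sum += localSum;
    }
    return sum;
}
```

SumOfSquares(1000) returns 332833500 whether or not OpenMP is enabled; without OpenMP the block simply runs once on the calling thread. Note that this form needs the loop to be wrapped in an explicit parallel block, while the private-object trick works with a bare "parallel for".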