Executing Code Once Per Thread in an OpenMP Loop

Take this toy loop:

#pragma omp parallel for
  for (int i = 0; i < 1000000; ++i)
    DoStuff();

Now suppose you want to run some preparation code once per thread: say, SetThreadName, SetThreadPriority, or whatnot. How would you go about that? Code placed before the loop executes just once, on the calling thread, and code placed inside the loop executes 1000000 times.

Here’s a useful trick: private loop variables run their default constructor once per thread. Just package the action in some type’s constructor, and declare an object of that type as a private loop variable:

struct ThreadInit
{
  ThreadInit()
  {
    SetThreadName("OMP thread");
    SetThreadPriority(THREAD_PRIORITY_BELOW_NORMAL);
  }
};

...
  ThreadInit ti;

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();

You can set a breakpoint in ThreadInit::ThreadInit() and watch it being executed exactly once in each thread. You can also enjoy the view in the Threads window, where all OMP threads now have their name and priority adjusted.
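For a check that needs neither a debugger nor the Win32 calls, here is a minimal portable sketch of the same pattern. The counters, the Touch() method, DemoStats, RunDemo and kHaveOpenMP are hypothetical names introduced only for this illustration; an atomic counter stands in for the real per-thread action. The loop body touches the private object so the sketch does not depend on how a compiler treats a private variable that is never referenced, and it assumes the loop has at least one iteration.

```cpp
#include <atomic>
#ifdef _OPENMP
#include <omp.h>
#endif

// True only when the compiler was run with OpenMP enabled (e.g. /openmp or -fopenmp).
#ifdef _OPENMP
constexpr bool kHaveOpenMP = true;
#else
constexpr bool kHaveOpenMP = false;
#endif

// Hypothetical counters: g_ctorCalls stands in for SetThreadName / SetThreadPriority.
static std::atomic<int> g_ctorCalls{0};
static std::atomic<long> g_iterations{0};

struct ThreadInit
{
  ThreadInit() { ++g_ctorCalls; }  // the per-thread action would go here
  void Touch() const {}            // lets the loop body reference the object
};

struct DemoStats
{
  int ctorCalls;    // constructor calls, including the original object
  int teamSize;     // threads in the parallel region (1 without OpenMP)
  long iterations;  // loop iterations actually executed
};

DemoStats RunDemo(int n)  // assumes n >= 1
{
  g_ctorCalls = 0;
  g_iterations = 0;
  int teamSize = 1;

  ThreadInit ti;  // original object: one constructor call on the calling thread

#pragma omp parallel for private(ti)
  for (int i = 0; i < n; ++i)
  {
    ti.Touch();
#ifdef _OPENMP
    if (i == 0)
      teamSize = omp_get_num_threads();  // written by exactly one thread
#endif
    ++g_iterations;
  }

  return DemoStats{g_ctorCalls.load(), teamSize, g_iterations.load()};
}
```

With OpenMP enabled, RunDemo(1000) should report ctorCalls equal to teamSize + 1: one call per thread in the team, plus one for the original object constructed before the loop.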

[Edit:] Improvement

The original object, created before entering the parallel region, is just a dummy meant for duplication, but it still runs its own constructor and destructor. This might be benign, as in the example above (a redundant setting of thread properties), but in my real-life situation I used the ThreadInit object to create a thread-specific heap, and an extra heap is not an acceptable overhead.

Here’s another trick: as the spec says, a default constructor is called for the object copies in the parallel region. Just create the dummy object with a non-default constructor, and make sure the real action happens only in the default one. Here’s one way to do so (you can also code different ctors altogether):

struct ThreadInit
{
  ThreadInit(bool bDummy = false)
  {
    if(!bDummy)
    {
      SetThreadName("OMP thread");
      SetThreadPriority(THREAD_PRIORITY_BELOW_NORMAL);
    }
  }
};

...
  ThreadInit ti(true);

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();
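The effect of the dummy flag can be checked with a small portable sketch along the same lines. As before, the counter, the Touch() method, DummyStats, RunDummyDemo and kHaveOpenMP are hypothetical names used only for illustration, and an atomic counter replaces the real initialization. It assumes a loop of at least one iteration.

```cpp
#include <atomic>
#ifdef _OPENMP
#include <omp.h>
#endif

// True only when the compiler was run with OpenMP enabled (e.g. /openmp or -fopenmp).
#ifdef _OPENMP
constexpr bool kHaveOpenMP = true;
#else
constexpr bool kHaveOpenMP = false;
#endif

// Hypothetical counter of "real" initializations (e.g. heaps created).
static std::atomic<int> g_realInits{0};

struct ThreadInit
{
  ThreadInit(bool bDummy = false)
  {
    if (!bDummy)
      ++g_realInits;  // the real per-thread action would go here
  }
  void Touch() const {}  // lets the loop body reference the object
};

struct DummyStats
{
  int realInits;  // constructor calls that performed the real action
  int teamSize;   // threads in the parallel region (1 without OpenMP)
};

DummyStats RunDummyDemo(int n)  // assumes n >= 1
{
  g_realInits = 0;
  int teamSize = 1;

  ThreadInit ti(true);  // dummy: constructed, but skips the real action

#pragma omp parallel for private(ti)
  for (int i = 0; i < n; ++i)
  {
    ti.Touch();  // each private copy was default-constructed (bDummy == false)
#ifdef _OPENMP
    if (i == 0)
      teamSize = omp_get_num_threads();  // written by exactly one thread
#endif
  }

  return DummyStats{g_realInits.load(), teamSize};
}
```

With OpenMP enabled, realInits should equal teamSize, with no extra initialization for the dummy. One thing to keep in mind: if the code is built without OpenMP, the pragma is ignored, the loop runs on the dummy object, and the real action never happens at all.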


5 Responses to Executing Code Once Per Thread in an OpenMP Loop

Anonymous says:

June 10, 2014 at 8:08 am

Hey Ofek
Have u tested that vs. boost::thread_specific_ptr ?

- Ofek Shilon says:
  
  June 10, 2014 at 8:21 am
  
  I experimented with direct TLS storage (“__declspec(thread)”), which I believe thread_specific_ptr wraps. It is more cumbersome and does not directly execute code once per thread: you’d need to do something like atomically test the value of the TLS slot, act on it and update it *in every loop iteration*, which is exactly what I’m trying to avoid. Do you see another usage for boost::thread_specific_ptr here?
  
Anonymous says:

June 10, 2014 at 8:25 am

nope, I see what u mean. Thanks!

- Ofek Shilon says:
  
  June 12, 2014 at 10:10 am
  
  Post updated
  
dshor says:

January 7, 2016 at 4:50 pm

this is much simpler + code with local thread variables on stack + join

int sum = 0;
#pragma omp parallel
{
  printf("in thread %d", omp_get_thread_num());
  int localSum = 0; // local sum, avoid cross core caching issues
#pragma omp for
  for (int i = 0; i < 1000; i++)
  {
    localSum += i * i;
  }
  // combine results
#pragma omp critical
  sum += localSum;
}