Executing Code Once Per Thread in an OpenMP Loop

Take this toy loop:

#pragma omp parallel for
  for (int i = 0; i < 1000000; ++i)
    DoStuff();

Now suppose you want to run some preparation code once per thread, say SetThreadName or SetThreadPriority. How would you go about that? Code placed before the loop executes only once, on the calling thread, and code placed inside the loop executes 1000000 times.

Here’s a useful trick: a loop variable listed in a private clause has its default constructor run once per thread. Just package the action in some type’s constructor, and declare an object of that type as a private loop variable:

struct ThreadInit
{
  ThreadInit()
  {
    SetThreadName("OMP thread");
    SetThreadPriority(THREAD_PRIORITY_BELOW_NORMAL);
  }
};

...
  ThreadInit ti;

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();

You can set a breakpoint in ThreadInit::ThreadInit() and watch it execute exactly once in each thread. You can also enjoy the view in the Threads window, as all OMP threads now have their name and priority adjusted.
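If you would rather count than watch breakpoints, here is a minimal portable sketch of the same trick. The counter g_initCount, the member localWork and the function CountThreadInits are names made up for this illustration; the constructor just increments the counter instead of calling SetThreadName or SetThreadPriority. The loop body touches ti on purpose, so the sketch does not depend on how a given compiler treats a private variable that the loop never references.

```cpp
#include <atomic>

// Illustrative counter: how many times ThreadInit's constructor has run.
static std::atomic<int> g_initCount{0};

struct ThreadInit
{
  int localWork = 0;

  // Stand-in for the real per-thread setup (SetThreadName etc.).
  ThreadInit() { ++g_initCount; }
};

// Runs the loop and returns the number of constructor calls observed.
int CountThreadInits(int iterations)
{
  g_initCount = 0;

  ThreadInit ti;  // the original object: its constructor runs once, here

#pragma omp parallel for private(ti)
  for (int i = 0; i < iterations; ++i)
    ti.localWork += 1;  // inside the loop, ti names the thread's private copy

  return g_initCount.load();
}
```

Calling CountThreadInits(1000000) with OpenMP enabled should return a small number on the order of the thread count (one call per private copy, plus one for the original object), not 1000000. If the code is built without OpenMP support the pragma is ignored and the result is 1.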

[Edit:] Improvement

The original object, created before entering the parallel region, is just a dummy meant for duplication, but it still runs its own constructor and destructor. This might be benign, as in the example above (a redundant setting of thread properties), but in my real-life situation I used the ThreadInit object to create a thread-specific heap, and an extra heap is not an acceptable overhead.

Here’s another trick: as the spec says, the default constructor is called for the object copies in the parallel region. Just construct the dummy object with an explicit argument, and make sure the real action happens only on the default-constructed path. Here’s one way to do so, using a defaulted parameter (you can also code different ctors altogether):

struct ThreadInit
{
  ThreadInit(bool bDummy = false)
  {
    if(!bDummy)
    {
      SetThreadName("OMP thread");
      SetThreadPriority(THREAD_PRIORITY_BELOW_NORMAL);
    }
  }
};

...
  ThreadInit ti(true);

#pragma omp parallel for private(ti)
  for (int i = 0; i < 1000000; ++i)
    DoStuff();
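The "different ctors altogether" variant might look roughly like the sketch below. DummyTag, the two counters, the member touched and the function RunLoop are hypothetical names introduced only for this illustration, and the counters stand in for the real setup work so the effect can be observed without a debugger.

```cpp
#include <atomic>

// Illustrative counters for the two construction paths.
static std::atomic<int> g_realInits{0};
static std::atomic<int> g_dummyInits{0};

// Empty tag type, used only to select the "do nothing" constructor.
struct DummyTag {};

struct ThreadInit
{
  int touched = 0;

  // Default constructor: used for the per-thread private copies.
  // The real per-thread setup would go here.
  ThreadInit() { ++g_realInits; }

  // Tag constructor: used only for the original object, skips the setup.
  explicit ThreadInit(DummyTag) { ++g_dummyInits; }
};

void RunLoop(int iterations)
{
  ThreadInit ti{DummyTag{}};  // dummy original, no real setup performed

#pragma omp parallel for private(ti)
  for (int i = 0; i < iterations; ++i)
    ti.touched += 1;  // touch the private copy so it is actually used
}
```

After RunLoop returns, g_dummyInits should be 1 and g_realInits should match the number of private copies created. One consequence worth keeping in mind: if the code is built without OpenMP support, the pragma is ignored, the loop runs with the dummy object, and the real setup never runs at all.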

 


5 Responses to Executing Code Once Per Thread in an OpenMP Loop

  1. Anonymous says:

    Hey Ofek
    Have u tested that vs. boost::thread_specific_ptr ?

    • Ofek Shilon says:

      I experimented with direct TLS storage (“__declspec(thread)”), which I believe thread_specific_ptr wraps. It is more cumbersome and does not directly execute code once per thread: you’d need to do something like atomically test the value of the TLS slot, act on it and update it *in every loop iteration*, which is exactly what I’m trying to avoid. Do you see another usage for boost::thread_specific_ptr here?

  2. Anonymous says:

    nope, I see what u mean. Thanks!

  3. dshor says:

    this is much simpler + code with local thread variables on stack + join

    int sum = 0;
    #pragma omp parallel
    {
      printf("in thread %d", omp_get_thread_num());
      int localSum = 0; // local sum, avoid cross core caching issues
      #pragma omp for
      for (int i = 0; i < 1000; i++)
      {
        localSum += i * i;
      }
      // combine results
      #pragma omp critical
      sum += localSum;
    }
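For comparison, the structure dshor suggests can also carry the per-thread setup from the post: anything written inside the parallel region but before the omp for line runs once per thread. The sketch below assumes a hypothetical PerThreadSetup function standing in for SetThreadName and SetThreadPriority, and a hypothetical counter g_setupCalls so the effect can be checked; SumOfSquares is likewise just an illustrative name.

```cpp
#include <atomic>

// Illustrative counter: how many threads ran the setup code.
static std::atomic<int> g_setupCalls{0};

// Hypothetical stand-in for the real per-thread setup.
void PerThreadSetup() { ++g_setupCalls; }

long long SumOfSquares(int n)
{
  long long sum = 0;

#pragma omp parallel
  {
    PerThreadSetup();        // runs once per thread in the team

    long long localSum = 0;  // lives on each thread's own stack

#pragma omp for
    for (int i = 0; i < n; ++i)
      localSum += static_cast<long long>(i) * i;

    // Combine the per-thread results one thread at a time.
#pragma omp critical
    sum += localSum;
  }

  return sum;
}
```

SumOfSquares(1000) returns 332833500 regardless of the thread count. Unlike the private-variable trick, this form needs the loop to be wrapped in an explicit parallel region, but it involves no dummy object and still performs the setup once when OpenMP is disabled.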
