Another Look at the VS2012 Auto Vectorizer

A while ago I did some experimenting with the (then beta) VS2012. After these experiments our team migrated to the 2012 IDE but kept to the 2010 toolset. Since then much has happened: an official VS2012 launch plus four updates, rather thorough documentation, and quite a few online talks by the compiler team. It was high time to take another look at the 2012 toolset.

What I care about

Our scenario is somewhat unpleasant, but I suspect very typical in the enterprise world: we have a large (~800K LOC) C++ code base, with some legacy niches that date back 15+ years. The code is computationally intensive and performance-sensitive, yet extremely rich in branch logic. While C++ language advances are nice, what I care about are backend improvements – and specifically, auto-vectorization.

Why you should care too

In the last decade or so, virtually 100% of the progress in the x86/x64 ISA was made in the processors' vector units*. The software side, however, is veeeery slow to catch up: to this day, making any use of these SSE/AVX treats requires non-standard, non-portable, low-level tweaks – tweaks that make economic sense only in specific market niches (say, video editing and 3D). Moreover, as the years go by, such costly tweaks make less and less sense even in these niches, since the effort is probably better invested in moving execution to the GPU.
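To make the "low-level tweaks" point concrete, here is a minimal sketch (my own, not taken from any real product) of what hand-written SSE typically looks like next to the plain loop it replaces. The intrinsics below are the standard ones from <xmmintrin.h>; the vectorized version silently assumes 16-byte-aligned buffers and a length divisible by 4, which is exactly the kind of fragile assumption that makes this approach expensive to maintain:

#include <xmmintrin.h>   // SSE intrinsics - x86-specific, non-portable

// The portable version most code bases actually contain.
void scale(float* dst, const float* src, float s, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * s;
}

// The hand-vectorized equivalent: 4 floats per iteration.
// Assumes 16-byte alignment and n % 4 == 0 - illustrative only.
void scale_sse(float* dst, const float* src, float s, int n)
{
    __m128 vs = _mm_set1_ps(s);                   // broadcast s to all 4 lanes
    for (int i = 0; i < n; i += 4)
    {
        __m128 v = _mm_mul_ps(_mm_load_ps(src + i), vs);
        _mm_store_ps(dst + i, v);                 // store 4 results at once
    }
}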

If you think about this for a second, our industry is in a somewhat ridiculous state: for over 10 years, the leading general-purpose processor architecture has been evolving in a direction that just isn't useful to most of the software it runs!

This is exactly where auto-vectorization ought to come in.
 
To make use of all these virgin, luscious silicon fields, C++ compilers need to be extra clever. They must be able to reason about code – without any help from the code itself (well, not yet) – and automatically translate execution to SIMD units where possible. This is a tough job, and AFAIK until recently only Intel's compiler was somewhat up to the task. (I passionately hate Intel's compiler for other reasons, but that's beside the point now.) Only towards VS2012 did MS decide to try to catch up and add decent vectorization support, and if this effort succeeds I truly believe it can revolutionize the SW industry, no less.

So, I dedicated two afternoons to building one of our products with the latest public VS2012 bits, /Qvec-report:2 switched on, and trying to understand how much of the vectorization potential is actually fulfilled.
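(For the record, reproducing this takes nothing exotic: the auto vectorizer only runs when optimizations are enabled, and the report switch is just appended to the compiler flags. Something along these lines – the file name is obviously made up:)

cl /O2 /Qvec-report:2 /c SomeHotLoops.cpp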

Well, are we there yet?

No.

Only a negligible fraction of the loops was vectorized successfully. A random inspection of the reported vectorization failures shows that almost none of them stem from genuine obstacles in the code. I created Connect reports for some of the bogus failures – here are the technical details.

1. Copy operations are pretty much never vectorized. Even MS internal STL code –

template<class _OutIt, class _Diff, class _Ty>
_OutIt _Fill_n(_OutIt _Dest, _Diff _Count, const _Ty& _Val)
{   // copy _Val _Count times through [_Dest, ...)
    for (; 0 < _Count; --_Count, ++_Dest)
        *_Dest = _Val;
    return (_Dest);
}

– fails to vectorize, with the reported reason being the generic code 500. Even worse, the following snippet, taken from MSDN itself:

void code_1300(int *A, int *B)
{
    // Code 1300 is emitted when the compiler detects that there is
    // no computation in the loop body.

    for (int i=0; i<1000; ++i)
    {
        A[i] = B[i]; // Do not vectorize, instead emit memcpy
    }
}

– fails to have memcpy emitted for it, contrary to what the comment promises. Eric Brumer responded that they are indeed working on this very problem.
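Until this is fixed, the obvious workaround for hot copy loops is to say what you mean and call memcpy directly – e.g., for the MSDN snippet above, something like:

#include <cstring>

void code_1300_workaround(int *A, int *B)
{
    // Hand-rolled replacement for the loop above; valid only when
    // A and B are known not to overlap (otherwise use memmove).
    std::memcpy(A, B, 1000 * sizeof(int));
}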

2. Vectorization decisions by the compiler are extremely sensitive to unrelated code changes. Eric Brumer's investigation – in the Connect link – shows (IIUC) that vectorization decisions depend on previously made inlining decisions, which is what makes them highly fragile and undependable. The reported failure reasons in these cases seem outright random. Again, they are working on it.

3. The declared signature of operator new hides the fact that a freshly allocated buffer can safely be considered non-aliased. There may be valid legal reasons for that, but it comes at a formidable price: the alias analysis done by the compiler is severely crippled, and major optimization opportunities (vectorization being just one) are thereby missed. Stephan Lavavej reports that 'the compiler back-end team has an active work item to investigate this'. A sketch of the problem, and of a common non-portable workaround, follows.
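To illustrate (my own sketch, not the code from the Connect report): below, nothing in the declaration of operator new tells the compiler that x and y are distinct allocations, so its alias analysis must assume the stores through y can modify x, and vectorization is guarded or abandoned. The non-standard __restrict keyword is one blunt way to hand it the missing guarantee yourself:

void saxpy(int n, float a)
{
    float* x = new float[n];
    float* y = new float[n];
    // ... fill x and y ...

    // The compiler cannot prove x and y don't alias, even though we
    // know they are two separate heap allocations.
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

    delete[] x;
    delete[] y;
}

// Non-portable workaround: promise non-aliasing explicitly.
void saxpy_restrict(int n, float a, float* __restrict x, float* __restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}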

4. This one is just a quibble, really: it turns out the report message '1106 Inner loop is already vectorized. Cannot also vectorize the outer loop' is misleading. Outer loops are never vectorized, regardless of inner-loop vectorization.
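For completeness, this is the kind of nest I mean (illustrative sketch): the inner loop is the one the vectorizer picks up, and the outer loop reports 1106 – which reads as if it was merely skipped in favour of the inner one, when in fact outer loops are simply never candidates.

void scale_rows(float m[][256], const float* factors, int rows)
{
    for (int r = 0; r < rows; ++r)        // outer loop: reported as 1106
        for (int c = 0; c < 256; ++c)     // inner loop: this one vectorizes
            m[r][c] *= factors[r];
}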

5. About vectorizer report '1303 Too few loop iterations for vectorization to provide value': this pretty much rules out any optimization of 2D/3D loops. Now, for one, Intel invested significantly in a short-vector math library (intrinsic to their compiler) to harvest potential speedups in exactly such cases. Second, there are quite a few hand-vectorized 3D libraries out there, so it seems others have also concluded that this is a worthy optimization. So while I don't have decisive quantitative data, I find this reported reason highly suspicious.
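For the record, this is the shape of loop I have in mind (a made-up sketch): a fixed trip count of 3 and a clean multiply-accumulate body – exactly the bread and butter of 3D code, and exactly the sort of loop code 1303 turns away.

struct Vec3 { float v[3]; };

// Classic 3D inner kernel: dot product of two 3-component vectors.
float dot(const Vec3& a, const Vec3& b)
{
    float d = 0.0f;
    for (int i = 0; i < 3; ++i)   // only 3 iterations - too few for the vectorizer
        d += a.v[i] * b.v[i];
    return d;
}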

I came across more bumps and weird behaviour, but decided not to dig any deeper than this.

Bottom Line

People infinitely smarter than me are investing tremendous effort in vectorization technology, and I'm sure it will grow to be impressive. That being said, marketing divisions inevitably work much faster than R&D, and it seems VS2012 is just the start of a ramp-up stage.

For us, a VS upgrade is a major hassle. It involves cross-team coordination, chasing third-party vendors for new builds, and the unavoidable ironing out of the various wrinkles that come with any migration. I just can't justify this hassle with any tangible added value for us**. We'll most likely 'go Vista' on VS2012, and just skip it quietly.

I'm anxious to come back and test the vectorizer again – but not before at least a few service packs*** for VS2013 are out****.
__________________________________________________

* That's not to say no other significant progress was made in the processor space itself – transistors got tinier, caches got enormous, memory controllers and graphics processors were integrated on-die, etc. etc. I am saying that practically all the architectural innovations of the last decade were SIMD extensions.

** Well, there is this undocumented goodie that holds some very tangible added value for us, but probably not enough. More on that some day.

*** I know! updates, updates.

**** Footnotes are fun. Just saying.
