UPDATE: While I still believe the overall conclusions below hold, the actual analysis of the main example is erroneous and kept here only out of respect for some external links. A detailed correction is in the following post. Thanks @JimHogg!
Vectorization is inherently a low level topic, so when discussing it there is a tendency to go deep into technical details – thereby losing both a large part of the audience and the wider perspective. I’ll try to avoid these pitfalls here – so no disassembly screenshots today.
Background
For over 15 years, one of the main directions of processor evolution has been SIMD: applying an operation, in (more or less) a single cycle, to multiple data elements stored consecutively in a wide register. The initial incarnations were limited to integer types and so not very useful, but once floating-point support arrived, developer interest caught up.
The bread-and-butter of current vectorization, at least on desktop platforms, is SSE: an ever-expanding family of x86 instructions and registers introduced by Intel. SSE registers are 128 bits wide and every SSE instruction operates on such a register – thereby processing either four single-precision or two double-precision floats in a single cycle. Prior to SSE, MMX instructions operated on 64-bit registers; the post-SSE AVX instructions work on 256-bit registers, and rumor has it that the next processor generation will feature 512-bit registers – you get the idea.
Of course, someone needs to write code that executes all these fancy instructions. For most of those 15 years this burden lay on the developer – and to a large extent it still does. Whether via .asm files, inline assembly or intrinsics, some low-level maven has to code at the instruction level to utilize all this extra processing power. Optimized libraries do help with some common math and image-processing tasks, but to a large extent mainstream code has yet to benefit from vectorization at all. This may change if compilers become smart enough to recognize vectorizable portions of high-level code, and to do the vectorization themselves.
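For a sense of what that burden looks like, here is a minimal intrinsics sketch (my own illustration, not code from any product discussed below) of an element-wise add with SSE:

#include <xmmintrin.h>   // SSE intrinsics

// a[i] = b[i] + c[i], four floats per instruction.
// Assumes n is a multiple of 4; real code would also handle the tail and alignment.
void add_sse(float* a, const float* b, const float* c, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 vb = _mm_loadu_ps(b + i);            // load 4 floats from b
        __m128 vc = _mm_loadu_ps(c + i);            // load 4 floats from c
        _mm_storeu_ps(a + i, _mm_add_ps(vb, vc));   // store 4 sums into a
    }
}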
Auto-Vectorization is the coveted ability to do just that: have the compiler do the vectorization for you.
VS11 is Microsoft’s first step towards this noble goal. I read that the dev-preview was less than stellar, but haven’t seen anything about the beta. I care deeply about this particular feature and had high hopes for it, and so set out to explore.
Results
As the VC compiler guys demonstrate in this Channel 9 session, ‘embarrassingly parallel’ code vectorizes well:
float a[1000], b[1000], c[1000];
for (int i = 0; i < 1000; ++i)
    a[i] = b[i] + c[i];
Reduction poses an extra challenge, as it introduces a semantic dependence among loop iterations – but the following still vectorizes well:
float sum = 0.0f;
for (int i = 0; i < 1000; ++i)
    sum += a[i];
As far as my brief exploration went, the optimizer easily sees through object-oriented abstractions. In header-only container templates I also saw many cross-function vectorizations (namely, where inlining took place).
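For example, a reduction wrapped in a standard container – roughly like the following, which is a reconstruction rather than the actual code I tested – still vectorized once the accessors were inlined:

#include <cstddef>
#include <vector>

float sum_all(const std::vector<float>& v)
{
    float sum = 0.0f;
    // After operator[] and size() are inlined, this is the plain reduction loop above.
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    return sum;
}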
But.
A Big But
The party stops here:
for (int i = 0; i < 1000; ++i)
    sum += a[i] * b[i];
This code does not vectorize in VS11 beta.
Mind you, this is not a contrived example – it is a fundamental computation in BLAS, 3D and pretty much every quasi-mathematical code: vector-vector dot products, matrix-vector products and matrix-matrix products are all essentially such sums of products.
Now vectorization is plainly hard, and this loop in particular is. The data dependencies that prevent auto-vectorization (and parallelization in general) – each illustrated in the short sketch after the list – are:
- Read after write,
- Write after read,
- Write after write.
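For concreteness, minimal (hypothetical) loop-carried examples of each:

// Read after write (flow dependence): a[i-1] was written by the previous iteration.
for (int i = 1; i < n; ++i) a[i] = a[i-1] + b[i];

// Write after read (anti dependence): a[i+1] is read here, then written by the next iteration.
for (int i = 0; i < n - 1; ++i) a[i] = a[i+1] * 2.0f;

// Write after write (output dependence): two iterations may store to the same a[idx[i]].
for (int i = 0; i < n; ++i) a[idx[i]] = b[i];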
And the sum-of-products loop above seems to contain all three. However:
- The same holds for the plain sum-reduction loop, for(…) sum += a[i], which VC11 vectorizes successfully.
- It just so happens that the Intel compiler documentation uses this very code as its example for vectorization of reduction loops:
…there is a very important exception, that apparently contains all of the above types of dependency:
sum = 0;
for (j = 1; j < MAX; j++)
    sum = sum + A[j]*B[j];
Although sum is both read and written in every iteration, the [Intel] compiler recognizes such reduction idioms, and is able to vectorize them safely.
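Roughly, “vectorize safely” amounts to keeping several independent partial sums and combining them at the end – something like this hand-written SSE sketch (my own illustration; it assumes the length is a multiple of 4 and that floating-point addition may be reordered):

#include <xmmintrin.h>

float dot_sse(const float* a, const float* b, int n)
{
    __m128 acc = _mm_setzero_ps();              // four independent partial sums
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);                    // spill and combine the partial sums
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

The loop-carried dependence on sum is broken into four independent accumulators, which is only legal if the compiler may treat floating-point addition as associative – hence reductions get special treatment.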
Competition
Optimization-wise, the Intel compiler is really where the bar is set. Not only does it successfully vectorize a much wider range of cases, it supplies extensive pragmas and switches to control vectorization, and can even report which loops were vectorized and, to some extent, why vectorization failed where it did. (In all the tests above I inspected the disassembly to tell which loops MSVC vectorized.)
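For reference, a sample of the Intel-specific controls I mean – exact spellings vary between compiler versions, so treat this as a sketch rather than a recipe:

// Per-loop pragmas (ICC extensions):
#pragma ivdep            // promise there are no loop-carried dependencies
#pragma vector always    // vectorize even if the profitability heuristic objects
for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];

// Command-line reporting, e.g.
//   icl /Qvec-report2 file.cpp   (Windows)
//   icc -vec-report2 file.cpp    (Linux)
// lists which loops were and were not vectorized (higher levels add the reasons).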
The lack of all this functionality is perfectly acceptable in a beta – and who knows, maybe some of it is actually there; details are slowly unveiled in Jim Hogg’s series of posts. I hope even more will be exposed in the RTM – but I would be (pleasantly) surprised if the range of vectorizable loops were broadened.
Intel’s compiler is markedly slower, and it might seem that MS deliberately makes different performance/productivity tradeoffs – but now that it’s obvious MS does want vectorization done, it’s also obvious that they’re simply playing catch-up in this field.
Bottom Lines
The code of most products I work on is a classic example of a case where an auto-vectorizer should be beneficial: the tasks are computationally intensive, the code is entirely C++ with almost no low-level optimizations, it is also highly branch-intensive, and it is hard to find single bottlenecks where manual optimization would justify the investment.
I built and ran one of our core products with the VC11 beta and measured a ~10-minute computational job. I saw no performance improvement over VC10 (VC11 actually took a bit longer, but well within statistical error).
As of May 2012, devs still must hand-optimize their code to reap the benefits of ~1995 processor improvements. Perhaps you’ll be the one to change that?
Nice write-up. But I’d like to mention that the code snippets are cut off at the angle brackets of the for-loops.
Thanks – no idea what happened there. Fixed back now.
From my POV it’s not very important whether the instructions are in the ISA or not, since I can afford to recompile my code for new targets, and I don’t program in assembly or even with the intrinsics, but with higher-level wrapper classes – to enjoy far more readable and maintainable code thanks to C++ operator overloading. For example, I typically already have a 256-bit packed-integer class, and functions like Scatter/Gather/Compress/Swizzle/Deswizzle/… Only actual timings matter to the final users, and only the quality of the source code should matter to the coders – well, IMHO.
That’s actually a clever idea, which I’ve never used – I started out with inline assembly, but later switched to intrinsics (indeed for improved maintainability). Might give it a try some day. Thanks!
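(For illustration – a minimal sketch of the wrapper-class approach described in the comment above; my own reconstruction, assuming SSE, with made-up names:)

#include <xmmintrin.h>

// Wrap an SSE register and overload operators, so client code reads like scalar math.
struct vec4
{
    __m128 v;
    explicit vec4(__m128 x) : v(x) {}
    explicit vec4(const float* p) : v(_mm_loadu_ps(p)) {}
    void store(float* p) const { _mm_storeu_ps(p, v); }
};

inline vec4 operator+(vec4 a, vec4 b) { return vec4(_mm_add_ps(a.v, b.v)); }
inline vec4 operator*(vec4 a, vec4 b) { return vec4(_mm_mul_ps(a.v, b.v)); }

// Usage: a[i..i+3] = b[i..i+3] * c[i..i+3] + d[i..i+3]
// (vec4(b + i) * vec4(c + i) + vec4(d + i)).store(a + i);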