Start at the end: the main example analyzed in the previous post is plain wrong. This loop:
for (int i=0; i<1000; ++i) sum += a[i];
Vectorizes perfectly.
Even after I wrongfully accused his team of this fictitious vectorization miss, Jim Hogg was kind enough to (1) test it and report that this reduction loop is indeed vectorized, (2) link to my post, and worse yet, (3) say he enjoyed this blog. What can I say, I’m embarrassed and humbled. Thanks Jim.
My mistake was not – as Jim suspected – omitting /fp:fast. Rather, the problem was that I coded multiple simple tests into a single console app main function and debugged the resulting ICC/MSVC binaries in disassembly mode. On more thorough inspection it seems both ICC and MSVC now interleave computations aggressively, and if, as I suspect, the aging PDB format still maps a consecutive range of instruction addresses to each source line, the debugger has a hard time matching a location in the disassembly to a source line. All in all, I most probably drew the right conclusions on the wrong loops.
I did similar tests again – this time checking a single loop in every test. A different case quickly turned up where ICC vectorizes and MSVC doesn’t:
double a[2] = { 1., 2. };
double b[20000];
double S = 0;
for (int i=0; i<20000; i+=2)
    S += a[0]*b[i] + a[1]*b[i+1];
And just to make extra sure, here’s some disassembly:
MSVC:
ICC:
ICC does some loop unrolling too so the code is harder to follow – but for skimming purposes it suffices to note the ‘packed double’ mul version (mulpd) in ICC, contrasted with the ‘scalar double’ mul version (mulsd) in MSVC. Similar results are seen in single precision floats too.
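For anyone who wants to repeat the experiment, here is a minimal sketch of the "single loop per test" setup. The function name weighted_pair_sum and its signature are my own invention for illustration; only the loop body comes from the test above. The idea is simply that a loop living alone in its own function (ideally its own source file) leaves far less room for the debugger to mismatch disassembly and source lines. Whether a given compiler version vectorizes it is exactly what you'd be checking, so treat this as a starting point and not as a statement about any compiler.

```cpp
#include <vector>

// Hypothetical helper (name and signature are assumptions, not from the
// original tests). It holds exactly one loop, so the generated code for
// this function is the generated code for the loop under inspection.
// Precondition: n is even, matching the i += 2 stride of the original test.
double weighted_pair_sum(const double a[2], const double* b, int n)
{
    double S = 0;
    for (int i = 0; i < n; i += 2)
        S += a[0] * b[i] + a[1] * b[i + 1];   // same body as the test above
    return S;
}

// Convenience wrapper for quick experiments: fills b with a constant so the
// expected result is easy to check by hand.
double run_pair_sum_test(int n, double fill)
{
    const double a[2] = { 1., 2. };
    std::vector<double> b(n, fill);
    return weighted_pair_sum(a, b.data(), n);
}
```

Calling run_pair_sum_test(20000, 1.0) reproduces the array size used above. Later MSVC releases document a /Qvec-report:2 switch that prints a per-loop vectorization verdict; if your compiler version has it, it is a much less error-prone check than reading disassembly in the debugger.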
As in the previous post, this is simplified code that aims to capture the essence of real vectorizable scenarios. Suppose, for example, you need to transform a 3D mesh by a fixed rotation and translation. This amounts to a large loop with computations of the above type: one argument constant, the other scanning an array. Such code might benefit considerably from auto vectorization.
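To make the mesh scenario concrete, here is a rough sketch of such a transform loop. The Vec3 struct, the transform_mesh name and the row-major float[3][3] rotation layout are all assumptions made for this example; real mesh code will have its own types. Each output coordinate is a constant row of the rotation matrix multiplied against the coordinates of the current vertex, plus a constant offset, which is the same "one argument constant, the other scanning an array" shape as the test loop. I have not checked how any particular compiler handles this exact code.

```cpp
#include <cstddef>

// Hypothetical vertex type, assumed for illustration only.
struct Vec3 { float x, y, z; };

// Applies out[i] = R * in[i] + t to every vertex.
// R is a fixed 3x3 rotation (row-major) and t a fixed translation;
// both stay constant while the loop scans the vertex arrays.
void transform_mesh(const float R[3][3], const float t[3],
                    const Vec3* in, Vec3* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        const Vec3 p = in[i];   // copy first, so in and out may be the same array
        out[i].x = R[0][0] * p.x + R[0][1] * p.y + R[0][2] * p.z + t[0];
        out[i].y = R[1][0] * p.x + R[1][1] * p.y + R[1][2] * p.z + t[1];
        out[i].z = R[2][0] * p.x + R[2][1] * p.y + R[2][2] * p.z + t[2];
    }
}
```

The array-of-structs layout here is just the most familiar one. A struct-of-arrays layout (separate x, y and z arrays) is often considered friendlier to vectorizers, but that is a general rule of thumb and not something measured in this post.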
The real test was the last one described on the blog: build and measure some real-life, computationally intensive code. I did just that, and the results were – as noted – no measurable improvement over VC10. So either my code has less to gain from vectorization than I hoped, or the gaps remaining in the vectorizer hold more promise than the gaps already filled.
I gotta try and measure performance with ICC one day – if I ever have the patience. Our code takes nearly half an hour to build on MSVC, so I’m guessing ICC builds could be done neither nightly nor over-weekendly.
Hi Ofek,
Thanks for the report. I wish this vectorizer would solve all my problems – but I guess it’s a classic case of “too good to be true”.
I would be interested to hear the results of ICC on your code, if you ever get around to that.