Sunday, December 28, 2008

Automatic Vectorization - does it work?

In my previous post I showed an example of how smart GCC can be. But the most incredible feature of GCC is automatic vectorization. Normally, game developers and others use 'intrinsics', that is, small predefined inline functions that translate directly into SIMD instructions, to exploit the vector processing capabilities of the CPU. All major compilers (GNU, Intel's, and Microsoft's) support intrinsics. However, as far as I know, GCC is the only one (?) that tries to vectorize the code without human intervention, as an optimization step. Of course, we need to give it some hints about how to group data into vectors, using the vector_size attribute. For example, take the following C function:
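
A minimal sketch of such a function. The type name `v4sf` is my own choice, and the function name `addv4` is inferred from the `_addv4` label in the assembly output:

```c
/* Four packed single-precision floats: 16 bytes, i.e. one XMM register. */
typedef float v4sf __attribute__((vector_size(16)));

/* Element-wise sum; GCC's vector extension defines + for vector types. */
v4sf addv4(v4sf a, v4sf b)
{
    return a + b;
}
```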


GCC compiles it to this output:


_addv4:
addps %xmm1, %xmm0
ret

Impressive! The arguments are passed in XMM registers, so there is not even a single read from or write to memory!

Should I worry about my job? Will compilers finally outsmart humans in producing better assembly output? Not yet... Take another, very similar example, where we just want vectors of 3 floats instead of 4, which is natural enough when living in 3D space. GCC rejects it with this error message:

error: number of components of the vector not a power of two
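
The declaration that triggers this is presumably something like the commented-out typedef below (`vector_size(12)` for three floats is my reconstruction, and all names here are made up for illustration). One possible workaround, sketched under the assumption that wasting one lane is acceptable, is to keep 3D vectors in a 4-float vector and leave the fourth component at zero:

```c
/* typedef float v3sf __attribute__((vector_size(12)));
   Rejected by GCC: 12 bytes hold 3 floats, which is not a power of two. */

/* Workaround: store x, y, z in a 4-float vector; the 4th lane is padding. */
typedef float v4sf __attribute__((vector_size(16)));

/* Builds a padded 3D vector with the unused lane set to zero. */
v4sf make_vec3(float x, float y, float z)
{
    v4sf v = { x, y, z, 0.0f };
    return v;
}

/* Element-wise sum; the padding lanes simply add 0.0f + 0.0f. */
v4sf addv3(v4sf a, v4sf b)
{
    return a + b;
}
```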

Oops! Good thing I don't get this message from malloc() when I want to reserve, let's say, 95 bytes. Next, look at the following simple function:


Compiling with gcc -S -O3 -msse3 -fomit-frame-pointer -foptimize-register-move gives:


What disturbs me is that the register moves are not optimized. Basically, all the movaps instructions could be eliminated if we packed the floats into the right registers, something like this:


It seems that for performance-critical applications or code segments, we'd better do it manually, in assembly or using intrinsics.
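
For comparison, here is roughly what the intrinsics route looks like for the 4-float addition from the beginning of the post. `_mm_add_ps`, `_mm_loadu_ps`, and `_mm_storeu_ps` are standard SSE intrinsics from `<xmmintrin.h>`; the function names `addv4_intrin` and `add4_arrays` are made up for this sketch, and it assumes an x86 target with SSE enabled:

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Same operation as a + b on vector_size(16) floats; corresponds to ADDPS. */
__m128 addv4_intrin(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}

/* Adds two arrays of 4 floats. Unaligned loads and stores are used so
   the caller does not have to guarantee 16-byte alignment. */
void add4_arrays(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, addv4_intrin(va, vb));
}
```

The explicit loads and stores are the price of doing it by hand: we decide ourselves what lives in which register, instead of hoping the compiler figures it out.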
