How to trick C/C++ compilers into generating terrible code? (futurechips.org)
61 points by l0stman on Oct 22, 2011 | hide | past | favorite | 18 comments


These issues are why various compiler hints exist.

In the dead code example, the gcc function attribute 'const' can be applied to the declaration of bar(), telling the compiler that it is a pure function whose result depends on nothing but its arguments.
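The article's bar() isn't quoted in this thread, so the body below is assumed; it's a minimal sketch of how the attribute is applied:

```c
/* Telling GCC that bar() depends on nothing but its argument lets the
   optimizer hoist the call out of loops, fold repeated calls, or delete
   the call entirely when its result is unused. */
__attribute__((const)) static int bar(int x) {
    return x * x + 1;   /* no reads or writes of global state */
}

int sum_bar(int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += bar(7);   /* compiler may compute bar(7) just once */
    return total;
}
```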

In the pointer example, the C99 standard 'restrict' qualifier can be applied to a, b and c to tell the compiler that the values pointed to by these variables do not overlap.
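A sketch of the qualifier in use (the function name and body here are my own, not the article's):

```c
#include <stddef.h>

/* With restrict, the compiler may assume a, b, and c never overlap, so
   it can keep loaded values in registers and vectorize the loop instead
   of conservatively reloading a[i] and b[i] after every store to c[i]. */
void add_arrays(size_t n,
                const double *restrict a,
                const double *restrict b,
                double *restrict c) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```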

'restrict' will also help the global variable example - the reason that N is loaded each time around the loop is because as far as the compiler knows, one of the a[i]s could alias with N.


Mild quibble: the attribute you want is "pure", not "const". The distinction is that a const function inspects nothing but its arguments, but a pure function is allowed to read (but not write) external memory. Both are without side effects and can be optimized out of loops, but pure is looser. Not all const functions can be pure.
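To make the distinction concrete, here is a small illustration with hypothetical functions:

```c
/* const: the result depends only on the argument values. */
__attribute__((const)) static int square(int x) {
    return x * x;
}

static int scale = 10;   /* mutable global state */

/* pure: may READ global memory, but has no side effects.  It cannot be
   declared const, because its result changes if 'scale' changes between
   calls. */
__attribute__((pure)) static int scaled(int x) {
    return x * scale;
}
```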


"pure" in D causes the compiler to disallow, within that function, writes or reads from global mutable data. This inevitably follows from the principle that if you pass the same arguments to a pure function, you'll get the same result returned.

A pure function is still allowed to allocate memory, though, and throw an exception.


A subtle point that bit me: const functions should not read from pointer arguments, such as const char pointers. The pointed-to data is memory external to the const function.

Also, gcc does not seem to warn about a const function reading external memory. It seems like this would be an easy error for the compiler to detect.


Yes, that's why I said "...whose result depends on nothing but its arguments." - the example bar() function in the original article does not read global memory, so it can be declared __attribute__((const)).


Sure. I was just pointing out that that is a stricter constraint than required for the optimization in question. Doing loop hoisting and CSE wants "pure" functions, because pure "means" a function without side effects.

The "meaning" of const is that the function depends on nothing but its arguments, and can therefore have its value computed at compile time, or be part of a global CSE pass. That's a different optimization.


It seems to me that loop hoisting would also be easier with ((const)), because to do so with a ((pure)) function requires further assessing that the loop does not modify any global state that might be visible to the ((pure)) function. A ((const)) function can be hoisted out of a loop even if the loop modifies globals, or values through pointers that might point at globals.
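A small sketch of that scenario (hypothetical names), where the loop body mutates a global yet the const call can still be hoisted:

```c
__attribute__((const)) static int f(int x) { return x + 1; }

static int counter;   /* global state mutated inside the loop */

int run(int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        counter++;        /* the loop modifies global state...        */
        total += f(n);    /* ...but f is const, so f(n) may still be
                             computed once, before the loop begins    */
    }
    return total;
}
```

Had f been merely pure, the compiler would first have to prove that f never reads 'counter' before hoisting the call.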


I'm glad to see pointer aliasing mentioned here. Back in the day, I was writing fluid mechanics simulations in FORTRAN. Writing in FORTRAN is awful, of course, so I did some research into why FORTRAN is considered faster than C++ for these simulations. A lot of it is history, and the fact that the people writing the simulations are engineers first and programmers a distant second, but another thing that kept coming up was pointer aliasing: because it is absent in FORTRAN (or at least made more explicit), FORTRAN compilers are able to implement some important optimizations that C++ compilers can't. I wanted to experiment a little with the C99 restrict keyword, to see if it would produce results similar to FORTRAN, but I never really got around to it.


Although it's a poorly titled article, it was interesting to read. Surely the objective is to trick the compiler into generating the best code. I was surprised that the vectorisation it mentions was not performed automatically.

One other way would be to target different hardware than the compiler is designed to work with via flags, or by using AMD hardware with the Intel compiler mentioned in the article. There was a very short discussion about this on reddit yesterday: http://www.reddit.com/r/programming/comments/lj1ze/ask_rprog...


> I was surprised that the vectorisation it mentions was not performed automatically.

The OP didn't mention what compiler was being used; GCC will certainly automatically vectorize this example (and ICC probably will as well). I used GCC 4.4 for x86 with -O3 -msse2.

Of course, there's a lot of compensation code inserted for unaligned pointers and aliased pointers and the like, but automatic vectorization is certainly doable.
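For a concrete (if invented) example of the kind of loop GCC's vectorizer handles at those settings:

```c
#include <stddef.h>

/* Compiled with: gcc -O3 -msse2
   GCC emits packed SSE arithmetic for the main loop, plus scalar
   prologue/epilogue code to cope with unaligned or potentially
   aliased pointers. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```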


Depends entirely upon the compiler. Don't worry about this stuff in general unless you're doing embedded work, where the compilers are often problematical. Any desktop compiler will know more about generating code than you do.


Embedded compilers are perfectly intelligent; I would hazard that you just wind up doing weird things more often, and/or you care more what exactly it does with this or that function because of your 32kHz clock and/or 1KB of program memory.


Many embedded compilers and assemblers are terribly buggy. Having worked with dozens, my impression is that the compilers targeting 8-bit microcontrollers are generally of similar quality to 80's x86 compilers. Emission of incorrect assembly given correct code is rare, but optimizations are incredibly weak and the compilers segfault relatively often.

Some of the DSP tool suites have solid compilers that optimize insightfully for their target architecture, but their in-circuit debuggers are tied to flaky IDEs. I'm currently working with an XDP debugger on a Sandy Bridge board that requires the debugger software to be restarted nearly hourly, often corrupting the project file and forcing me to enter the memory map again.

Lately I've been thrilled to spec in ARM micros because I can just use GCC, an el-cheapo universal USB JTAG adapter through OpenOCD, and expect everything to work.


Interesting point about ICC generating faster code and the link order making a difference in the running time of the application.


The biggest thing you'll notice in practice from ICC is that it is much more likely to unroll a loop and transform the body to use SSE instructions when requested. If you have dense numeric code and aren't either using ICC or hand-unrolling and using GCC intrinsics, you're probably leaving performance on the floor.

But, there's still no free lunch. For example, as of about two months ago, ICC will unroll loops whose increment is "i++" but will not unroll loops whose increment is "i+=1". Some insight, looking at output assembly, etc. is still required.


Interesting. Another thing I noticed is that code runs faster if floating point and integer arithmetic instructions are interleaved rather than "blocked" together.


When I declared N as a global variable, the compiler left it in memory and added a load instruction to the loop. When I put N as a local variable, the compiler loaded it into a register. While I do blame the compiler for this particular behavior (because I did not declare N as volatile), we have to work with what we have.

This is because global variables can be modified at runtime unless they're const. Without sufficiently powerful link-time optimization, the compiler cannot guarantee that the variable isn't declared extern and modified by some other file.
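A sketch of the two cases (N's value and the function bodies are assumed for illustration):

```c
static int N = 4;   /* global: re-read each iteration unless the
                       compiler can prove nothing aliases it */

int sum_global(const int *a) {
    int total = 0;
    for (int i = 0; i < N; i++)   /* a[i] might alias N, so N is reloaded */
        total += a[i];
    return total;
}

int sum_local(const int *a) {
    int n = N;                    /* local copy: n can live in a register */
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}
```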


These issues are also why SQL hints exist.




