Why does GCC generate such radically different assembly for nearly the same C code? Why does GCC generate such radically different assembly for nearly the same C code? c c

Why does GCC generate such radically different assembly for nearly the same C code?


Updated to sync with the OP's edit

By tinkering with the code, I've managed to see how GCC optimizes the first case.

Before we can understand why they are so different, first we must understand how GCC optimizes fast_trunc_one().

Believe it or not, fast_trunc_one() is being optimized to this:

int fast_trunc_one(int i) {    int mantissa, exponent;    mantissa = (i & 0x07fffff) | 0x800000;    exponent = 150 - ((i >> 23) & 0xff);    if (exponent < 0) {        return (mantissa << -exponent);             /* diff */    } else {        return (mantissa >> exponent);              /* diff */    }}

This produces the exact same assembly as the original fast_trunc_one() - register names and everything.

Notice that there are no xors in the assembly for fast_trunc_one(). That's what gave it away for me.


How so?


Step 1: sign = -sign

First, let's take a look at the sign variable. Since sign = i & 0x80000000;, there are only two possible values that sign can take:

  • sign = 0
  • sign = 0x80000000

Now recognize that in both cases, sign == -sign. Therefore, when I change the original code to this:

int fast_trunc_one(int i) {    int mantissa, exponent, sign, r;    mantissa = (i & 0x07fffff) | 0x800000;    exponent = 150 - ((i >> 23) & 0xff);    sign = i & 0x80000000;    if (exponent < 0) {        r = mantissa << -exponent;    } else {        r = mantissa >> exponent;    }    return (r ^ sign) + sign;}

It produces the exact same assembly as the original fast_trunc_one(). I'll spare you the assembly, but it is identical - register names and all.


Step 2: Mathematical reduction: x + (y ^ x) = y

sign can only take one of two values, 0 or 0x80000000.

  • When x = 0, then x + (y ^ x) = y then trivial holds.
  • Adding and xoring by 0x80000000 is the same. It flips the sign bit. Therefore x + (y ^ x) = y also holds when x = 0x80000000.

Therefore, x + (y ^ x) reduces to y. And the code simplifies to this:

int fast_trunc_one(int i) {    int mantissa, exponent, sign, r;    mantissa = (i & 0x07fffff) | 0x800000;    exponent = 150 - ((i >> 23) & 0xff);    sign = i & 0x80000000;    if (exponent < 0) {        r = (mantissa << -exponent);    } else {        r = (mantissa >> exponent);    }    return r;}

Again, this compiles to the exact same assembly - register names and all.


This above version finally reduces to this:

int fast_trunc_one(int i) {    int mantissa, exponent;    mantissa = (i & 0x07fffff) | 0x800000;    exponent = 150 - ((i >> 23) & 0xff);    if (exponent < 0) {        return (mantissa << -exponent);             /* diff */    } else {        return (mantissa >> exponent);              /* diff */    }}

which is pretty much exactly what GCC generates in the assembly.


So why doesn't the compiler optimize fast_trunc_two() to the same thing?

The key part in fast_trunc_one() is the x + (y ^ x) = y optimization. In fast_trunc_two() the x + (y ^ x) expression is being split across the branch.

I suspect that might be enough to confuse GCC to not make this optimization. (It would need to hoist the ^ -sign out of the branch and merge it into the r + sign at the end.)

For example, this produces the same assembly as fast_trunc_one():

int fast_trunc_two(int i) {    int mantissa, exponent, sign, r;    mantissa = (i & 0x07fffff) | 0x800000;    exponent = 150 - ((i >> 23) & 0xff);    sign = i & 0x80000000;    if (exponent < 0) {        r = ((mantissa << -exponent) ^ -sign) + sign;             /* diff */    } else {        r = ((mantissa >> exponent) ^ -sign) + sign;              /* diff */    }    return r;                                     /* diff */}


This is the nature of compilers. Assuming they will take the fastest or best path, is quite false. Anyone that implies that you don't need to do anything to your code to optimize because "modern compilers" fill in the blank, do the best job, make the fastest code, etc. Actually I saw gcc get worse from 3.x to 4.x on arm at least. 4.x might have caught up to 3.x by this point, but early on it produced slower code. With practice you can learn how to write your code so the compiler doesn't have to work as hard and as a result produces more consistent and expected results.

The bug here is your expectations of what will be produced, not what was actually produced. If you want the compiler to generate the same output, feed it the same input. Not mathematically the same, not kinda the same, but actually the same, no different paths, no sharing or distributing operations from one version to the other. This is a good exercise in understanding how to write your code and seeing what compilers do with it. Don't make the mistake of assuming that because one version of gcc for one processor target one day produced a certain result that that is a rule for all compilers and all code. You have to use many compilers and many targets to get a feel for what is going on.

gcc is pretty nasty, I invite you to look behind the curtain, look at the guts of gcc, try to add a target or modify something yourself. It is barely held together by duct tape and bailing wire. An extra line of code added or removed in critical places and it comes crumbling down. The fact that it has produced usable code at all is something to be pleased about, instead of worrying about why it didnt meet other expectations.

did you look at what different versions of gcc produce? 3.x and 4.x in particular 4.5 vs 4.6 vs 4.7, etc? and for different target processors, x86, arm, mips, etc or different flavors of x86 if that is the native compiler you use, 32 bit vs 64 bit, etc? And then llvm (clang) for different targets?

Mystical has done an excellent job in the thought process required to work through the problem of analyzing/optimizing the code, expecting a compiler to come up with any of that is, well, not expected of any "modern compiler".

Without getting into the math properties, code of this form

if (exponent < 0) {  r = mantissa << -exponent;                       /* diff */} else {  r = mantissa >> exponent;                        /* diff */}return (r ^ -sign) + sign;                           /* diff */

is going to lead the compiler to A: implement it in that form, perform the if-then-else then converge on common code to finish up and return. or B: save a branch since this is the tail end of the function. Also not bother with using or saving r.

if (exponent < 0) {  return((mantissa << -exponent)^-sign)+sign;} else {  return((mantissa << -exponent)^-sign)+sign;}

Then you can get into as Mystical pointed out the sign variable disappears all together for the code as written. I wouldn't expect the compiler to see the sign variable go away so you should have done that yourself and not forced the compiler to try to figure it out.

This is a perfect opportunity to dig into the gcc source code. It appears you have found a case where the optimizer saw one thing in one case then another thing in another case. Then take the next step and see if you can't get gcc to see that case. Every optimization is there because some individual or group recognized the optimization and intentionally put it there. For this optimization to be there and work every time someone has to put it there (and then test it, and then maintain it into the future).

Definitely do not assume that less code is faster and more code is slower, it is very easy to create and find examples of that not being true. It might more often than not be the case of less code being faster than more code. As I demonstrated from the start though you can create more code to save branching in that case or looping, etc and have the net result be faster code.

The bottom line is you fed a compiler different source and expected the same results. The problem is not the compiler output but the expectations of the user. It is fairly easy to demonstrate for a particular compiler and processor, the addition of one line of code that makes a whole function dramatically slower. For example why does changing a = b + 2; to a = b + c + 2; cause _fill_in_the_blank_compiler_name_ generate radically different and slower code? The answer of course being the compiler was fed different code on the input so it is perfectly valid for the compiler to generate different output. (even better is when you swap two unrelated lines of code and cause the output to change dramatically) There is no expected relationship between the complexity and size of the input to the complexity and size of the output. Feed something like this into clang:

for(ra=0;ra<20;ra++) dummy(ra);

It produced somewhere between 60-100 lines of assembler. It unrolled the loop. I didn't count the lines, if you think about it, it has to add, copy the result to the input to the function call, make the function call, three operations minimum. so depending on the target that is probably 60 instructions at least, 80 if four per loop, 100 if five per loop, etc.


Mysticial has already given a great explanation, but I thought I'd add, FWIW, that there's really nothing fundamental about why a compiler would make the optimization for one and not the other.

LLVM's clang compiler, for example, gives the same code for both functions (except for the function name), giving:

_fast_trunc_two:                        ## @fast_trunc_one        movl    %edi, %edx        andl    $-2147483648, %edx      ## imm = 0xFFFFFFFF80000000        movl    %edi, %esi        andl    $8388607, %esi          ## imm = 0x7FFFFF        orl     $8388608, %esi          ## imm = 0x800000        shrl    $23, %edi        movzbl  %dil, %eax        movl    $150, %ecx        subl    %eax, %ecx        js      LBB0_1        shrl    %cl, %esi        jmp     LBB0_3LBB0_1:                                 ## %if.then        negl    %ecx        shll    %cl, %esiLBB0_3:                                 ## %if.end        movl    %edx, %eax        negl    %eax        xorl    %esi, %eax        addl    %edx, %eax        ret

This code isn't as short as the first gcc version from the OP, but not as long as the second.

Code from another compiler (which I won't name), compiling for x86_64, produces this for both functions:

fast_trunc_one:        movl      %edi, %ecx                shrl      $23, %ecx                 movl      %edi, %eax                movzbl    %cl, %edx                 andl      $8388607, %eax            negl      %edx                      orl       $8388608, %eax            addl      $150, %edx                movl      %eax, %esi                movl      %edx, %ecx                andl      $-2147483648, %edi        negl      %ecx                      movl      %edi, %r8d                shll      %cl, %esi                 negl      %r8d                      movl      %edx, %ecx                shrl      %cl, %eax                 testl     %edx, %edx                cmovl     %esi, %eax                xorl      %r8d, %eax                addl      %edi, %eax                ret                         

which is fascinating in that it computes both sides of the if and then uses a conditional move at the end to pick the right one.

The Open64 compiler produces the following:

fast_trunc_one:     movl %edi,%r9d                      sarl $23,%r9d                       movzbl %r9b,%r9d                    addl $-150,%r9d                     movl %edi,%eax                      movl %r9d,%r8d                      andl $8388607,%eax                  negl %r8d                           orl $8388608,%eax                   testl %r8d,%r8d                     jl .LBB2_fast_trunc_one             movl %r8d,%ecx                      movl %eax,%edx                      sarl %cl,%edx                   .Lt_0_1538:    andl $-2147483648,%edi              movl %edi,%eax                      negl %eax                           xorl %edx,%eax                      addl %edi,%eax                      ret                                 .p2align 5,,31.LBB2_fast_trunc_one:    movl %r9d,%ecx                      movl %eax,%edx                      shll %cl,%edx                       jmp .Lt_0_1538                  

and similar, but not identical, code for fast_trunc_two.

Anyway, when it comes to optimization, it's a lottery — it is what it is... It isn't always easy to know why you code gets compiled any particular way.