What is the fastest integer division supporting division by zero no matter what the result is? What is the fastest integer division supporting division by zero no matter what the result is? c c

What is the fastest integer division supporting division by zero no matter what the result is?


Inspired by some of the comments I got rid of the branch on my Pentium and gcc compiler using

int f (int x, int y){        y += y == 0;        return x/y;}

The compiler basically recognizes that it can use a condition flag of the test in the addition.

As per request the assembly:

.globl f    .type   f, @functionf:    pushl   %ebp    xorl    %eax, %eax    movl    %esp, %ebp    movl    12(%ebp), %edx    testl   %edx, %edx    sete    %al    addl    %edx, %eax    movl    8(%ebp), %edx    movl    %eax, %ecx    popl    %ebp    movl    %edx, %eax    sarl    $31, %edx    idivl   %ecx    ret

As this turned out to be such a popular question and answer, I'll elaborate a bit more. The above example is based on programming idiom that a compiler recognizes. In the above case a boolean expression is used in integral arithmetic and the use of condition flags are invented in hardware for this purpose. In general condition flags are only accessible in C through using idiom. That is why it so hard to make a portable multiple precision integer library in C without resorting to (inline) assembly. My guess is that most decent compilers will understand the above idiom.

Another way of avoiding branches, as also remarked in some of the above comments, is predicated execution. I therefore took philipp's first code and my code and ran it through the compiler from ARM and the GCC compiler for the ARM architecture, which features predicated execution. Both compilers avoid the branch in both samples of code:

Philipp's version with the ARM compiler:

f PROC        CMP      r1,#0        BNE      __aeabi_idivmod        MOVEQ    r0,#0        BX       lr

Philipp's version with GCC:

f:        subs    r3, r1, #0        str     lr, [sp, #-4]!        moveq   r0, r3        ldreq   pc, [sp], #4        bl      __divsi3        ldr     pc, [sp], #4

My code with the ARM compiler:

f PROC        RSBS     r2,r1,#1        MOVCC    r2,#0        ADD      r1,r1,r2        B        __aeabi_idivmod

My code with GCC:

f:        str     lr, [sp, #-4]!        cmp     r1, #0        addeq   r1, r1, #1        bl      __divsi3        ldr     pc, [sp], #4

All versions still need a branch to the division routine, because this version of the ARM doesn't have hardware for a division, but the test for y == 0 is fully implemented through predicated execution.


Here are some concrete numbers, on Windows using GCC 4.7.2:

#include <stdio.h>#include <stdlib.h>int main(){  unsigned int result = 0;  for (int n = -500000000; n != 500000000; n++)  {    int d = -1;    for (int i = 0; i != ITERATIONS; i++)      d &= rand();#if CHECK == 0    if (d == 0) result++;#elif CHECK == 1    result += n / d;#elif CHECK == 2    result += n / (d + !d);#elif CHECK == 3    result += d == 0 ? 0 : n / d;#elif CHECK == 4    result += d == 0 ? 1 : n / d;#elif CHECK == 5    if (d != 0) result += n / d;#endif  }  printf("%u\n", result);}

Note that I am intentionally not calling srand(), so that rand() always returns exactly the same results. Note also that -DCHECK=0 merely counts the zeroes, so that it is obvious how often appeared.

Now, compiling and timing it various ways:

$ for it in 0 1 2 3 4 5; do for ch in 0 1 2 3 4 5; do gcc test.cc -o test -O -DITERATIONS=$it -DCHECK=$ch && { time=`time ./test`; echo "Iterations $it, check $ch: exit status $?, output $time"; }; done; done

shows output that can be summarised in a table:

Iterations → | 0        | 1        | 2        | 3         | 4         | 5-------------+-------------------------------------------------------------------Zeroes       | 0        | 1        | 133173   | 1593376   | 135245875 | 373728555Check 1      | 0m0.612s | -        | -        | -         | -         | -Check 2      | 0m0.612s | 0m6.527s | 0m9.718s | 0m13.464s | 0m18.422s | 0m22.871sCheck 3      | 0m0.616s | 0m5.601s | 0m8.954s | 0m13.211s | 0m19.579s | 0m25.389sCheck 4      | 0m0.611s | 0m5.570s | 0m9.030s | 0m13.544s | 0m19.393s | 0m25.081sCheck 5      | 0m0.612s | 0m5.627s | 0m9.322s | 0m14.218s | 0m19.576s | 0m25.443s

If zeroes are rare, the -DCHECK=2 version performs badly. As zeroes start appearing more, the -DCHECK=2 case starts performing significantly better. Out of the other options, there really isn't much difference.

For -O3, though, it is a different story:

Iterations → | 0        | 1        | 2        | 3         | 4         | 5-------------+-------------------------------------------------------------------Zeroes       | 0        | 1        | 133173   | 1593376   | 135245875 | 373728555Check 1      | 0m0.646s | -        | -        | -         | -         | -Check 2      | 0m0.654s | 0m5.670s | 0m9.905s | 0m14.238s | 0m17.520s | 0m22.101sCheck 3      | 0m0.647s | 0m5.611s | 0m9.085s | 0m13.626s | 0m18.679s | 0m25.513sCheck 4      | 0m0.649s | 0m5.381s | 0m9.117s | 0m13.692s | 0m18.878s | 0m25.354sCheck 5      | 0m0.649s | 0m6.178s | 0m9.032s | 0m13.783s | 0m18.593s | 0m25.377s

There, check 2 has no drawback compared the other checks, and it does keep the benefits as zeroes become more common.

You should really measure to see what happens with your compiler and your representative sample data, though.


Without knowing the platform there is no way to know the exact most efficient method, however, on a generic system this may close to the optimum (using Intel assembler syntax):

(assume divisor is in ecx and the dividend is in eax)

mov ebx, ecxneg ebxsbb ebx, ebxadd ecx, ebxdiv eax, ecx

Four unbranched, single-cycle instructions plus the divide. The quotient will be in eax and the remainder will be in edx at the end. (This kind of shows why you don't want to send a compiler to do a man's job).