Is there a more efficient way to get the length of a 32bit integer in bytes?
How about this one?
inline int len(uint32 val){ return 4 - ((val & 0xff000000) == 0) - ((val & 0xffff0000) == 0) - ((val & 0xffffff00) == 0) ;}
Removing the inline
keyword, g++ -O2
compiles this to the following branchless code:
movl 8(%ebp), %edxmovl %edx, %eaxandl $-16777216, %eaxcmpl $1, %eaxsbbl %eax, %eaxaddl $4, %eaxxorl %ecx, %ecxtestl $-65536, %edxsete %clsubl %ecx, %eaxandl $-256, %edxsete %dlmovzbl %dl, %edxsubl %edx, %eax
If you don't mind machine-specific solutions, you can use the bsr
instruction which searches for the first 1 bit. Then you simply divide by 8 to convert bits to bytes and add 1 to shift the range 0..3 to 1..4:
int len(uint32 val){ asm("mov 8(%ebp), %eax"); asm("or $255, %eax"); asm("bsr %eax, %eax"); asm("shr $3, %eax"); asm("inc %eax"); asm("mov %eax, 8(%ebp)"); return val;}
Note that I am not an inline assembly god, so maybe there's a better to solution to access val
instead of addressing the stack explicitly. But you should get the basic idea.
The GNU compiler also has an interesting built-in function called __builtin_clz
:
inline int len(uint32 val){ return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;}
This looks much better than the inline assembly version to me :)
I did a mini unscientific benchmark just measuring the difference in GetTickCount() calls when calling the function in a loop from 0 to MAX_LONG times under the VS 2010 compiler.
Here's what I saw:
This took 11497 ticks
inline int len(uint32 val){ if(val <= 0x000000ff) return 1; if(val <= 0x0000ffff) return 2; if(val <= 0x00ffffff) return 3; return 4;}
While this took 14399 ticks
inline int len(uint32 val){ return 4 - ((val & 0xff000000) == 0) - ((val & 0xffff0000) == 0) - ((val & 0xffffff00) == 0) ;}
edit: my idea about why one was faster is wrong because:
inline int len(uint32 val){ return 1 + (val > 0x000000ff) + (val > 0x0000ffff) + (val > 0x00ffffff) ;}
This version used only 11107 ticks. Since + is faster than - perhaps? I'm not sure.
Even faster though was the binary search at 7161 ticks
inline int len(uint32 val){ if (val & 0xffff0000) return (val & 0xff000000)? 4: 3; return (val & 0x0000ff00)? 2: 1;}
And fastest so far is using the MS intrinsic function, at 4399 ticks
#pragma intrinsic(_BitScanReverse)inline int len2(uint32 val){ DWORD index; _BitScanReverse(&index, val); return (index>>3)+1;}
For reference - here's the code i used to profile:
int _tmain(int argc, _TCHAR* argv[]){ int j = 0; DWORD t1,t2; t1 = GetTickCount(); for(ULONG i=0; i<-1; i++) j=len(i); t2 = GetTickCount(); _tprintf(_T("%ld ticks %ld\n"), t2-t1, j); t1 = GetTickCount(); for(ULONG i=0; i<-1; i++) j=len2(i); t2 = GetTickCount(); _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);}
Had to print j to prevent the loops from being optimized out.
Do you really have profile evidence that this is a significant bottleneck in your application? Just do it the most obvious way and only if profiling shows it to be a problem (which I doubt), then try to improve things. Most likely you'll get the best improvement by reducing the number of calls to this function than by changing something within it.