Maximum speed from iOS/iPad/iPhone


If you are doing a lot of floating point calculations, it would benefit you greatly to use Apple's Accelerate framework. It is designed to use the floating point hardware to do calculations on vectors in parallel.
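
For instance (a minimal sketch of my own, not from the original answer), an element-wise multiply of two float vectors with vDSP looks like this; the function name multiply_vectors is just illustrative:

#include <Accelerate/Accelerate.h>

/* c[i] = a[i] * b[i] for n elements; vDSP dispatches to the
   vector/NEON hardware where available. */
void multiply_vectors(const float *a, const float *b, float *c, vDSP_Length n)
{
    vDSP_vmul(a, 1, b, 1, c, 1, n);
}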

I will also address your points one by one:

1) This is not because of the CPU itself; it is because, as of the armv7 era, Apple replaced the floating-point hardware, so only 32-bit floating-point operations are calculated on the floating-point unit, while 64-bit ones are calculated in software instead. In exchange, 32-bit operations got much faster (see the float-vs-double sketch after this list).

2) NEON is the name of the new floating-point processor's instruction set.

3) Yes, this is a well-known method. An alternative is to use the Accelerate framework I mentioned above: it provides sin and cos functions that calculate four values in parallel (see the sin/cos sketch below). The algorithms are fine-tuned in assembly and NEON, so they deliver maximum performance while using minimal battery.

4) The new armv7 implementation of Thumb doesn't have the drawbacks of the armv6 one; the recommendation to disable Thumb only applies to armv6.

5) Yes, considering that about 80% of users are on iOS 5.0 or above now (armv6 devices stopped receiving updates at iOS 4.2.1), that is perfectly acceptable for most situations.

6) This happens automatically when you build in release mode.

7) Yes, this won't have as large an effect as the above methods though.
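
As a small illustration of point 1 (my own sketch, not from the original question): keep calculations in single precision by using float variables, float literals, and the f-suffixed math functions, so nothing silently promotes to double.

#include <math.h>

/* Double precision: the 0.5 and 1.0 literals and sin() force 64-bit math. */
double slow(double x) { return 0.5 * sin(x) + 1.0; }

/* Single precision: float literals and sinf() stay on the 32-bit FP hardware. */
float fast(float x) { return 0.5f * sinf(x) + 1.0f; }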

My recommendation is to check out Accelerate. That way you can make sure you are leveraging the full power of the floating point processor.
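
As a sketch of the parallel sin/cos mentioned in point 3: the vForce part of Accelerate exposes vvsinf/vvcosf, which fill whole arrays in a single call (whether these are exactly the four-at-a-time routines meant above is my assumption):

#include <Accelerate/Accelerate.h>

/* Compute sin and cos for an array of angles, one call per table.
   vForce processes the whole buffer using the vector hardware internally. */
void sin_cos_table(const float *angles, float *sines, float *cosines, int count)
{
    vvsinf(sines, angles, &count);
    vvcosf(cosines, angles, &count);
}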


Some feedback on the earlier answer, expanding on the idea about dead code in point 7, which was meant a bit more broadly. I'm posting this as an answer because I need code formatting, which comments don't allow. The code in question was in OpenCV:

for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
    vec[kk] = 0;
}

I wanted to see what it looks like in assembly. To make sure I could find it in the output, I wrapped it like this:

__asm__("#start");
for( kk = 0; kk < (int)(descriptors->elem_size/sizeof(vec[0])); kk++ ) {
    vec[kk] = 0;
}
__asm__("#stop");

Then I used "Product -> Generate Output -> Assembly File" and what I got is:

    @ InlineAsm Start
    #start
    @ InlineAsm End
Ltmp1915:
    ldr r0, [sp, #84]
    movs    r1, #0
    ldr r0, [r0, #16]
    ldr r0, [r0, #28]
    cmp r0, #4
    mov r0, r4
    blo LBB14_71
LBB14_70:
Ltmp1916:
    ldr r3, [sp, #84]
    movs    r2, #0
Ltmp1917:
    str r2, [r0], #4
    adds    r1, #1
Ltmp1918:
Ltmp1919:
    ldr r2, [r3, #16]
    ldr r2, [r2, #28]
    lsrs    r2, r2, #2
    cmp r2, r1
    bgt LBB14_70
LBB14_71:
Ltmp1920:
    add.w   r0, r4, #8
    @ InlineAsm Start
    #stop
    @ InlineAsm End

That's a lot of code. I printf'd the value of (int)(descriptors->elem_size/sizeof(vec[0])) and it was always 64, so I hardcoded it to 64 and generated the assembly again.
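
The hardcoded version was essentially the same wrapped loop with the computed bound replaced by the constant (a reconstruction on my part; vec appears to be an array of floats here):

__asm__("#start");
for( kk = 0; kk < 64; kk++ ) {    /* constant observed at runtime */
    vec[kk] = 0;
}
__asm__("#stop");

This time the output was: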

    @ InlineAsm Start
    #start
    @ InlineAsm End
Ltmp1915:
    vldr.32 s16, LCPI14_7
    mov r0, r4
    movs    r1, #0
    mov.w   r2, #256
    blx _memset
    @ InlineAsm Start
    #stop
    @ InlineAsm End

As you can see, the optimizer now got the idea and the code became much shorter: it collapsed the whole loop into a single memset call. The point is that the compiler doesn't always know which inputs are constant. Something like a camera frame size or pixel depth is a runtime value in general, but in my contexts it is usually constant, and all I care about is speed.
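
When such a value is fixed in practice, one way to hand that knowledge to the optimizer (a sketch with made-up names, not from the original post) is to route the hot path through a compile-time constant:

/* Fixed for my app, even though it is a runtime parameter in general.
   With a compile-time constant the compiler can unroll, vectorize,
   or collapse the loop into memset, as it did above. */
enum { kDescriptorLength = 64 };   /* hypothetical constant */

static void clear_descriptor(float *vec)
{
    for (int kk = 0; kk < kDescriptorLength; kk++)
        vec[kk] = 0;
}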

I also tried Accelerate, as suggested, replacing those three lines with:

__asm__("#start");
vDSP_vclr(vec, 1, 64);
__asm__("#stop");

The assembly now looks like this:

    @ InlineAsm Start
    #start
    @ InlineAsm End
Ltmp1917:
    str r1, [r7, #-140]
Ltmp1459:
Ltmp1918:
    movs    r1, #1
    movs    r2, #64
    blx _vDSP_vclr
Ltmp1460:
Ltmp1919:
    add.w   r0, r4, #8
    @ InlineAsm Start
    #stop
    @ InlineAsm End

I'm not sure whether this is faster than bzero, though. In my context this part doesn't take much time, and the two variants seemed to run at the same speed.
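
For reference, the plain-libc variant I was comparing it against is just this (assuming the same 64-float buffer):

#include <strings.h>

/* Clear 64 floats with plain libc, for timing against vDSP_vclr. */
static void clear_with_bzero(float *vec)
{
    bzero(vec, 64 * sizeof(vec[0]));   /* or memset(vec, 0, 64 * sizeof(vec[0])) */
}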

One more thing I learned is to use the GPU. More about that here: http://www.sunsetlakesoftware.com/2012/02/12/introducing-gpuimage-framework