How to improve performance of this numerical computation in Haskell?
Use the same control and data structures, yielding:
{-# LANGUAGE BangPatterns #-}{-# OPTIONS_GHC -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}{-# INLINE trigamma #-}trigamma :: Double -> Doubletrigamma x = go 0 (x' - 1) p' where x' = x + 6 p = 1 / (x' * x') p' =(((((0.075757575757576*p-0.033333333333333)*p+0.0238095238095238) *p-0.033333333333333)*p+0.166666666666667)*p+1)/x'+0.5*p go :: Int -> Double -> Double -> Double go !i !x !p | i >= 6 = p | otherwise = go (i+1) (x-1) (1 / (x*x) + p)
I don't have your testsuite, but this yields the following asm:
A_zdwgo_info: cmpq $5, %r14 jg .L3 movsd .LC0(%rip), %xmm7 movapd %xmm5, %xmm8 movapd %xmm7, %xmm9 mulsd %xmm5, %xmm8 leaq 1(%r14), %r14 divsd %xmm8, %xmm9 subsd %xmm7, %xmm5 addsd %xmm9, %xmm6 jmp A_zdwgo_info
Which looks ok. This is the kind of code the -fllvm
backend does a good job.
GCC unrolls the loop though, and the only way to do that is either via Template Haskell or manual unrolling. You might consider that (a TH macro) if doing a lot of this.
Actually, the GHC LLVM backend does unroll the loop :-)
Finally, if you really like the original Haskell version, write it using stream fusion combinators, and GHC will convert it back into loops. (Exercise for the reader).
Before the optimization work, I wouldn't say that your original translation is the most idiomatic way to express in Haskell what the C code is doing.
How would the optimization process have proceeded if we started with the following instead:
trigamma :: Double -> Doubletrigamma x = foldl' (+) p' . map invSq . take 6 . iterate (+ 1) $ xwhere invSq y = 1 / (y * y) x' = x + 6 p = invSq x' p' =(((((0.075757575757576*p-0.033333333333333)*p+0.0238095238095238) *p-0.033333333333333)*p+0.166666666666667)*p+1)/x'+0.5*p