Idiomatic option pricing and risk using Repa parallel arrays Idiomatic option pricing and risk using Repa parallel arrays arrays arrays

Idiomatic option pricing and risk using Repa parallel arrays


Your bug is unrelated to your code -- it is your use of -fext-core to print the output of compilation in the External Core format. Just don't do that (to see the core, I use ghc-core).

Compile with -O2 and -threaded:

$ ghc -O2 -rtsopts --make A.hs -threaded [1 of 1] Compiling Main             ( A.hs, A.o )Linking A ...

Then run with +RTS -N4, for example, to use 4 threads:

$ time ./A +RTS -N4[0.0,0.0,8.4375e-3,8.4375e-3,50.009375,50.009375,100.0,100.0]./A  0.00s user 0.00s system 85% cpu 0.008 total

So it completes too quickly to see a result. I'll increase your m and p parameters to 1k and 3k

$ time ./A +RTS -N2./A +RTS -N2  3.03s user 1.33s system 159% cpu 2.735 total

So yes, it does run in parallel. 1.6x on a 2 core machine, at a first attempt.Whether or not it is efficient is another question. Use +RTS -s you can see the runtime stats:

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

So we had 3 threads running in parallel (2 presumably for the algo, one for the IO manager).

You can reduce running time by adjusting the GC settings. E.g. by using -A we can reduce GC overhead, and yield genuine parallel speedups.

$ time ./A +RTS -N1 -A100M   ./A +RTS -N1 -A100M  1.99s user 0.29s system 99% cpu 2.287 total$ time ./A +RTS -N2 -A100M   ./A +RTS -N2 -A100M  2.30s user 0.86s system 147% cpu 2.145 total

You can improve numeric performance sometimes by using the LLVM backend. This seems to be the case here as well:

$ ghc -O2 -rtsopts --make A.hs -threaded -fforce-recomp -fllvm[1 of 1] Compiling Main             ( A.hs, A.o )Linking A ...$ time ./A +RTS -N2 -A100M                                    ./A +RTS -N2 -A100M  2.09s user 0.95s system 147% cpu 2.065 total

Nothing spectacular, but you are improving running time over your single threaded version, and I've not modified your original code in any way. To really improve things, you'll need to profile and optimize.

Revisiting the -A flags, we can bring the time down yet further using a tigher bound on the initial thread allocation area.

$ ghc -Odph -rtsopts --make A.hs -threaded -fforce-recomp -fllvm$ time ./A +RTS -N2 -A60M -s./A +RTS -N2 -A60M 1.99s user 0.73s system 144% cpu 1.880 total

So have brought it down to 1.8s from 2.7 (30% improvement) by using the parallel runtime, the LLVM backend, and being careful with GC flags. You can look at the GC flag surface to find the optimum:

enter image description here

With the trough around -A64 -N2 being ideal for the data set size.

I would also strongly consider using manual common subexpression elimination in your inner kernel, to avoid recomputing things excessively.

As Alp suggests, to see the runtime behavior of the program, compile threadscope (from Hackage) and run as follows:

$ ghc -O2 -fllvm -rtsopts -threaded -eventlog --make A.hs$ ./A +RTS -ls -N2 -A60M

And you get an event trace for your two cores like so:

enter image description here

So what's going on here? You have an initial period (0.8s) of setup time -- allocating your big list and encoding it into a repa array -- as you can see by the single threaded interleaving of GC and execution. Then there's another 0.8s of something on a single core, before your actual parallel work occurs for the last 300ms.

So while your actual algorithm may be parallelizing nicely, all the surrounding test setup basically swamps the result. If we serialize your dataset, and then just load it back from disk, we can get better behavior:

$ time ./A +RTS -N2 -A60M./A +RTS -N2 -A60M  1.76s user 0.25s system 186% cpu 1.073 total

and now your profile looks healthier:

enter image description here

This looks great! Very little GC (98.9% productivity), and my two cores running in parallel happily.

So, finally, we can see you get good parallelism:

With 1 core, 1.855s

$ time ./A +RTS -N1 -A25M./A +RTS -N1 -A25M  1.75s user 0.11s system 100% cpu 1.855 total

and with 2 cores, 1.014s

$ time ./A +RTS -N2 -A25M   ./A +RTS -N2 -A25M  1.78s user 0.13s system 188% cpu 1.014 total

Now, specifically answer your questions:

  1. Is there a more idiomatic way of doing this in repa?

In general, repa code should consist of parallel traverals, consumers and produces, and inlinable kernel functions. So as long as you are doing that, then the code is probably idiomatic. If in doubt, look at the tutorial. I'd in general mark your worker kernels (like f) as inlined.

  1. Will the above method actually price in parallel?

The code will execute in parallel if you use parallel combinators like computeP or the various maps and folds. So yes, it should and does run in parallel.

  1. How can I determine whether mycode really is generating something that will be executed inparallel?

Generally, you will know a priori because you use parallel operations. If in doubt, run the code and observe the speedup. You may then need to optimize the code.