Haskell/GHC per thread memory costs

IMHO, the culprit is **threadDelay**: it uses a lot of memory. Here is a program equivalent to yours that behaves better with memory. It ensures that all the threads are running concurrently by giving each of them a long-running computation.

```haskell
import Control.Concurrent
import System.Environment

uBound = 38
lBound = 34

doSomething :: Integer -> Integer
doSomething 0 = 1
doSomething 1 = 1
doSomething n | n < uBound && n > 0 = let
                  a = doSomething (n-1)
                  b = doSomething (n-2)
                 in a `seq` b `seq` (a + b)
              | otherwise = doSomething (n `mod` uBound)

e :: Chan Integer -> Int -> IO ()
e mvar i =
    do
        let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound))
        y `seq` writeChan mvar y

main =
    do
        args <- getArgs
        let (numThreads, sleep) = case args of
                                    numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                    _ -> error "wrong args"
            dld = sleep * 1000 * 1000
        chan <- newChan
        mapM_ (\i -> forkIO $ e chan i) [1..numThreads]
        putStrLn "All threads created"
        mapM_ (\_ -> readChan chan >>= putStrLn . show) [1..numThreads]
        putStrLn "All read"
```

And here are the timing statistics:

```
$ ghc -rtsopts -O -threaded test.hs
$ ./test 200 10 +RTS -sstderr -N4

 133,541,985,480 bytes allocated in the heap
     176,531,576 bytes copied during GC
         356,384 bytes maximum residency (16 sample(s))
          94,256 bytes maximum slop
               4 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     64246 colls, 64246 par    1.185s   0.901s     0.0000s    0.0274s
  Gen  1        16 colls,    15 par    0.004s   0.002s     0.0001s    0.0002s

  Parallel GC work balance: 65.96% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time   63.747s  ( 16.333s elapsed)
  GC      time    1.189s  (  0.903s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time   64.938s  ( 17.239s elapsed)

  Alloc rate    2,094,861,384 bytes per MUT second

  Productivity  98.2% of total user, 369.8% of total elapsed

gc_alloc_block_sync: 98548
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2
```

Maximum residency works out to around 1.8 kB per thread (356,384 bytes across 200 threads). I played a bit with the number of threads and the length of the computation. Since threads start doing work immediately after forkIO, creating 100,000 of them actually takes a very long time, but the results held for 1,000 threads.
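For contrast, here is a minimal sketch (my reconstruction, not the code from the question) of the pattern under discussion: many threads all parked in threadDelay at once. `spawnDelayers` is my own name; run it with a large thread count and `+RTS -sstderr` to compare the maximum residency against the compute-bound version above.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (replicateM_)

-- The problematic pattern: every forked thread immediately blocks in
-- threadDelay, so at the peak all of them sit in the timer machinery
-- at the same time.
spawnDelayers :: Int -> Int -> IO ()
spawnDelayers numThreads sleepSecs = do
    replicateM_ numThreads (forkIO (threadDelay (sleepSecs * 1000 * 1000)))
    putStrLn "All threads created"
    -- Wait slightly longer than the workers so they all finish.
    threadDelay ((sleepSecs + 1) * 1000 * 1000)

main :: IO ()
main = spawnDelayers 1000 1  -- bump the counts to reproduce at scale
```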

Here is another program where threadDelay has been "factored out". This one doesn't use any CPU and can easily be run with 100,000 threads:

```haskell
import Control.Concurrent
import System.Environment

e :: MVar () -> MVar () -> IO ()
e start end =
    do
        takeMVar start
        putMVar end ()

main =
    do
        args <- getArgs
        let (numThreads, sleep) = case args of
                                    numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                                    _ -> error "wrong args"
        starts <- mapM (const newEmptyMVar) [1..numThreads]
        ends <- mapM (const newEmptyMVar) [1..numThreads]
        mapM_ (\(start, end) -> forkIO $ e start end) (zip starts ends)
        mapM_ (\start -> putMVar start ()) starts
        putStrLn "All threads created"
        threadDelay (sleep * 1000 * 1000)
        mapM_ (\end -> takeMVar end) ends
        putStrLn "All done"
```

And the results:

```
     129,270,632 bytes allocated in the heap
     404,154,872 bytes copied during GC
      77,844,160 bytes maximum residency (10 sample(s))
      10,929,688 bytes maximum slop
             165 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       128 colls,   128 par    0.178s   0.079s     0.0006s    0.0152s
  Gen  1        10 colls,     9 par    0.367s   0.137s     0.0137s    0.0325s

  Parallel GC work balance: 50.09% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.189s  ( 10.094s elapsed)
  GC      time    0.545s  (  0.217s elapsed)
  EXIT    time    0.001s  (  0.002s elapsed)
  Total   time    0.735s  ( 10.313s elapsed)

  Alloc rate    685,509,460 bytes per MUT second

  Productivity  25.9% of total user, 1.8% of total elapsed
```

On my i5, it takes less than one second to create the 100,000 threads and fill the "start" MVars. The peak residency is around 778 bytes per thread, not bad at all!
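As a sanity check on that figure, the per-thread cost is just the RTS "maximum residency" divided by the thread count (`perThreadBytes` is a helper name of my own, not anything from base):

```haskell
-- Back out an approximate per-thread residency from the
-- "maximum residency" line of +RTS -sstderr output.
perThreadBytes :: Integer -> Integer -> Integer
perThreadBytes maxResidency numThreads = maxResidency `div` numThreads

main :: IO ()
main = do
    -- 77,844,160 bytes across 100,000 MVar threads:
    print (perThreadBytes 77844160 100000)  -- 778
    -- 356,384 bytes across 200 compute threads in the first program:
    print (perThreadBytes 356384 200)       -- 1781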


Checking threadDelay's implementation, we see that it is effectively different in the threaded and non-threaded cases:

https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Conc.IO.html#threadDelay

Then here: https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Event.TimerManager.html

which looks innocent enough. But an older version of base has an arcane spelling of (memory) doom for those who invoke threadDelay:

https://hackage.haskell.org/package/base-4.4.0.0/docs/src/GHC-Event-Manager.html#line-121

Whether there is still an issue is hard to say. However, one can always hope that a "real life" concurrent program won't need too many threads waiting on threadDelay at the same time. I, for one, will keep an eye on my threadDelay usage from now on.
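For intuition, on the threaded RTS threadDelay essentially registers a timeout with the timer manager linked above and blocks until it fires. A rough base-only sketch of the same behaviour, built on registerDelay from GHC.Conc (`myThreadDelay` is my own name; this approximates, rather than reproduces, the actual library code):

```haskell
import GHC.Conc (atomically, readTVar, registerDelay, retry)

-- Roughly what threadDelay amounts to on the threaded RTS: ask the
-- timer manager for a TVar that flips to True after the given number
-- of microseconds, then block (retrying) until it does.
-- Note: registerDelay itself requires the threaded RTS.
myThreadDelay :: Int -> IO ()
myThreadDelay usecs = do
    done <- registerDelay usecs
    atomically $ do
        fired <- readTVar done
        if fired then return () else retry

main :: IO ()
main = do
    myThreadDelay 1000  -- ~1 ms
    putStrLn "woke up"
```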