Haskell/GHC per thread memory costs
IMHO, the culprit is threadDelay: it uses a lot of memory. Here is a program equivalent to yours that behaves much better memory-wise. It keeps all the threads alive concurrently by giving each one a long-running computation.

    import Control.Concurrent
    import System.Environment (getArgs)

    uBound = 38
    lBound = 34

    -- Deliberately slow, Fibonacci-style recursion to keep each thread busy.
    doSomething :: Integer -> Integer
    doSomething 0 = 1
    doSomething 1 = 1
    doSomething n
        | n < uBound && n > 0 =
            let a = doSomething (n-1)
                b = doSomething (n-2)
            in a `seq` b `seq` (a + b)
        | otherwise = doSomething (n `mod` uBound)

    -- Worker: compute a value that depends on the thread index, then report it.
    e :: Chan Integer -> Int -> IO ()
    e mvar i = do
        let y = doSomething . fromIntegral $ lBound + (fromIntegral i `mod` (uBound - lBound))
        y `seq` writeChan mvar y

    main = do
        args <- getArgs
        let (numThreads, sleep) = case args of
                numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                _ -> error "wrong args"
            dld = (sleep*1000*1000)  -- unused here: this version never sleeps
        chan <- newChan
        mapM_ (\i -> forkIO $ e chan i) [1..numThreads]
        putStrLn "All threads created"
        mapM_ (\_ -> readChan chan >>= putStrLn . show) [1..numThreads]
        putStrLn "All read"

And here are the timing statistics:

    $ ghc -rtsopts -O -threaded test.hs
    $ ./test 200 10 +RTS -sstderr -N4

     133,541,985,480 bytes allocated in the heap
         176,531,576 bytes copied during GC
             356,384 bytes maximum residency (16 sample(s))
              94,256 bytes maximum slop
                   4 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0     64246 colls, 64246 par    1.185s   0.901s     0.0000s    0.0274s
      Gen  1        16 colls,    15 par    0.004s   0.002s     0.0001s    0.0002s

      Parallel GC work balance: 65.96% (serial 0%, perfect 100%)

      TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.000s  (  0.003s elapsed)
      MUT     time   63.747s  ( 16.333s elapsed)
      GC      time    1.189s  (  0.903s elapsed)
      EXIT    time    0.001s  (  0.000s elapsed)
      Total   time   64.938s  ( 17.239s elapsed)

      Alloc rate    2,094,861,384 bytes per MUT second

      Productivity  98.2% of total user, 369.8% of total elapsed

    gc_alloc_block_sync: 98548
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 2

Maximum residency works out to roughly 1.8 KB per thread (356,384 bytes over 200 threads). I played a bit with the number of threads and the length of the computation. Since threads start working immediately after forkIO, creating 100000 threads actually takes a very long time, but the results held for 1000 threads.
Here is another program where threadDelay has been "factored out" of the worker threads and is called only once, from main. It uses almost no CPU and runs easily with 100000 threads:

    import Control.Concurrent
    import System.Environment (getArgs)

    -- Worker: wait for the start signal, then signal completion.
    e :: MVar () -> MVar () -> IO ()
    e start end = do
        takeMVar start
        putMVar end ()

    main = do
        args <- getArgs
        let (numThreads, sleep) = case args of
                numS:sleepS:[] -> (read numS :: Int, read sleepS :: Int)
                _ -> error "wrong args"
        starts <- mapM (const newEmptyMVar) [1..numThreads]
        ends <- mapM (const newEmptyMVar) [1..numThreads]
        mapM_ (\(start,end) -> forkIO $ e start end) (zip starts ends)
        mapM_ (\start -> putMVar start ()) starts
        putStrLn "All threads created"
        threadDelay (sleep * 1000 * 1000)  -- the only threadDelay in the program
        mapM_ (\end -> takeMVar end) ends
        putStrLn "All done"

And the results:

         129,270,632 bytes allocated in the heap
         404,154,872 bytes copied during GC
          77,844,160 bytes maximum residency (10 sample(s))
          10,929,688 bytes maximum slop
                 165 MB total memory in use (0 MB lost due to fragmentation)

                                        Tot time (elapsed)  Avg pause  Max pause
      Gen  0       128 colls,   128 par    0.178s   0.079s     0.0006s    0.0152s
      Gen  1        10 colls,     9 par    0.367s   0.137s     0.0137s    0.0325s

      Parallel GC work balance: 50.09% (serial 0%, perfect 100%)

      TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.000s  (  0.001s elapsed)
      MUT     time    0.189s  ( 10.094s elapsed)
      GC      time    0.545s  (  0.217s elapsed)
      EXIT    time    0.001s  (  0.002s elapsed)
      Total   time    0.735s  ( 10.313s elapsed)

      Alloc rate    685,509,460 bytes per MUT second

      Productivity  25.9% of total user, 1.8% of total elapsed

On my i5, it takes less than one second to create the 100000 threads and fill their "start" MVars. Peak residency is around 778 bytes per thread (77,844,160 bytes over 100000 threads), not bad at all!
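For reference, the per-thread figures come from dividing the "maximum residency" line of the +RTS -s output by the number of threads. A minimal sketch of that arithmetic follows; bytesPerThread is a hypothetical helper name, and the estimate folds the runtime's own baseline heap into the per-thread number, so it is best read as an upper bound.

```haskell
-- Rough per-thread residency: maximum residency (bytes) divided by thread count.
-- The baseline heap of the runtime is included, so this overestimates slightly.
bytesPerThread :: Integer -> Integer -> Integer
bytesPerThread maxResidency numThreads = maxResidency `div` numThreads

main :: IO ()
main = do
    -- CPU-bound Chan program: 356,384 bytes maximum residency, 200 threads
    print (bytesPerThread 356384 200)
    -- MVar program: 77,844,160 bytes maximum residency, 100000 threads
    print (bytesPerThread 77844160 100000)
```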
Checking threadDelay's implementation, we see that it takes a different code path in the threaded and non-threaded runtimes:
https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Conc.IO.html#threadDelay
Then here: https://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Event.TimerManager.html
which looks innocent enough. But an older version of base has an arcane spelling of (memory) doom for those that invoke threadDelay:
https://hackage.haskell.org/package/base-4.4.0.0/docs/src/GHC-Event-Manager.html#line-121
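To check which of the two paths a given binary takes, one option is to ask the runtime whether it was linked with -threaded. The sketch below uses rtsSupportsBoundThreads from Control.Concurrent, which, as far as I can tell from the linked GHC.Conc.IO source, reflects the same flag threadDelay branches on. delayPath is a hypothetical helper, and its descriptions are my reading of the non-Windows code path, not an authoritative statement about every base version.

```haskell
import Control.Concurrent (rtsSupportsBoundThreads, threadDelay)

-- Describe the threadDelay code path expected for a given RTS flavour.
-- Assumption: wording reflects the non-Windows path in base-4.8.
delayPath :: Bool -> String
delayPath threaded
    | threaded  = "threaded RTS: threadDelay registers a timeout with the timer manager"
    | otherwise = "non-threaded RTS: threadDelay goes through the delay# primitive"

main :: IO ()
main = do
    -- True when the program was linked with -threaded.
    putStrLn (delayPath rtsSupportsBoundThreads)
    threadDelay 1000  -- 1 ms, just to exercise the call
```

Compiling this once with -threaded and once without should print the two different messages.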
Whether this is still an issue is hard to say. However, one can always hope that a "real life" concurrent program won't need too many threads waiting on threadDelay at the same time. I for one will keep an eye on my usage of threadDelay from now on.
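One way to keep the number of concurrent threadDelay calls down, in the spirit of the second program above, is to have a single thread sleep and then release all the waiting workers through a shared MVar. This is only a sketch under that assumption: runWithSharedDelay and the gate variable are hypothetical names, and I have not measured its memory behaviour.

```haskell
import Control.Concurrent
import Control.Monad (forM_, replicateM)

-- Block n workers on one shared "gate" MVar and release them all after a
-- single threadDelay in the calling thread, instead of n separate delays.
-- Returns the number of workers that reported completion.
runWithSharedDelay :: Int -> Int -> IO Int
runWithSharedDelay n micros = do
    gate  <- newEmptyMVar                -- filled once, read by every worker
    dones <- replicateM n newEmptyMVar   -- one completion signal per worker
    forM_ dones $ \done -> forkIO $ do
        readMVar gate                    -- blocks until the gate is filled
        putMVar done ()
    threadDelay micros                   -- the only timer in the program
    putMVar gate ()
    mapM_ takeMVar dones                 -- wait for every worker to finish
    return (length dones)

main :: IO ()
main = do
    finished <- runWithSharedDelay 1000 (100 * 1000)  -- 1000 workers, 0.1 s
    putStrLn ("Workers released: " ++ show finished)
```

readMVar leaves the value in place, so one putMVar is enough to wake every worker.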