Haskell performance implementing unix's "cat" program with Data.ByteString Haskell performance implementing unix's "cat" program with Data.ByteString unix unix

Haskell performance implementing unix's "cat" program with Data.ByteString


This is only a partial answer trying to address the second question:

I tried something like this using GHC.IO.Buffer API:

module Main whereimport System.IOimport System.Environmentimport GHC.IO.Bufferimport Data.ByteString as BSimport Control.Monad-- Copied from cat source codebufsize = 1024*128go handle bufPtr = do  read <- hGetBuf handle bufPtr bufsize  when (read > 0) $ do    hPutBuf stdout bufPtr read    go handle bufPtrmain = do  file    <- fmap Prelude.head getArgs  handle  <- openFile file ReadMode  buf     <- newByteBuffer bufsize WriteBuffer  withBuffer buf $ go handle

and it seems to come closer to the performance of 'cat', but still definitely slower...

time ./Cat huge > /dev/null ./Cat huge > /dev/null  0.00s user 0.06s system 76% cpu 0.081 totaltime cat huge > /dev/null  cat huge > /dev/null  0.00s user 0.05s system 75% cpu 0.063 total

I think using the buffer API, we can explicitly avoid allocating all the buffer bytestrings when using hGetSome like in the original code, but I am just guessing here and don't know either what exactly is happening in both compiled codes...

UPDATE: Adding the original code's performance on my laptop:

time ./Cat2 huge > /dev/null./Cat2 huge > /dev/null  0.12s user 0.10s system 99% cpu 0.219 total

UPDATE 2: Adding some basic profiling results:

Original Code:

Cat2 +RTS -p -RTS huge    total time  =        0.21 secs   (211 ticks @ 1000 us, 1 processor)    total alloc = 6,954,068,112 bytes  (excludes profiling overheads)COST CENTRE MODULE  %time %allocMAIN        MAIN    100.0  100.0                                                       individual     inheritedCOST CENTRE MODULE                   no.     entries  %time %alloc   %time %allocMAIN        MAIN                      46           0  100.0  100.0   100.0  100.0 CAF        GHC.IO.Handle.FD          86           0    0.0    0.0     0.0    0.0 CAF        GHC.Conc.Signal           82           0    0.0    0.0     0.0    0.0 CAF        GHC.IO.Encoding           80           0    0.0    0.0     0.0    0.0 CAF        GHC.IO.FD                 79           0    0.0    0.0     0.0    0.0 CAF        System.Posix.Internals    75           0    0.0    0.0     0.0    0.0 CAF        GHC.IO.Encoding.Iconv     72           0    0.0    0.0     0.0    0.0

Buffer-API Code:

Cat +RTS -p -RTS huge    total time  =        0.06 secs   (61 ticks @ 1000 us, 1 processor)    total alloc =   3,487,712 bytes  (excludes profiling overheads)COST CENTRE MODULE  %time %allocMAIN        MAIN    100.0   98.9                                                      individual     inheritedCOST CENTRE MODULE                  no.     entries  %time %alloc   %time %allocMAIN        MAIN                     44           0  100.0   98.9   100.0  100.0 CAF        GHC.IO.Handle.FD         85           0    0.0    1.0     0.0    1.0 CAF        GHC.Conc.Signal          82           0    0.0    0.0     0.0    0.0 CAF        GHC.IO.Encoding          80           0    0.0    0.1     0.0    0.1 CAF        GHC.IO.FD                79           0    0.0    0.0     0.0    0.0 CAF        GHC.IO.Encoding.Iconv    71           0    0.0    0.0     0.0    0.0

Notice especially the big difference in allocation costs...


The original question made me think it was about finding a performance issue in the exact code provided. Since the comment "I'm hoping to go for a more idiomatic / "high level" Haskell solution" contradicts that assumption, I'll give the reasonably performing idiomatic Haskell solution.

The way I would expect any random programmer familiar with Haskell to solve this problem is with Lazy bytestrings. This allows the programmer to simply specify the task of reading input and putting output while letting the compiler worry about mucking with the buffering and looping constructs.

module Main where

import System.IOimport System.Environmentimport Data.ByteString.Lazy as BSimport Control.Monadmain :: IO ()main = do  file    <- fmap Prelude.head getArgs  handle  <- openFile file ReadMode  buf     <- BS.hGetContents handle  hPut stdout buf

The result is both more readable and better performing than the code in the original question:

timing 'cat'real    0m0.075suser    0m0.000ssys     0m0.074stiming strict bytestring with GHC -O2real    0m0.254suser    0m0.126ssys     0m0.127stiming strict bytestring with GHC -O2 -fllvmreal    0m0.267suser    0m0.132ssys     0m0.134stiming lazy bytestring with GHC -O2real    0m0.091suser    0m0.023ssys     0m0.067stiming lazy bytestring with GHC -O2 -fllvmreal    0m0.091suser    0m0.021ssys     0m0.069s

That is, the lazy bytestring solution is 21% slower than cat. Putting cat last for preferential caching behavior results in 59ms runtime placing the Haskell solution at 51% slower.

EDIT: Dons suggested using memory mapped IO would more accurately model cat's behavior. I'm not sure how accurate that statement is but mmap almost always results in better performance and this situation is certainly no exception:

timing memory mapped lazy bytestring with GHC -O2real    0m0.008suser    0m0.004ssys     0m0.003s

Which was produced by:

module Main whereimport System.IO (stdout)import System.Environmentimport System.IO.Posix.MMap.Lazyimport Data.ByteString.Lazy (hPut)import Control.Monadmain :: IO ()main = do  file    <- fmap Prelude.head getArgs  buf     <- unsafeMMapFile file  hPut stdout buf


Remark post festum:

I'm not sure what the question is now that people have booted it around a bit. I wanted to see what was up with bytestring-mmap, and so I made a pipes version to 'correct' its lazy bytestring module. https://github.com/michaelt/pipes-bytestring-mmap In consequence I assembled all of these programs, using sibis test method. The only two of the modules in https://github.com/michaelt/pipes-bytestring-mmap/tree/master/bench that seem like anything but dumb bread-and-butter haskell are the ones that use fancy explicit buffer management.

Anyway, here are some results: File size grows by 10* as we move to the right. It is interesting to see how far the programs differ at different file sizes. The programs that don't use mmap only start to show their character as 'linear in the length of the file' at 420M. At that point, and after, they are all almost exactly the same, suggesting that the rather divergent behavior at smaller sizes can't be taken too seriously. The mmap files all behave similarly (to each other) with a few curiosities (which I replicated) All this is on os x.

4200000           42000000          420000000         4200000000timing 'cat'real  0m0.006s    real  0m0.013s    real  0m0.919s    real  0m8.154suser  0m0.002s    user  0m0.002s    user  0m0.005s    user  0m0.028ssys   0m0.003s    sys   0m0.009s    sys   0m0.223s    sys   0m2.179stiming lazy bytestring - idiomatic Haskell (following Thomas M. DuBuisson) real  0m0.009s    real  0m0.025s    real  0m0.894s    real  0m9.146suser  0m0.002s    user  0m0.006s    user  0m0.078s    user  0m0.787ssys   0m0.005s    sys   0m0.016s    sys   0m0.288s    sys   0m3.001stiming fancy buffering following statusfailedreal  0m0.014s    real  0m0.066s    real  0m0.876s    real  0m8.686suser  0m0.005s    user  0m0.028s    user  0m0.278s    user  0m2.724ssys   0m0.007s    sys   0m0.035s    sys   0m0.424s    sys   0m4.232stiming fancier use of GHC.Buf following bmkreal  0m0.011s    real  0m0.018s    real  0m0.831s    real  0m8.218suser  0m0.002s    user  0m0.003s    user  0m0.034s    user  0m0.289ssys   0m0.006s    sys   0m0.013s    sys   0m0.236s    sys   0m2.447stiming Pipes.ByteString following sibireal  0m0.012s    real  0m0.020s    real  0m0.845s    real  0m8.241suser  0m0.003s    user  0m0.004s    user  0m0.020s    user  0m0.175ssys   0m0.007s    sys   0m0.014s    sys   0m0.239s    sys   0m2.509s

Then with mmap

timing Lazy.MMap following dons and Thomas M. DuBuisson real  0m0.006s    real  0m0.006s    real  0m0.037s    real  0m0.133suser  0m0.002s    user  0m0.002s    user  0m0.006s    user  0m0.051ssys   0m0.003s    sys   0m0.003s    sys   0m0.013s    sys   0m0.061timing Pipes.ByteString.MMap with SafeT machineryreal  0m0.006s    real  0m0.010s    real  0m0.051s    real  0m0.196suser  0m0.002s    user  0m0.004s    user  0m0.012s    user  0m0.099ssys   0m0.003s    sys   0m0.005s    sys   0m0.016s    sys   0m0.072stiming Pipes.ByteString.MMap 'withFile' stylereal  0m0.008s    real  0m0.008s    real  0m0.142s    real  0m0.134suser  0m0.002s    user  0m0.002s    user  0m0.007s    user  0m0.046ssys   0m0.004s    sys   0m0.004s    sys   0m0.016s    sys   0m0.066s