With Haskell, how do I process large volumes of XML?


I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.ByteString(.Lazy) if you hope to process large volumes of text efficiently, because String is just [Char], a linked list of characters, which is an inappropriate representation for very large flat files.

That in turn means you'll need a Haskell XML library that supports bytestrings. The couple of dozen XML libraries on Hackage are listed here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml

I'm not sure which support bytestrings, but that's the condition you're looking for.
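To make the point about String versus ByteString concrete, here is a minimal sketch that streams a file as a lazy ByteString and counts the lines that begin with `<row`. It is not an XML parser. It assumes, purely for illustration, that the dump puts one row element per line and that the file is called posts.xml; `countRowLines` is a hypothetical helper name. Only the bytestring package that ships with GHC is used.

```haskell
module Main where

import qualified Data.ByteString.Lazy       as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import           Data.Char (isSpace)
import           Data.List (foldl')

-- | Count the lines whose first non-blank characters are "<row".
-- Assumption: one element per line. This is a line scan, not XML parsing.
countRowLines :: BL.ByteString -> Int
countRowLines = foldl' bump 0 . BLC.lines
  where
    bump n line
      | BLC.pack "<row" `BL.isPrefixOf` BLC.dropWhile isSpace line = n + 1
      | otherwise                                                  = n

main :: IO ()
main = BL.readFile "posts.xml" >>= print . countRowLines  -- hypothetical file name
```

Because the input is read lazily in chunks and the fold is strict in its counter, the whole file should never need to be resident in memory at once.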


Below is an example that uses hexpat:

    {-# LANGUAGE PatternGuards #-}
    module Main where

    import Text.XML.Expat.SAX
    import qualified Data.ByteString.Lazy as B

    -- The user whose posts we are looking for.
    userid = "83805"

    main :: IO ()
    main = B.readFile "posts.xml" >>= print . earliest
      where earliest :: B.ByteString -> SAXEvent String String
            -- First matching event in document order; head fails if there is none.
            earliest = head . filter (ownedBy userid) . parse opts
            -- Both parser options are left as Nothing.
            opts = ParserOptions Nothing Nothing

    -- True for a <row> start tag whose OwnerUserId attribute equals uid.
    ownedBy :: String -> SAXEvent String String -> Bool
    ownedBy uid (StartElement "row" as)
      | Just ouid <- lookup "OwnerUserId" as = ouid == uid
      | otherwise = False
    ownedBy _ _ = False

The definition of ownedBy is a little clunky. A view pattern might read better:

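    {-# LANGUAGE ViewPatterns #-}
    module Main where

    import Text.XML.Expat.SAX
    import qualified Data.ByteString.Lazy as B

    -- The user whose posts we are looking for.
    userid = "83805"

    main :: IO ()
    main = B.readFile "posts.xml" >>= print . earliest
      where earliest :: B.ByteString -> SAXEvent String String
            -- First matching event in document order; head fails if there is none.
            earliest = head . filter (ownedBy userid) . parse opts
            -- Both parser options are left as Nothing.
            opts = ParserOptions Nothing Nothing

    -- True when the event is a <row> start tag owned by uid.
    ownedBy :: String -> SAXEvent String String -> Bool
    ownedBy uid (ownerUserId -> Just ouid) = uid == ouid
    ownedBy _ _ = False

    -- Extract the OwnerUserId attribute from a <row> start tag, if present.
    ownerUserId :: SAXEvent String String -> Maybe String
    ownerUserId (StartElement "row" as) = lookup "OwnerUserId" as
    ownerUserId _ = Nothing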
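The examples above still use String for tag names and attribute values. As a hedged variation, the sketch below asks hexpat for strict ByteString tags and text instead, and counts every matching row with a strict fold rather than taking the first one. It assumes a hexpat release that exports `defaultParseOptions` from Text.XML.Expat.SAX (older releases spell the options type differently, as in the code above); `countOwnedBy` is a hypothetical name introduced here for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString      as S   -- strict: tag names and attribute values
import qualified Data.ByteString.Lazy as L   -- lazy: the input stream
import           Data.List (foldl')
import           Text.XML.Expat.SAX

-- | Count <row> start tags whose OwnerUserId attribute equals the given id.
-- Assumption: defaultParseOptions is available in the installed hexpat.
countOwnedBy :: S.ByteString -> L.ByteString -> Int
countOwnedBy uid = foldl' step 0 . parse defaultParseOptions
  where
    step :: Int -> SAXEvent S.ByteString S.ByteString -> Int
    step n (StartElement "row" attrs)
      | lookup "OwnerUserId" attrs == Just uid = n + 1
    step n _ = n  -- every other event, including parse failures, is ignored

main :: IO ()
main = L.readFile "posts.xml" >>= print . countOwnedBy "83805"
```

The strict fold consumes the lazily produced event list as it goes, so the event stream does not have to be held in memory all at once.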


You could try my fast-tagsoup library. It's a simple replacement for tagsoup and parses at speeds of 20-200MB/sec.

The problem with the tagsoup package is that it works with String internally, even if you use its Text or ByteString interface. fast-tagsoup works on strict ByteStrings using high-performance low-level parsing, while still returning a lazy list of tags as output.