With Haskell, how do I process large volumes of XML?

xml haskell tag-soup large-scale large-data

I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.ByteString(.Lazy) if you hope to process large volumes of text efficiently, because String == [Char], a linked list of characters, which is an inappropriate representation for very large flat files.

That in turn means you'll need a Haskell XML library that supports bytestrings. The couple of dozen XML libraries on Hackage are listed here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml

I'm not sure which support bytestrings, but that's the condition you're looking for.
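To make the String versus ByteString point concrete, here is a minimal sketch that streams a file through Data.ByteString.Lazy.Char8 and counts lines that start with a row tag. It is not an XML parser. It assumes one `<row ... />` element per line, which is an assumption about the input layout, and the name `countRowLines` is made up for this illustration.

```haskell
module Main where

import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment (getArgs)

-- | Count the lines whose first non-space characters are "<row".
-- Assumption: the input has one row element per line.
-- The lazy ByteString is read chunk by chunk, so the whole file
-- never needs to sit in memory as a [Char].
countRowLines :: BL.ByteString -> Int
countRowLines = length . filter isRow . BL.lines
  where
    isRow line = BL.pack "<row" `BL.isPrefixOf` BL.dropWhile (== ' ') line

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> BL.readFile path >>= print . countRowLines
    _      -> putStrLn "usage: count-rows FILE"
```

For real queries against attributes you would still want a proper parser that accepts bytestrings; the sketch only shows the IO side.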


Below is an example that uses hexpat:

    {-# LANGUAGE PatternGuards #-}
    module Main where

    import Text.XML.Expat.SAX
    import qualified Data.ByteString.Lazy as B

    userid = "83805"

    main :: IO ()
    main = B.readFile "posts.xml" >>= print . earliest
      where earliest :: B.ByteString -> SAXEvent String String
            earliest = head . filter (ownedBy userid) . parse opts
            opts = ParserOptions Nothing Nothing

    ownedBy :: String -> SAXEvent String String -> Bool
    ownedBy uid (StartElement "row" as)
      | Just ouid <- lookup "OwnerUserId" as = ouid == uid
      | otherwise = False
    ownedBy _ _ = False

The definition of ownedBy is a little clunky. A view pattern may read better:

    {-# LANGUAGE ViewPatterns #-}
    module Main where

    import Text.XML.Expat.SAX
    import qualified Data.ByteString.Lazy as B

    userid = "83805"

    main :: IO ()
    main = B.readFile "posts.xml" >>= print . earliest
      where earliest :: B.ByteString -> SAXEvent String String
            earliest = head . filter (ownedBy userid) . parse opts
            opts = ParserOptions Nothing Nothing

    ownedBy :: String -> SAXEvent String String -> Bool
    ownedBy uid (ownerUserId -> Just ouid) = uid == ouid
    ownedBy _ _ = False

    ownerUserId :: SAXEvent String String -> Maybe String
    ownerUserId (StartElement "row" as) = lookup "OwnerUserId" as
    ownerUserId _ = Nothing
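One caveat with both versions: head throws an exception if no row belongs to the user. A possible variation is to return a Maybe via listToMaybe instead. The sketch below uses a local `Event` type as a hypothetical stand-in for hexpat's SAXEvent, so that it compiles without the library; the constructor names mirror the ones used above, but the type itself is an assumption made for illustration.

```haskell
module Main where

import Data.Maybe (listToMaybe)

-- Hypothetical stand-in for hexpat's SAXEvent String String,
-- reduced to the constructors this sketch needs.
data Event
  = StartElement String [(String, String)]
  | EndElement String
  | CharacterData String
  deriving (Eq, Show)

-- | True for a <row> start tag whose OwnerUserId attribute equals uid.
ownedBy :: String -> Event -> Bool
ownedBy uid (StartElement "row" attrs) = lookup "OwnerUserId" attrs == Just uid
ownedBy _ _ = False

-- | First matching event, or Nothing when the user owns no rows.
-- Like head . filter, this stops consuming the list at the first match.
earliest :: String -> [Event] -> Maybe Event
earliest uid = listToMaybe . filter (ownedBy uid)

main :: IO ()
main = do
  let events =
        [ StartElement "posts" []
        , StartElement "row" [("Id", "7"), ("OwnerUserId", "83805")]
        , EndElement "row"
        , EndElement "posts"
        ]
  print (earliest "83805" events)
  print (earliest "1" events)
```

Comparing the lookup result against Just uid also removes the need for either the pattern guard or the view pattern.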


You could try my fast-tagsoup library. It's a simple replacement for tagsoup and parses at speeds of 20-200MB/sec.

The problem with the tagsoup package is that it works with String internally even if you use its Text or ByteString interface. fast-tagsoup works with strict ByteStrings using high-performance low-level parsing, while still returning a lazy list of tags as output.
