Huge XML in Clojure
I used the new clojure.data.xml
to process a 31GB Wikipedia dump on a modest laptop. The old lazy-xml
contrib library did not work for me (ran out of memory).
https://github.com/clojure/data.xml
Simplified example code:
(require '[clojure.data.xml :as data.xml])

(defn process-page [page]
  ;; ...
  )

(defn page-seq [rdr]
  (->> (:content (data.xml/parse rdr))
       (filter #(= :page (:tag %)))
       (map process-page)))
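For anyone wanting to try this approach on something small first, here is a minimal sketch. It assumes clojure.data.xml is on the classpath. `page-title`, `page-titles`, and `sample-xml` are made-up names for illustration, and the sample has no XML namespace (newer data.xml versions produce namespaced tag keywords when a default namespace is present, so the `:page` comparison may need adjusting on a real dump).

```clojure
(require '[clojure.data.xml :as data.xml])
(import '(java.io StringReader))

;; Hypothetical stand-in for process-page: return the text of the <title> child.
(defn page-title [page]
  (->> (:content page)
       (filter #(= :title (:tag %)))
       first
       :content
       (apply str)))

;; Tiny in-memory sample (an assumption for illustration, not real dump data).
(def sample-xml
  (str "<mediawiki><siteinfo/>"
       "<page><title>Alpha</title></page>"
       "<page><title>Beta</title></page>"
       "</mediawiki>"))

(defn page-titles
  "Eagerly collects titles so that all parsing happens while rdr is still open."
  [rdr]
  (->> (:content (data.xml/parse rdr))
       (filter #(= :page (:tag %)))
       (mapv page-title)))

;; The parsed content is lazy, so consume it inside with-open.
(with-open [rdr (StringReader. sample-xml)]
  (page-titles rdr))
```

On a real multi-gigabyte file you would probably pass `(clojure.java.io/reader "path/to/dump.xml")` instead of the `StringReader`, and consume the pages with `doseq` or `reduce` rather than `mapv`, so that processed pages can be garbage collected instead of accumulating in a vector.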
Processing huge XML is usually done with SAX. In Clojure, the contrib library for this is lazy-xml: http://richhickey.github.com/clojure-contrib/lazy-xml-api.html
See (parse-seq File/InputStream/URI).
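To illustrate what SAX-style streaming looks like, here is a rough sketch that goes straight to the JDK's SAX parser via Java interop. This is not the lazy-xml API; `count-elements` is a made-up helper, and it only counts start tags, which is enough to show that the document is never held in memory as a tree.

```clojure
(import '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax InputSource)
        '(org.xml.sax.helpers DefaultHandler)
        '(java.io StringReader))

(defn count-elements
  "Streams the XML from rdr through a SAX parser and counts the start tags
  whose qualified name equals tag-name. No tree is built."
  [rdr tag-name]
  (let [n (atom 0)
        ;; The handler is called once per opening tag as the parser streams along.
        ^DefaultHandler handler
        (proxy [DefaultHandler] []
          (startElement [uri local-name q-name attrs]
            (when (= tag-name q-name)
              (swap! n inc))))
        parser (.newSAXParser (SAXParserFactory/newInstance))]
    (.parse parser (InputSource. rdr) handler)
    @n))

(count-elements (StringReader. "<root><page/><page/><other/></root>") "page")
```

For a large file the reader would come from something like `(clojure.java.io/reader "path/to/file.xml")`; the callback style is less convenient than a lazy seq, but memory use stays flat regardless of file size.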
If the XML is a collection of records, https://github.com/marktriggs/xml-picker-seq is what you need: it lets you process the records regardless of the size of the XML. It uses XOM under the hood and processes one 'record' at a time.