Huge XML in Clojure


I used the new clojure.data.xml to process a 31 GB Wikipedia dump on a modest laptop. The old lazy-xml contrib library did not work for me (it ran out of memory).

https://github.com/clojure/data.xml

Simplified example code:

    (require '[clojure.data.xml :as data.xml])

    (defn process-page [page]
      ;; ...
      )

    (defn page-seq [rdr]
      (->> (:content (data.xml/parse rdr))
           (filter #(= :page (:tag %)))
           (map process-page)))
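Because the sequence is lazy, it has to be consumed while the reader is still open. Here is a minimal sketch of one way to do that, assuming clojure.data.xml is on the classpath and that the input has no XML namespace (some data.xml versions qualify namespaced tags, so :page may not match as-is). The names page-title and count-titled-pages are hypothetical helpers for illustration, not part of the original code.

```clojure
(require '[clojure.data.xml :as data.xml]
         '[clojure.java.io :as io])

(defn page-title
  "Hypothetical helper: returns the text of a <page>'s <title> child,
  or an empty string when the page has no title."
  [page]
  (->> (:content page)
       (filter #(= :title (:tag %)))
       first
       :content
       (apply str)))

(defn count-titled-pages
  "Counts <page> elements with a non-empty title. The whole pipeline
  runs inside with-open, so the lazy seq is fully consumed before the
  reader is closed, and only a number escapes the scope."
  [source]
  (with-open [rdr (io/reader source)]
    (->> (:content (data.xml/parse rdr))
         (filter #(= :page (:tag %)))
         (map page-title)
         (remove empty?)
         count)))
```

Reducing to a small value inside with-open (here a count) also avoids holding on to the head of the sequence, which is what keeps memory use flat on a large file.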


Processing huge XML is usually done with SAX. In Clojure, this is available through the clojure-contrib lazy-xml library: http://richhickey.github.com/clojure-contrib/lazy-xml-api.html

See (parse-seq File/InputStream/URI).
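To illustrate the SAX style itself, here is a rough sketch that uses the JDK's SAX parser directly through Java interop. It is not lazy-xml's parse-seq, just the same event-driven idea; count-elements is a hypothetical name, and the sketch assumes a non-namespace-aware parser (the JDK default), so it matches on the qualified name.

```clojure
(import '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax.helpers DefaultHandler))

(defn count-elements
  "Streams `in` through a SAX parser and counts start tags whose
  qualified name equals `tag-name`. No tree is built, so memory use
  does not grow with the size of the document."
  [^java.io.InputStream in tag-name]
  (let [n (atom 0)
        ;; DefaultHandler ignores every event we do not override.
        ^DefaultHandler handler
        (proxy [DefaultHandler] []
          (startElement [uri local-name q-name attrs]
            (when (= tag-name q-name)
              (swap! n inc))))]
    (.parse (.newSAXParser (SAXParserFactory/newInstance)) in handler)
    @n))
```

It could be called as (count-elements (clojure.java.io/input-stream "dump.xml") "page"), where dump.xml is a placeholder path. Real record processing would also override characters and endElement to collect each record's text.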


If the XML is a collection of records, https://github.com/marktriggs/xml-picker-seq is what you need to process those records regardless of the size of the XML. It uses XOM under the hood and processes one record at a time.