Convert XML file to CSV in java Convert XML file to CSV in java xml xml

Convert XML file to CSV in java


The code provided should be considered a sketch rather than the definitive article. I am not an expert on SAX and the implementation could be improved for better performance, simpler code etc. That said SAX should be able to cope with streaming large XML files.

I would approach this problem with 2 passes using the SAX parser. (Incidentally, I would also use a CSV generating library to create the output as this would deal with all the fiddly character escaping that CSV involves but I haven't implemented this in my sketch).

First pass:Establish number of header columns

Second pass:Output CSV

I assume that the XML file is well formed. I assume that we don't have a scheme/DTD with a predefined order.

In the first pass I have assumed that a CSV column will be added for every XML element containing text content or for any attribute (I have assumed attributes will contain something!).

The second pass, having established the number of target columns, will do the actual CSV output.

Based on your example XML my code sketch would produce:

ItemID,StartTime,EndTime,ViewItemURL,AverageTime,category,category,type,type,AveragePrice4504216603,10:00:10.000Z,10:00:30.000Z,http://url,,,,,,4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,9823,9112,TX,TY,value2

Please note I have used the google collections LinkedHashMultimap as this is helpful when associating multiple values with a single key. I hope you find this useful!

import com.google.common.collect.LinkedHashMultimap;import java.io.FileNotFoundException;import java.io.FileReader;import java.io.IOException;import java.util.LinkedHashMap;import java.util.Map.Entry;import org.xml.sax.Attributes;import org.xml.sax.InputSource;import org.xml.sax.SAXException;import org.xml.sax.XMLReader;import org.xml.sax.helpers.DefaultHandler;import org.xml.sax.helpers.XMLReaderFactory;public class App {    public static void main(String[] args) throws SAXException, FileNotFoundException, IOException {        // First pass - to determine headers        XMLReader xr = XMLReaderFactory.createXMLReader();        HeaderHandler handler = new HeaderHandler();        xr.setContentHandler(handler);        xr.setErrorHandler(handler);        FileReader r = new FileReader("test1.xml");        xr.parse(new InputSource(r));        LinkedHashMap<String, Integer> headers = handler.getHeaders();        int totalnumberofcolumns = 0;        for (int headercount : headers.values()) {            totalnumberofcolumns += headercount;        }        String[] columnheaders = new String[totalnumberofcolumns];        int i = 0;        for (Entry<String, Integer> entry : headers.entrySet()) {            for (int j = 0; j < entry.getValue(); j++) {                columnheaders[i] = entry.getKey();                i++;            }        }        StringBuilder sb = new StringBuilder();        for (String h : columnheaders) {            sb.append(h);            sb.append(',');        }        System.out.println(sb.substring(0, sb.length() - 1));        // Second pass - collect and output data        xr = XMLReaderFactory.createXMLReader();        DataHandler datahandler = new DataHandler();        datahandler.setHeaderArray(columnheaders);        xr.setContentHandler(datahandler);        xr.setErrorHandler(datahandler);        r = new FileReader("test1.xml");        xr.parse(new InputSource(r));    }    public static class HeaderHandler extends DefaultHandler {        private String content;        private String currentElement;        private boolean insideElement = false;        private Attributes attribs;        private LinkedHashMap<String, Integer> itemHeader;        private LinkedHashMap<String, Integer> accumulativeHeader = new LinkedHashMap<String, Integer>();        public HeaderHandler() {            super();        }        private LinkedHashMap<String, Integer> getHeaders() {            return accumulativeHeader;        }        private void addItemHeader(String headerName) {            if (itemHeader.containsKey(headerName)) {                itemHeader.put(headerName, itemHeader.get(headerName) + 1);            } else {                itemHeader.put(headerName, 1);            }        }        @Override        public void startElement(String uri, String name,                String qName, Attributes atts) {            if ("item".equalsIgnoreCase(qName)) {                itemHeader = new LinkedHashMap<String, Integer>();            }            currentElement = qName;            content = null;            insideElement = true;            attribs = atts;        }        @Override        public void endElement(String uri, String name, String qName) {            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {                    addItemHeader(qName);                }                if (attribs != null) {                    int attsLength = attribs.getLength();                    if (attsLength > 0) {                        for (int i = 0; i < attsLength; i++) {                            String attName = attribs.getLocalName(i);                            addItemHeader(attName);                        }                    }                }            }            if ("item".equalsIgnoreCase(qName)) {                for (Entry<String, Integer> entry : itemHeader.entrySet()) {                    String headerName = entry.getKey();                    Integer count = entry.getValue();                    //System.out.println(entry.getKey() + ":" + entry.getValue());                    if (accumulativeHeader.containsKey(headerName)) {                        if (count > accumulativeHeader.get(headerName)) {                            accumulativeHeader.put(headerName, count);                        }                    } else {                        accumulativeHeader.put(headerName, count);                    }                }            }            insideElement = false;            currentElement = null;            attribs = null;        }        @Override        public void characters(char ch[], int start, int length) {            if (insideElement) {                content = new String(ch, start, length);            }        }    }    public static class DataHandler extends DefaultHandler {        private String content;        private String currentElement;        private boolean insideElement = false;        private Attributes attribs;        private LinkedHashMultimap dataMap;        private String[] headerArray;        public DataHandler() {            super();        }        @Override        public void startElement(String uri, String name,                String qName, Attributes atts) {            if ("item".equalsIgnoreCase(qName)) {                dataMap = LinkedHashMultimap.create();            }            currentElement = qName;            content = null;            insideElement = true;            attribs = atts;        }        @Override        public void endElement(String uri, String name, String qName) {            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {                    dataMap.put(qName, content);                }                if (attribs != null) {                    int attsLength = attribs.getLength();                    if (attsLength > 0) {                        for (int i = 0; i < attsLength; i++) {                            String attName = attribs.getLocalName(i);                            dataMap.put(attName, attribs.getValue(i));                        }                    }                }            }            if ("item".equalsIgnoreCase(qName)) {                String data[] = new String[headerArray.length];                int i = 0;                for (String h : headerArray) {                    if (dataMap.containsKey(h)) {                        Object[] values = dataMap.get(h).toArray();                        data[i] = (String) values[0];                        if (values.length > 1) {                            dataMap.removeAll(h);                            for (int j = 1; j < values.length; j++) {                                dataMap.put(h, values[j]);                            }                        } else {                            dataMap.removeAll(h);                        }                    } else {                        data[i] = "";                    }                    i++;                }                StringBuilder sb = new StringBuilder();                for (String d : data) {                    sb.append(d);                    sb.append(',');                }                System.out.println(sb.substring(0, sb.length() - 1));            }            insideElement = false;            currentElement = null;            attribs = null;        }        @Override        public void characters(char ch[], int start, int length) {            if (insideElement) {                content = new String(ch, start, length);            }        }        public void setHeaderArray(String[] headerArray) {            this.headerArray = headerArray;        }    }}


This looks like a good case for using XSL. Given your basic requirements it may be easier to get at the right nodes with XSL as compared to custom parsers or serializers. The benefit would be that your XSL could target "//Item//AverageTime" or whatever nodes you require without worrying about node depth.

UPDATE: The following is the xslt I threw together to make sure this worked as expected.

<?xml version="1.0"?><xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"><xsl:output method="text" /><xsl:template match="/">ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice<xsl:for-each select="//Item"><xsl:value-of select="ItemID"/><xsl:text>,</xsl:text><xsl:value-of select="//StartTime"/><xsl:text>,</xsl:text><xsl:value-of select="//EndTime"/><xsl:text>,</xsl:text><xsl:value-of select="//ViewItemURL"/><xsl:text>,</xsl:text><xsl:value-of select="//AverageTime"/><xsl:text>,</xsl:text><xsl:value-of select="//AveragePrice"/><xsl:text></xsl:text></xsl:for-each></xsl:template></xsl:stylesheet>


I'm not sure I understand how generic the solution should be. Do you really want to parse a 1 GB file twice for a generic solution? And if you want something generic, why did you skipped the <category> element in your example? How much different format do you need to handle? Do you really not know what the format can be (even if some element can be ommited)? Can you clarify?

To my experience, it's generally preferable to parse specific files in a specific way (this doesn't exclude using a generic API though). My answer will go in this direction (and I'll update it after the clarification).


If you don't feel comfortable with XML, you could consider using some existing (commercial) libraries, for example Ricebridge XML Manager and CSV Manager. See How to convert CSV into XML and XML into CSV using Java for a full example. The approach is pretty straightforward: you define the data fields using XPath expressions (which is perfect in your case since you can have "extra" elements), parse the the file and then pass the result List to the CSV component to generate the CSV file. The API looks simple, the code tested (the source code of their test cases is available under a BSD-style license), they claim supporting gigabyte-sized files.

You can get a Single Developer license for $170 which is not very expensive compared to developer daily rates.

They offer 30 days trial versions, have a look.


Another option would be to use Spring Batch. Spring batch offers everything required to work with XML files as input or output (using StAX and the XML binding framework of your choice) and flat files as input or output. See:


You could also use Smooks to do XML to CSV transformations. See also:


Another option would be to roll your own solution, using a StAX parser or, why not, using VTD-XML and XPath. Have a look at: