What is the best way to parse (big) XML in C# Code?


Use XmlReader to parse large XML documents. XmlReader provides fast, forward-only, non-cached access to XML data. (Forward-only means you can read the XML file from beginning to end but cannot move backwards in the file.) XmlReader uses small amounts of memory, and is comparable in that respect to a simple SAX reader.

    using (XmlReader myReader = XmlReader.Create(@"c:\data\coords.xml"))
    {
        while (myReader.Read())
        {
            // Process each node (myReader.Value) here
            // ...
        }
    }
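To make the "process each node" step more concrete, here is a minimal sketch of a forward-only loop that picks out just the elements it cares about. The element and attribute names (`coord`, `x`, `y`) and the `CoordsReaderSketch` class are assumptions for illustration only; the post does not show what coords.xml actually contains.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class CoordsReaderSketch
{
    // Hypothetical layout: <coords><coord x="..." y="..."/>...</coords>
    public static List<KeyValuePair<double, double>> ReadCoords(TextReader input)
    {
        var coords = new List<KeyValuePair<double, double>>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // Read() moves to the next node and returns false at end of input.
            while (reader.Read())
            {
                // Only start tags named "coord" are of interest; text,
                // whitespace, comments and end tags are simply passed over.
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "coord")
                {
                    // GetAttribute returns null if the attribute is missing,
                    // in which case XmlConvert.ToDouble throws.
                    double x = XmlConvert.ToDouble(reader.GetAttribute("x"));
                    double y = XmlConvert.ToDouble(reader.GetAttribute("y"));
                    coords.Add(new KeyValuePair<double, double>(x, y));
                }
            }
        }
        return coords;
    }
}
```

For a file on disk you could pass `new StreamReader(path)` instead of a `StringReader`. Only one node is held by the reader at a time, so memory use is driven by what you choose to keep in the list, not by the size of the document.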

You can use XmlReader to process files that are up to 2 gigabytes (GB) in size.

Ref: How to read XML from a file by using Visual C#


As at 14 May 2009: I've switched to using a hybrid approach... see code below.

This version has most of the advantages of both:
  * the XmlReader/XmlTextReader (memory efficiency --> speed); and
  * the XmlSerializer (code-gen --> development expediency and flexibility).

It uses the XmlTextReader to iterate through the document, and creates "doclets" which it deserializes using the XmlSerializer and "XML binding" classes generated with XSD.EXE.

I guess this recipe is universally applicable, and it's fast... I'm parsing a 201 MB XML Document containing 56,000 GML Features in about 7 seconds... the old VB6 implementation of this application took minutes (or even hours) to parse large extracts... so I'm lookin' good to go.

Once again, a BIG Thank You to the forumites for donating your valuable time. I really appreciate it.

Cheers all. Keith.

    using System;
    using System.Reflection;
    using System.Xml;
    using System.Xml.Serialization;
    using System.IO;
    using System.Collections.Generic;
    using nrw_rime_extract.utils;
    using nrw_rime_extract.xml.generated_bindings;

    namespace nrw_rime_extract.xml
    {
        internal interface ExtractXmlReader
        {
            rimeType read(string xmlFilename);
        }

        /// <summary>
        /// RimeExtractXml provides bindings to the RIME Extract XML as defined by
        /// $/Release 2.7/Documentation/Technical/SCHEMA and DTDs/nrw-rime-extract.xsd
        /// </summary>
        internal class ExtractXmlReader_XmlSerializerImpl : ExtractXmlReader
        {
            private Log log = Log.getInstance();

            public rimeType read(string xmlFilename)
            {
                log.write(
                    string.Format(
                        "DEBUG: ExtractXmlReader_XmlSerializerImpl.read({0})",
                        xmlFilename));
                using (Stream stream = new FileStream(xmlFilename, FileMode.Open))
                {
                    return read(stream);
                }
            }

            internal rimeType read(Stream xmlInputStream)
            {
                // create an instance of the XmlSerializer class,
                // specifying the type of object to be deserialized.
                XmlSerializer serializer = new XmlSerializer(typeof(rimeType));
                serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
                serializer.UnknownAttribute +=
                    new XmlAttributeEventHandler(handleUnknownAttribute);
                // use the Deserialize method to restore the object's state
                // with data from the XML document.
                return (rimeType)serializer.Deserialize(xmlInputStream);
            }

            protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
            {
                log.write(
                    string.Format(
                        "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                        e.LineNumber, e.LinePosition, e.Name, e.Text));
            }

            protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
            {
                log.write(
                    string.Format(
                        "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                        e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
            }
        }

        /// <summary>
        /// ExtractXmlReader provides bindings to the extract.xml
        /// returned by the RIME server; as defined by:
        ///   $/Release X/Documentation/Technical/SCHEMA and
        /// DTDs/nrw-rime-extract.xsd
        /// </summary>
        internal class ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl :
            ExtractXmlReader
        {
            private Log log = Log.getInstance();

            public rimeType read(string xmlFilename)
            {
                log.write(
                    string.Format(
                        "DEBUG: ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl." +
                        "read({0})",
                        xmlFilename));
                using (XmlReader reader = XmlReader.Create(xmlFilename))
                {
                    return read(reader);
                }
            }

            public rimeType read(XmlReader reader)
            {
                rimeType result = new rimeType();
                // a deserializer for featureClass, feature, etc, "doclets"
                Dictionary<Type, XmlSerializer> serializers =
                    new Dictionary<Type, XmlSerializer>();
                serializers.Add(typeof(featureClassType),
                    newSerializer(typeof(featureClassType)));
                serializers.Add(typeof(featureType),
                    newSerializer(typeof(featureType)));
                List<featureClassType> featureClasses = new List<featureClassType>();
                List<featureType> features = new List<featureType>();
                while (!reader.EOF)
                {
                    // skip junk nodes.
                    if (reader.MoveToContent() != XmlNodeType.Element)
                    {
                        reader.Read(); // skip non-element-nodes and unknown-elements.
                        continue;
                    }
                    if (reader.Name.Equals("featureClass"))
                    {
                        using (
                            StringReader elementReader =
                                new StringReader(reader.ReadOuterXml()))
                        {
                            XmlSerializer deserializer =
                                serializers[typeof (featureClassType)];
                            featureClasses.Add(
                                (featureClassType)
                                deserializer.Deserialize(elementReader));
                        }
                        // ReadOuterXml advances the reader, so don't read again.
                        continue;
                    }
                    if (reader.Name.Equals("feature"))
                    {
                        using (
                            StringReader elementReader =
                                new StringReader(reader.ReadOuterXml()))
                        {
                            XmlSerializer deserializer =
                                serializers[typeof (featureType)];
                            features.Add(
                                (featureType)
                                deserializer.Deserialize(elementReader));
                        }
                        // ReadOuterXml advances the reader, so don't read again.
                        continue;
                    }
                    log.write(
                        "WARNING: unknown element '" + reader.Name +
                        "' was skipped during parsing.");
                    reader.Read(); // skip non-element-nodes and unknown-elements.
                }
                result.featureClasses = featureClasses.ToArray();
                result.features = features.ToArray();
                return result;
            }

            private XmlSerializer newSerializer(Type elementType)
            {
                XmlSerializer serializer = new XmlSerializer(elementType);
                serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
                serializer.UnknownAttribute +=
                    new XmlAttributeEventHandler(handleUnknownAttribute);
                return serializer;
            }

            protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
            {
                log.write(
                    string.Format(
                        "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                        e.LineNumber, e.LinePosition, e.Name, e.Text));
            }

            protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
            {
                log.write(
                    string.Format(
                        "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                        e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
            }
        }
    }
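
For anyone who wants to try the hybrid idea without the RIME bindings, here is a cut-down, self-contained sketch of the same pattern. Note it is a variation, not the code above: `FeatureSketch` is a hypothetical stand-in for an XSD.EXE-generated class, and it hands each "doclet" to the XmlSerializer via `XmlReader.ReadSubtree()` instead of `ReadOuterXml()`, which avoids building an intermediate string per element. Treat it as a starting point under those assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical binding class; a real project would generate this with XSD.EXE.
[XmlRoot("feature")]
public class FeatureSketch
{
    [XmlAttribute("id")]
    public string Id;

    [XmlElement("name")]
    public string Name;
}

public static class HybridReaderSketch
{
    public static List<FeatureSketch> ReadFeatures(TextReader input)
    {
        // One serializer instance is created up front and reused per doclet.
        var serializer = new XmlSerializer(typeof(FeatureSketch));
        var features = new List<FeatureSketch>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "feature")
                {
                    // The subtree reader exposes only this element and its
                    // descendants, so the serializer sees a small document.
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        features.Add((FeatureSketch)serializer.Deserialize(subtree));
                    }
                    // Once the subtree reader is closed, the outer reader is
                    // left on the end of this element; the next Read() moves on.
                }
            }
        }
        return features;
    }
}
```

Unlike `ReadOuterXml()`, closing the subtree reader does not move the outer reader past the element, so a plain `while (reader.Read())` loop does not skip the next sibling.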


Just to summarise, and make the answer a bit more obvious for anyone who finds this thread in Google.

Prior to .NET 2 the XmlTextReader was the most memory efficient XML parser available in the standard API (thanx Mitch;-)

.NET 2 introduced the XmlReader.Create factory methods, which are better again. XmlReader is a forward-only element iterator (a bit like a StAX parser). (thanx Cerebrus;-)

And remember kiddies, if any XML instance has the potential to be bigger than about 500k, DON'T USE DOM!

Cheers all. Keith.