What is the best way to parse (big) XML in C# Code?
Use XmlReader to parse large XML documents. XmlReader provides fast, forward-only, non-cached access to XML data. (Forward-only means you can read the XML file from beginning to end but cannot move backwards in the file.) XmlReader uses small amounts of memory, and is comparable to a simple SAX reader, except that it is a pull parser: your code asks for the next node instead of receiving callbacks.
    using (XmlReader myReader = XmlReader.Create(@"c:\data\coords.xml"))
    {
        while (myReader.Read())
        {
            // Process each node (myReader.Value) here
            // ...
        }
    }
You can use XmlReader to process files that are up to 2 gigabytes (GB) in size.
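To make the "process each node" step a bit more concrete, here is a minimal sketch that streams through a document and picks out one kind of element. The `point` element, its `x`/`y` attributes and the sample XML are assumptions made up for illustration; they are not taken from the real coords.xml.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class CoordReaderSketch
{
    // Hypothetical sample input; a real extract would be read from a file.
    public const string SampleXml =
        "<coords><point x=\"1\" y=\"2\"/><point x=\"3\" y=\"4\"/><note>ignored</note></coords>";

    // Streams through the XML once and returns "x,y" for every <point> element.
    public static List<string> ReadPoints(TextReader input)
    {
        var points = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // Read() advances one node at a time; there is no way to go back.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "point")
                {
                    points.Add(reader.GetAttribute("x") + "," + reader.GetAttribute("y"));
                }
            }
        }
        return points;
    }

    public static void Main()
    {
        foreach (string point in ReadPoints(new StringReader(SampleXml)))
        {
            Console.WriteLine(point);
        }
    }
}
```

Only the current node is held in memory at any time, which is why this style copes with very large files.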
As at 14 May 2009: I've switched to using a hybrid approach... see code below.
This version has most of the advantages of both:
* the XmlReader/XmlTextReader (memory efficiency --> speed); and
* the XmlSerializer (code-gen --> development expediency and flexibility).
It uses the XmlTextReader to iterate through the document, and creates "doclets" which it deserializes using the XmlSerializer and "XML binding" classes generated with XSD.EXE.
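Stripped of the project-specific bindings, the idea can be sketched roughly as follows. The `Feature` class here is a hand-written, hypothetical stand-in for a class generated by XSD.EXE, and the sample XML is invented for illustration, so treat this as a sketch of the technique and not the production code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical binding class standing in for one generated by XSD.EXE.
[XmlRoot("feature")]
public class Feature
{
    [XmlAttribute("id")]
    public string Id { get; set; }

    [XmlElement("name")]
    public string Name { get; set; }
}

public static class HybridSketch
{
    // Hypothetical sample input.
    public const string SampleXml =
        "<extract>" +
        "<feature id=\"f1\"><name>River</name></feature>" +
        "<feature id=\"f2\"><name>Weir</name></feature>" +
        "</extract>";

    public static List<Feature> ReadFeatures(TextReader input)
    {
        var serializer = new XmlSerializer(typeof(Feature));
        var features = new List<Feature>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "feature")
                {
                    // ReadOuterXml returns the whole element as a string (the "doclet")
                    // and moves the reader past it, so there is no extra Read() here.
                    using (var doclet = new StringReader(reader.ReadOuterXml()))
                    {
                        features.Add((Feature)serializer.Deserialize(doclet));
                    }
                }
                else
                {
                    reader.Read(); // step over everything that is not a <feature>
                }
            }
        }
        return features;
    }

    public static void Main()
    {
        foreach (Feature f in ReadFeatures(new StringReader(SampleXml)))
        {
            Console.WriteLine(f.Id + ": " + f.Name);
        }
    }
}
```

Only one doclet is materialised as a string at a time, so memory use stays proportional to the largest single feature and not to the whole document (the list of deserialized objects aside).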
I guess this recipe is universally applicable, and it's fast... I'm parsing a 201 MB XML Document containing 56,000 GML Features in about 7 seconds... the old VB6 implementation of this application took minutes (or even hours) to parse large extracts... so I'm lookin' good to go.
Once again, a BIG Thank You to the forumites for donating your valuable time. I really appreciate it.
Cheers all. Keith.
    using System;
    using System.Reflection;
    using System.Xml;
    using System.Xml.Serialization;
    using System.IO;
    using System.Collections.Generic;
    using nrw_rime_extract.utils;
    using nrw_rime_extract.xml.generated_bindings;

    namespace nrw_rime_extract.xml
    {
        internal interface ExtractXmlReader
        {
            rimeType read(string xmlFilename);
        }

        /// <summary>
        /// RimeExtractXml provides bindings to the RIME Extract XML as defined by
        /// $/Release 2.7/Documentation/Technical/SCHEMA and DTDs/nrw-rime-extract.xsd
        /// </summary>
        internal class ExtractXmlReader_XmlSerializerImpl : ExtractXmlReader
        {
            private Log log = Log.getInstance();

            public rimeType read(string xmlFilename)
            {
                log.write(
                    string.Format(
                        "DEBUG: ExtractXmlReader_XmlSerializerImpl.read({0})",
                        xmlFilename));
                using (Stream stream = new FileStream(xmlFilename, FileMode.Open))
                {
                    return read(stream);
                }
            }

            internal rimeType read(Stream xmlInputStream)
            {
                // create an instance of the XmlSerializer class,
                // specifying the type of object to be deserialized.
                XmlSerializer serializer = new XmlSerializer(typeof(rimeType));
                serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
                serializer.UnknownAttribute +=
                    new XmlAttributeEventHandler(handleUnknownAttribute);
                // use the Deserialize method to restore the object's state
                // with data from the XML document.
                return (rimeType)serializer.Deserialize(xmlInputStream);
            }

            protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
            {
                log.write(
                    string.Format(
                        "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                        e.LineNumber, e.LinePosition, e.Name, e.Text));
            }

            protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
            {
                log.write(
                    string.Format(
                        "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                        e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
            }
        }

        /// <summary>
        /// ExtractXmlReader provides bindings to the extract.xml
        /// returned by the RIME server; as defined by:
        /// $/Release X/Documentation/Technical/SCHEMA and
        /// DTDs/nrw-rime-extract.xsd
        /// </summary>
        internal class ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl : ExtractXmlReader
        {
            private Log log = Log.getInstance();

            public rimeType read(string xmlFilename)
            {
                log.write(
                    string.Format(
                        "DEBUG: ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl." +
                        "read({0})",
                        xmlFilename));
                using (XmlReader reader = XmlReader.Create(xmlFilename))
                {
                    return read(reader);
                }
            }

            public rimeType read(XmlReader reader)
            {
                rimeType result = new rimeType();

                // a deserializer for featureClass, feature, etc, "doclets"
                Dictionary<Type, XmlSerializer> serializers =
                    new Dictionary<Type, XmlSerializer>();
                serializers.Add(typeof(featureClassType),
                    newSerializer(typeof(featureClassType)));
                serializers.Add(typeof(featureType),
                    newSerializer(typeof(featureType)));

                List<featureClassType> featureClasses = new List<featureClassType>();
                List<featureType> features = new List<featureType>();

                while (!reader.EOF)
                {
                    if (reader.MoveToContent() != XmlNodeType.Element)
                    {
                        reader.Read(); // skip non-element-nodes and unknown-elements.
                        continue;
                    } // skip junk nodes.

                    if (reader.Name.Equals("featureClass"))
                    {
                        using (StringReader elementReader =
                            new StringReader(reader.ReadOuterXml()))
                        {
                            XmlSerializer deserializer =
                                serializers[typeof(featureClassType)];
                            featureClasses.Add(
                                (featureClassType)deserializer.Deserialize(elementReader));
                        }
                        continue; // ReadOuterXml advances the reader, so don't read again.
                    }

                    if (reader.Name.Equals("feature"))
                    {
                        using (StringReader elementReader =
                            new StringReader(reader.ReadOuterXml()))
                        {
                            XmlSerializer deserializer = serializers[typeof(featureType)];
                            features.Add(
                                (featureType)deserializer.Deserialize(elementReader));
                        }
                        continue; // ReadOuterXml advances the reader, so don't read again.
                    }

                    log.write(
                        "WARNING: unknown element '" + reader.Name +
                        "' was skipped during parsing.");
                    reader.Read(); // skip non-element-nodes and unknown-elements.
                }

                result.featureClasses = featureClasses.ToArray();
                result.features = features.ToArray();
                return result;
            }

            private XmlSerializer newSerializer(Type elementType)
            {
                XmlSerializer serializer = new XmlSerializer(elementType);
                serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
                serializer.UnknownAttribute +=
                    new XmlAttributeEventHandler(handleUnknownAttribute);
                return serializer;
            }

            protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
            {
                log.write(
                    string.Format(
                        "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                        e.LineNumber, e.LinePosition, e.Name, e.Text));
            }

            protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
            {
                log.write(
                    string.Format(
                        "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                        e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
            }
        }
    }
Just to summarise, and to make the answer a bit more obvious for anyone who finds this thread via Google:
Prior to .NET 2 the XmlTextReader was the most memory efficient XML parser available in the standard API (thanx Mitch;-)
.NET 2 introduced the XmlReader.Create factory method, which is better again. The reader it returns is a forward-only node iterator (a bit like a StAX parser). (thanx Cerebrus;-)
And remember kiddies, if any XML instance has the potential to be bigger than about 500k, DON'T USE DOM!
Cheers all. Keith.