Normalization in DOM parsing with Java - how does it work?

The rest of the sentence (from the documentation of Node.normalize()) is:

where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes.

This basically means that the following XML element

<foo>Hello world</foo>

could be represented like this in a denormalized node:

Element foo
    Text node: ""
    Text node: "Hello "
    Text node: "wor"
    Text node: "ld"

When normalized, the node will look like this:

Element foo
    Text node: "Hello world"

And the same goes for attributes: <foo bar="Hello world"/>, comments, etc.
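To make the example concrete, here is a minimal sketch that builds the denormalized foo element by hand with the standard JAXP/DOM API and then normalizes it. The class name NormalizeDemo and the helper buildDenormalizedFoo are made up for this illustration; only the javax.xml.parsers and org.w3c.dom calls come from the JDK.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeDemo {
    // Hypothetical helper: builds <foo> with four separate Text children,
    // mirroring the denormalized tree shown above.
    public static Element buildDenormalizedFoo() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element foo = doc.createElement("foo");
        doc.appendChild(foo);
        for (String part : new String[] {"", "Hello ", "wor", "ld"}) {
            foo.appendChild(doc.createTextNode(part));
        }
        return foo;
    }

    public static void main(String[] args) throws Exception {
        Element foo = buildDenormalizedFoo();
        System.out.println("children before: " + foo.getChildNodes().getLength());

        // Merges adjacent Text nodes and drops empty ones.
        foo.normalize();

        System.out.println("children after:  " + foo.getChildNodes().getLength());
        System.out.println("text: " + foo.getFirstChild().getNodeValue());
    }
}
```

Before the call the element has four Text children; afterwards it should have a single Text child holding "Hello world".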


Simply put, normalisation is the reduction of redundancies.
Examples of redundancies:
a) whitespace outside of the root/document tags (...<document></document>...)
b) whitespace within a start tag (<...>) and an end tag (</...>)
c) whitespace between attributes and their values (i.e., spaces between the attribute name and =")
d) superfluous namespace declarations
e) line breaks/whitespace in the text of attributes and tags
f) comments, etc.


As an extension to @JBNizet's answer, for more technical users: here is what the implementation of the org.w3c.dom.Node interface in com.sun.org.apache.xerces.internal.dom.ParentNode looks like. It gives you an idea of how normalization actually works.

    public void normalize() {
        // No need to normalize if already normalized.
        if (isNormalized()) {
            return;
        }
        if (needsSyncChildren()) {
            synchronizeChildren();
        }
        ChildNode kid;
        for (kid = firstChild; kid != null; kid = kid.nextSibling) {
            kid.normalize();
        }
        isNormalized(true);
    }

It traverses all the child nodes recursively and calls kid.normalize() on each of them.
This mechanism is overridden in org.apache.xerces.dom.ElementImpl:

    public void normalize() {
        // No need to normalize if already normalized.
        if (isNormalized()) {
            return;
        }
        if (needsSyncChildren()) {
            synchronizeChildren();
        }
        ChildNode kid, next;
        for (kid = firstChild; kid != null; kid = next) {
            next = kid.nextSibling;

            // If kid is a text node, we need to check for one of two
            // conditions:
            //   1) There is an adjacent text node
            //   2) There is no adjacent text node, but kid is
            //      an empty text node.
            if (kid.getNodeType() == Node.TEXT_NODE) {
                // If an adjacent text node, merge it with kid
                if (next != null && next.getNodeType() == Node.TEXT_NODE) {
                    ((Text) kid).appendData(next.getNodeValue());
                    removeChild(next);
                    next = kid; // Don't advance; there might be another.
                } else {
                    // If kid is empty, remove it
                    if (kid.getNodeValue() == null || kid.getNodeValue().length() == 0) {
                        removeChild(kid);
                    }
                }
            }
            // Otherwise it might be an Element, which is handled recursively
            else if (kid.getNodeType() == Node.ELEMENT_NODE) {
                kid.normalize();
            }
        }

        // We must also normalize all of the attributes
        if (attributes != null) {
            for (int i = 0; i < attributes.getLength(); ++i) {
                Node attr = attributes.item(i);
                attr.normalize();
            }
        }

        // changed() will have occurred when the removeChild() was done,
        // so does not have to be reissued.
        isNormalized(true);
    }
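The branches of that loop can be observed from the outside with a small experiment. The following is only a sketch using the public DOM API; the class NormalizeStructureDemo and its describe helper are invented for this illustration and are not part of Xerces. It builds a tree with adjacent text nodes, a comment, a lone empty text node, and a nested element, then prints the child list before and after normalize().

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NormalizeStructureDemo {
    // Hypothetical test tree: <foo> "a" "b" <!--c--> "" <bar> "x" "y" </bar> </foo>
    public static Element buildTree() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element foo = doc.createElement("foo");
        doc.appendChild(foo);
        foo.appendChild(doc.createTextNode("a"));
        foo.appendChild(doc.createTextNode("b"));   // adjacent text: gets merged
        foo.appendChild(doc.createComment("c"));    // structure: separates text
        foo.appendChild(doc.createTextNode(""));    // lone empty text: gets removed
        Element bar = doc.createElement("bar");
        bar.appendChild(doc.createTextNode("x"));
        bar.appendChild(doc.createTextNode("y"));   // merged via the recursive call
        foo.appendChild(bar);
        return foo;
    }

    // Hypothetical helper: compact rendering of a node's children,
    // T = text, C = comment, E = element (with its own children nested).
    public static String describe(Node parent) {
        StringBuilder sb = new StringBuilder();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (i > 0) {
                sb.append(' ');
            }
            switch (kid.getNodeType()) {
                case Node.TEXT_NODE:
                    sb.append("T(").append(kid.getNodeValue()).append(')');
                    break;
                case Node.COMMENT_NODE:
                    sb.append("C(").append(kid.getNodeValue()).append(')');
                    break;
                case Node.ELEMENT_NODE:
                    sb.append("E(").append(describe(kid)).append(')');
                    break;
                default:
                    sb.append('?');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Element foo = buildTree();
        System.out.println(describe(foo)); // T(a) T(b) C(c) T() E(T(x) T(y))
        foo.normalize();
        System.out.println(describe(foo)); // T(ab) C(c) E(T(xy))
    }
}
```

The comment keeps "ab" apart from whatever follows it, which matches the "only structure separates Text nodes" wording of the specification.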

Hope this saves you some time.

