Map HTML to JSON [closed]
I just wrote this function that does what you want; try it out and let me know if it doesn't work correctly for you:
// Test with an element.
var initElement = document.getElementsByTagName("html")[0];
var json = mapDOM(initElement, true);
console.log(json);

// Test with a string.
initElement = "<div><span>text</span>Text2</div>";
json = mapDOM(initElement, true);
console.log(json);

function mapDOM(element, json) {
    var treeObject = {};
    // Declared locally so they do not leak as implicit globals
    var parser, docNode;

    // If string convert to document Node
    if (typeof element === "string") {
        if (window.DOMParser) {
            parser = new DOMParser();
            docNode = parser.parseFromString(element, "text/xml");
        } else { // Microsoft strikes again
            docNode = new ActiveXObject("Microsoft.XMLDOM");
            docNode.async = false;
            docNode.loadXML(element);
        }
        element = docNode.firstChild;
    }

    // Recursively loop through DOM elements and assign properties to object
    function treeHTML(element, object) {
        object["type"] = element.nodeName;
        var nodeList = element.childNodes;
        if (nodeList != null) {
            if (nodeList.length) {
                object["content"] = [];
                for (var i = 0; i < nodeList.length; i++) {
                    if (nodeList[i].nodeType == 3) {
                        // Text node: store its value directly
                        object["content"].push(nodeList[i].nodeValue);
                    } else {
                        object["content"].push({});
                        treeHTML(nodeList[i], object["content"][object["content"].length - 1]);
                    }
                }
            }
        }
        if (element.attributes != null) {
            if (element.attributes.length) {
                object["attributes"] = {};
                for (var i = 0; i < element.attributes.length; i++) {
                    object["attributes"][element.attributes[i].nodeName] = element.attributes[i].nodeValue;
                }
            }
        }
    }
    treeHTML(element, treeObject);

    return (json) ? JSON.stringify(treeObject) : treeObject;
}
Working example: http://jsfiddle.net/JUSsf/ (Tested in Chrome, I can't guarantee full browser support - you will have to test this).
It creates an object that contains the tree structure of the HTML page in the format you requested and then uses JSON.stringify(), which is included in most modern browsers (IE8+, Firefox 3.5+, etc.). If you need to support older browsers, you can include json2.js.
It can take either a DOM element or a string containing valid XHTML as an argument. (I believe so, at least: I'm not sure whether DOMParser() will choke in certain situations because it is set to "text/xml", or whether it just doesn't provide error handling. Unfortunately, "text/html" has poor browser support.)
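On the error-handling question: with "text/xml", DOMParser generally does not throw on malformed input; it returns a document containing a parsererror element instead. A minimal sketch of a stricter wrapper follows. The name parseXmlStrict and the injectable ParserCtor parameter are assumptions made for illustration (the parameter only exists so the function can be exercised outside a browser); they are not part of the function above.

```javascript
// parseXmlStrict is a hypothetical helper, not part of mapDOM above.
// ParserCtor is optional and defaults to the browser's DOMParser.
function parseXmlStrict(source, ParserCtor) {
  ParserCtor = ParserCtor || (typeof DOMParser !== "undefined" ? DOMParser : null);
  if (typeof ParserCtor !== "function") {
    throw new Error("No DOMParser available in this environment");
  }
  var doc = new ParserCtor().parseFromString(source, "text/xml");
  // Malformed input yields a document with a <parsererror> element
  // rather than an exception, so check for it explicitly.
  var errors = doc.getElementsByTagName("parsererror");
  if (errors.length > 0) {
    throw new Error("Invalid XML: " + errors[0].textContent);
  }
  return doc.documentElement;
}
```

In a browser, the returned root element could be handed to mapDOM in place of the raw string, so that bad markup fails loudly instead of being mapped as an error document.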
You can easily change the scope of this function by passing a different value as element. Whatever value you pass will be the root of your JSON map.
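To make the output shape concrete, here is a small sketch of the same traversal idea as treeHTML, run against hand-built objects instead of real DOM nodes so that it works outside a browser. The helpers text, el and toTree, and the id attribute, are assumptions for illustration only; real nodes expose the same nodeType, nodeName, nodeValue, childNodes and attributes properties.

```javascript
// Hand-built stand-ins for DOM nodes (hypothetical, for illustration):
// only the properties that the traversal reads are provided.
function text(value) {
  return { nodeType: 3, nodeName: "#text", nodeValue: value, childNodes: [], attributes: null };
}
function el(name, children, attrs) {
  return { nodeType: 1, nodeName: name, childNodes: children || [], attributes: attrs || [] };
}

// Same idea as treeHTML: record the type, then the content, then the attributes.
function toTree(node) {
  var tree = { type: node.nodeName };
  if (node.childNodes && node.childNodes.length) {
    tree.content = [];
    for (var i = 0; i < node.childNodes.length; i++) {
      var child = node.childNodes[i];
      // Text nodes become plain strings; everything else recurses.
      tree.content.push(child.nodeType === 3 ? child.nodeValue : toTree(child));
    }
  }
  if (node.attributes && node.attributes.length) {
    tree.attributes = {};
    for (var j = 0; j < node.attributes.length; j++) {
      tree.attributes[node.attributes[j].nodeName] = node.attributes[j].nodeValue;
    }
  }
  return tree;
}

var span = el("span", [text("text")]);
var div = el("div", [span, text("Text2")], [{ nodeName: "id", nodeValue: "box" }]);

console.log(JSON.stringify(toTree(div)));
// {"type":"div","content":[{"type":"span","content":["text"]},"Text2"],"attributes":{"id":"box"}}

// Passing a different node simply moves the root of the map.
console.log(JSON.stringify(toTree(span)));
// {"type":"span","content":["text"]}
```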
Representing complex HTML documents will be difficult and full of corner cases, but I just wanted to share a couple of techniques to show how to get this kind of program started. This answer differs in that it uses data abstraction and the toJSON method to recursively build the result.
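For readers who have not used toJSON before, here is a minimal sketch of the mechanism on plain data, with no DOM involved. JSON.stringify calls toJSON() on any value that defines it and serializes whatever that method returns, so a wrapper that returns more wrappers gets walked recursively. Tree, label and kids are hypothetical names chosen for this illustration.

```javascript
// Tree is a hypothetical wrapper used only to illustrate toJSON.
// Each wrapper describes one level; JSON.stringify drives the recursion
// by calling toJSON() on every nested wrapper it encounters.
const Tree = ({ label, kids = [] }) => ({
  toJSON: () => ({ label, children: kids.map(Tree) })
})

const json = JSON.stringify(
  Tree({ label: 'root', kids: [{ label: 'a', kids: [{ label: 'a1' }] }, { label: 'b' }] })
)

console.log(json)
// {"label":"root","children":[{"label":"a","children":[{"label":"a1","children":[]}]},{"label":"b","children":[]}]}
```

Note that no code in the wrapper loops over the whole tree; each level only says how to encode itself.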
Below, html2json is a tiny function which takes an HTML node as input and returns a JSON string as the result. Pay particular attention to how the code is quite flat but still plenty capable of building a deeply nested tree structure, all possible with virtually zero complexity.
// data Elem = Elem Node
const Elem = e => ({
  toJSON : () => ({
    tagName: e.tagName,
    textContent: e.textContent,
    attributes: Array.from(e.attributes, ({name, value}) => [name, value]),
    children: Array.from(e.children, Elem)
  })
})

// html2json :: Node -> JSONString
const html2json = e =>
  JSON.stringify(Elem(e), null, ' ')

console.log(html2json(document.querySelector('main')))
<main>
  <h1 class="mainHeading">Some heading</h1>
  <ul id="menu">
    <li><a href="/a">a</a></li>
    <li><a href="/b">b</a></li>
    <li><a href="/c">c</a></li>
  </ul>
  <p>some text</p>
</main>
In the previous example, the textContent gets a little butchered. To remedy this, we introduce another data constructor, TextElem. We'll have to map over the childNodes (instead of children) and choose to return the correct data type based on e.nodeType; this gets us a little closer to what we might need.
// data Elem = Elem Node | TextElem Node
const TextElem = e => ({
  toJSON: () => ({
    type: 'TextElem',
    textContent: e.textContent
  })
})

const Elem = e => ({
  toJSON : () => ({
    type: 'Elem',
    tagName: e.tagName,
    attributes: Array.from(e.attributes, ({name, value}) => [name, value]),
    children: Array.from(e.childNodes, fromNode)
  })
})

// fromNode :: Node -> Elem
const fromNode = e => {
  switch (e.nodeType) {
    case 3: return TextElem(e)
    default: return Elem(e)
  }
}

// html2json :: Node -> JSONString
const html2json = e =>
  JSON.stringify(Elem(e), null, ' ')

console.log(html2json(document.querySelector('main')))
<main>
  <h1 class="mainHeading">Some heading</h1>
  <ul id="menu">
    <li><a href="/a">a</a></li>
    <li><a href="/b">b</a></li>
    <li><a href="/c">c</a></li>
  </ul>
  <p>some text</p>
</main>
Anyway, that's just two iterations on the problem. Of course you'll have to address corner cases where they come up, but what's nice about this approach is that it gives you a lot of flexibility to encode the HTML however you wish in JSON, and without introducing too much complexity.
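As one possible third iteration, here is a sketch that handles two corner cases which tend to show up once you map over childNodes: whitespace-only text nodes produced by source indentation, and non-element nodes such as comments. Whether to drop them is a design choice, not something the answer above prescribes. The text, comment and el helpers are hypothetical stand-ins for DOM nodes so the sketch runs without a browser.

```javascript
// Hypothetical stand-ins for DOM nodes (illustration only).
const text = textContent => ({ nodeType: 3, textContent })
const comment = textContent => ({ nodeType: 8, textContent })
const el = (tagName, childNodes = [], attributes = []) =>
  ({ nodeType: 1, tagName, childNodes, attributes })

const TextElem = e => ({
  toJSON: () => ({ type: 'TextElem', textContent: e.textContent })
})

const Elem = e => ({
  toJSON: () => ({
    type: 'Elem',
    tagName: e.tagName,
    attributes: Array.from(e.attributes, ({ name, value }) => [name, value]),
    // keep only the children that fromNode chose to encode
    children: Array.from(e.childNodes, fromNode).filter(x => x !== null)
  })
})

// fromNode :: Node -> Elem | TextElem | null
const fromNode = e => {
  switch (e.nodeType) {
    case 1: return Elem(e)
    // drop text nodes that contain nothing but whitespace
    case 3: return e.textContent.trim() === '' ? null : TextElem(e)
    // skip comments and any other node types
    default: return null
  }
}

const main = el('MAIN', [
  text('\n  '),
  el('H1', [text('Some heading')], [{ name: 'class', value: 'mainHeading' }]),
  comment(' ignored '),
  text('\n')
])

console.log(JSON.stringify(fromNode(main), null, ' '))
```

With real nodes the same fromNode would be applied to document.querySelector('main'); the only structural change from the previous version is the explicit case for element nodes and the null filter.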
In my experience, you could keep iterating with this technique and achieve really good results. If this answer is interesting to anyone and you would like me to expand upon anything, let me know ^_^
Related: Recursive methods using JavaScript: building your own version of JSON.stringify
I collected a few links some time back while reading about ExtJS, a full framework that is itself built around JSON.
http://www.thomasfrank.se/xml_to_json.html
http://camel.apache.org/xmljson.html
Online XML to JSON converter: http://jsontoxml.utilities-online.info/
UPDATE: BTW, to get JSON in the format added in the question, the HTML needs to have type and content tags in it too, like this, or you need to use an XSLT transformation to add these elements while doing the JSON conversion:
<?xml version="1.0" encoding="UTF-8" ?>
<type>div</type>
<content>
  <type>span</type>
  <content>Text2</content>
</content>
<content>Text2</content>