
Sanitize/Rewrite HTML on the Client Side


Update 2016: There is now a Google Closure package based on the Caja sanitizer.

It has a cleaner API, was rewritten to take into account APIs available on modern browsers, and interacts better with Closure Compiler.


Shameless plug: see caja/plugin/html-sanitizer.js for a client-side HTML sanitizer that has been thoroughly reviewed.

It works from a whitelist rather than a blacklist, and the whitelists are configurable as described in CajaWhitelists.


If you want to remove all tags, then do the following:

    var tagBody = '(?:[^"\'>]|"[^"]*"|\'[^\']*\')*';

    var tagOrComment = new RegExp(
        '<(?:'
        // Comment body.
        + '!--(?:(?:-*[^->])*--+|-?)'
        // Special "raw text" elements whose content should be elided.
        + '|script\\b' + tagBody + '>[\\s\\S]*?</script\\s*'
        + '|style\\b' + tagBody + '>[\\s\\S]*?</style\\s*'
        // Regular name
        + '|/?[a-z]'
        + tagBody
        + ')>',
        'gi');

    function removeTags(html) {
      var oldHtml;
      // Repeat until nothing changes, since removing one tag can expose another.
      do {
        oldHtml = html;
        html = html.replace(tagOrComment, '');
      } while (html !== oldHtml);
      // Escape any stray '<' left over so it cannot start a tag.
      return html.replace(/</g, '&lt;');
    }

People will tell you that you can create an element, and assign innerHTML and then get the innerText or textContent, and then escape entities in that. Do not do that. It is vulnerable to XSS injection since <img src=bogus onerror=alert(1337)> will run the onerror handler even if the node is never attached to the DOM.
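If escaping is all you need, it can be done with plain string replacement, so the browser never parses the untrusted markup. This is only a minimal sketch: `escapeHtml` and `HTML_ESCAPES` are names made up for illustration, not part of Caja, and the function escapes markup rather than stripping it.

```javascript
// Hypothetical helper (not from Caja): escapes text for use in HTML using
// only string operations, so no element is created and no handler can fire.
var HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
};

function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, function (ch) {
    return HTML_ESCAPES[ch];
  });
}

console.log(escapeHtml('<img src=bogus onerror=alert(1337)>'));
// → &lt;img src=bogus onerror=alert(1337)&gt;
```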


The Google Caja HTML sanitizer can be made "web-ready" by embedding it in a web worker. Any global variables introduced by the sanitizer will be contained within the worker, plus processing takes place in its own thread.

For browsers that do not support Web Workers, we can use an iframe as a separate environment for the sanitizer to work in. Timothy Chien has a polyfill that does just this, using iframes to simulate Web Workers, so that part is done for us.

The Caja project has a wiki page on how to use Caja as a standalone client-side sanitizer:

  • Check out the source, then build by running ant
  • Include html-sanitizer-minified.js or html-css-sanitizer-minified.js in your page
  • Call html_sanitize(...)
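For the last step, here is a rough sketch of what a call with a restrictive URL policy might look like. `allowHttpOnly` and `identity` are hypothetical names. The sketch assumes the sanitizer hands the policy the URL as a string (or something that stringifies to one) and drops the attribute when the policy returns null. Check both assumptions against the Caja version you build.

```javascript
// Hypothetical URL policy: keep only absolute http(s) URLs.
// Assumption: returning null makes the sanitizer drop the attribute.
function allowHttpOnly(url) {
  var text = String(url);
  return /^https?:\/\//i.test(text) ? text : null;
}

// Leave names, ids and classes untouched.
function identity(s) {
  return s;
}

// Only runs on a page (or worker) where the Caja script has been loaded.
if (typeof html_sanitize === 'function') {
  var clean = html_sanitize(
      '<a href="javascript:alert(1)">click</a>', allowHttpOnly, identity);
  console.log(clean);
}
```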

The worker script only needs to follow those instructions:

    importScripts('html-css-sanitizer-minified.js'); // or 'html-sanitizer-minified.js'

    var urlTransformer, nameIdClassTransformer;

    // customize if you need to filter URLs and/or ids/names/classes
    urlTransformer = nameIdClassTransformer = function(s) { return s; };

    // when we receive some HTML
    self.onmessage = function(event) {
        // sanitize, then send the result back
        postMessage(html_sanitize(event.data, urlTransformer, nameIdClassTransformer));
    };

(A bit more code is needed to get the simworker library working, but it's not important to this discussion.)
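On the page side, talking to that worker could look roughly like this. It is a sketch, not the demo's actual code: `sanitizeInWorker` is a hypothetical helper and `sanitizer-worker.js` is an assumed file name for the worker script above.

```javascript
// Hypothetical helper: sends one HTML string to the sanitizer worker and
// hands the sanitized result to a callback. It overwrites worker.onmessage,
// so it supports only one request in flight at a time.
function sanitizeInWorker(worker, dirtyHtml, onClean) {
  worker.onmessage = function (event) {
    onClean(event.data); // event.data is whatever the worker posted back
  };
  worker.postMessage(dirtyHtml);
}

// Browser usage; 'sanitizer-worker.js' is an assumed name for the worker script.
if (typeof Worker === 'function' && typeof document !== 'undefined') {
  var worker = new Worker('sanitizer-worker.js');
  sanitizeInWorker(worker, '<b onclick="evil()">hi</b>', function (clean) {
    console.log(clean);
  });
}
```

Taking the worker as a parameter keeps the helper usable with a real Worker, the iframe-based polyfill, or a stub in tests.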

Demo: https://dl.dropbox.com/u/291406/html-sanitize/demo.html


Never trust the client. If you're writing a server application, assume that the client will always submit unsanitized, malicious data. It's a rule of thumb that will keep you out of trouble. If you can, I would advise doing all validation and sanitization in server code, which you know (to a reasonable degree) won't be tampered with. Perhaps you could use a server-side web application as a proxy for your client-side code: it would fetch the content from the third party and sanitize it before sending it on to the client.

[edit] I'm sorry, I misunderstood the question. However, I stand by my advice. Your users will probably be safer if you sanitize on the server before sending it to them.