How to convert HTML page to plain text in node.js?


Use jsdom and jQuery (server-side).

With jQuery you can remove all scripts, styles, templates and the like, and then extract the text.

Example

(This is not tested with jsdom and node, only in Chrome)

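// Remove elements whose contents are not readable text
jQuery('script').remove();
jQuery('noscript').remove();
jQuery('style').remove();
// Extract the remaining text and collapse runs of whitespace
jQuery('body').text().replace(/\s{2,}/g, ' ');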


For those searching for a regex solution, here is mine:

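const HTMLPartToTextPart = (HTMLPart) => (
  HTMLPart
    // Drop the line breaks of the HTML source itself
    .replace(/\n/ig, '')
    // Remove <style>, <head> and <script> blocks together with their contents
    .replace(/<style[^>]*>[\s\S]*?<\/style[^>]*>/ig, '')
    .replace(/<head[^>]*>[\s\S]*?<\/head[^>]*>/ig, '')
    .replace(/<script[^>]*>[\s\S]*?<\/script[^>]*>/ig, '')
    // Turn closing </p>, </div> and <br> tags into line breaks
    .replace(/<\/\s*(?:p|div)>/ig, '\n')
    .replace(/<br[^>]*\/?>/ig, '\n')
    // Strip every remaining tag
    .replace(/<[^>]*>/ig, '')
    // Replace every non-breaking space entity with a regular space
    .replace(/&nbsp;/g, ' ')
    // Collapse runs of spaces and tabs, keeping line breaks
    .replace(/[^\S\r\n][^\S\r\n]+/ig, ' ')
);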
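A regex approach like the one above strips tags but leaves most character entities, such as &amp;amp; or &amp;lt;, in the output. Below is a minimal sketch of a follow-up step, assuming only a handful of named entities plus numeric references matter. The names decodeBasicEntities and NAMED_ENTITIES are hypothetical and not part of the answer above.

```javascript
// Hypothetical helper (an assumption, not from the original answer):
// decodes a small set of named entities and numeric character references.
const NAMED_ENTITIES = new Map([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
  ['nbsp', ' '], // mapped to a regular space, as in the regex answer
]);

const decodeBasicEntities = (text) =>
  text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      // Numeric reference: "#169" is decimal, "#x263A" is hexadecimal
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched rather than throwing
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Named reference: unknown names are left as they are
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });

console.log(decodeBasicEntities('Tom &amp; Jerry &lt;3'));
```

Run it on the text after the tags have been stripped. Decoding first could turn an escaped &amp;lt;b&amp;gt; into something the tag-stripping regex would then delete. For full entity coverage a dedicated library is probably a better fit than a hand-written table.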


As another answer suggested, use JSDOM, but you don't need jQuery. Try this:

const { JSDOM } = require('jsdom');
const text = JSDOM.fragment(sourceHtml).textContent;