
How do I preserve line breaks when using jsoup to convert html to plain text?


A solution that actually preserves line breaks looks like this:

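public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // Disable pretty-printing so html() preserves line breaks and spacing.
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // Mark breaks with a literal backslash-n placeholder (two characters, not a newline).
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // Turn the placeholders into real newlines.
    String s = document.html().replaceAll("\\\\n", "\n");
    // Strip all remaining tags without reformatting the text.
    // (Newer jsoup releases rename Whitelist to Safelist.)
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}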

It satisfies the following requirements:

  1. If the original HTML contains a newline (\n), it is preserved.
  2. If the original HTML contains br or p tags, they get translated to newlines (\n): one for each br, two before each p.
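
The `"\\n"` passed to `append` and `prepend` in `br2nl` is easy to misread: it is a two-character placeholder (a backslash followed by `n`), not a line break, and the later `replaceAll("\\\\n", "\n")` is what turns it into a real newline. Here is a minimal sketch of just that substitution, using only the standard library. The class name, method name, and sample string are made up for illustration, and no jsoup is involved:

```java
class EscapedNewlineDemo {
    // Converts the literal two-character sequence backslash + 'n' into a real
    // newline, mirroring the replaceAll call in br2nl.
    static String unescapeNewlines(String s) {
        // The Java literal "\\\\n" is the regex \\n, which matches a backslash followed by 'n'.
        return s.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String marked = "first\\nsecond"; // contains backslash + n, not a line break
        System.out.println(marked.length());                   // 13
        System.out.println(unescapeNewlines(marked).length()); // 12
    }
}
```

The placeholder is presumably there so that the marker survives the HTML round trip as ordinary text before being swapped for a newline.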


With

Jsoup.parse("A\nB").text();

the output is

"A B"

and not

A
B

For this I'm using:

descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
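
To see what the two `replaceAll` calls do on their own, here is a small sketch that uses only the standard library. The class and method names are made up for illustration, and the jsoup `text()` call that strips the tags in between is left out:

```java
class BrPlaceholderDemo {
    // Replaces every <br> variant with the "br2n" placeholder.
    // (?i) makes the match case-insensitive; [^>]* absorbs a trailing
    // slash or any attributes up to the closing '>'.
    static String markBreaks(String html) {
        return html.replaceAll("(?i)<br[^>]*>", "br2n");
    }

    // Turns the placeholder back into a newline once the tags are gone.
    static String restoreBreaks(String text) {
        return text.replaceAll("br2n", "\n");
    }

    public static void main(String[] args) {
        System.out.println(markBreaks("a<br>b<BR/>c<br class=\"x\" />d"));
        // prints: abr2nbbr2ncbr2nd
        System.out.println(restoreBreaks("abr2nb")); // prints "a" and "b" on separate lines
    }
}
```

Two caveats with this approach: the pattern matches any tag whose name merely starts with `br`, and input text that already contains the string `br2n` would also be turned into a newline, so a less likely placeholder may be safer.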


Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                           String baseUri,
                           Whitelist whitelist,
                           Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.