How do I fix wrongly nested / unclosed HTML tags? (python)

How do I fix wrongly nested / unclosed HTML tags?


Using BeautifulSoup:

from BeautifulSoup import BeautifulSoup
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html)
print soup.prettify()

gets you

<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>

As far as I know, you can't stop prettify() from putting the <li> and </li> tags on separate lines from Foo.
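On Python 3 the package is imported as bs4, and a rough equivalent looks like the sketch below. It assumes the beautifulsoup4 package is installed, and the explicit "html.parser" backend is my choice for the example, not something the snippet above specified; other backends such as lxml or html5lib may repair the same input differently.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

html = "<p><ul><li>Foo"

# Name the parser explicitly: different backends repair broken markup
# differently, so relying on the default can give inconsistent results.
soup = BeautifulSoup(html, "html.parser")

fixed = str(soup)       # compact markup with the missing end tags added
print(fixed)
print(soup.prettify())  # same tree, indented one node per line
```

str(soup) gives the repaired markup without the extra line breaks, which may be enough if the prettify() layout is the only objection.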

Using Tidy:

import tidy
html = "<p><ul><li>Foo"
print tidy.parseString(html, show_body_only=True)

gets you

<ul><li>Foo</li></ul>

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print tidy.parseString(html, show_body_only=True, drop_empty_paras=False)

comes out as

<p></p><ul><li>Foo</li></ul>

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indenting:

print tidy.parseString(html, show_body_only=True, indent=True)

becomes

<ul>
  <li>Foo
  </li>
</ul>

All of these have their ups and downs, but hopefully one of them is close enough.


Run it through Tidy or one of its ported libraries.

Try to code it by hand and you will want to gouge your eyes out.
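For a sense of what coding it by hand involves, here is a minimal sketch using only the standard library's html.parser. It keeps a stack of open tags and closes whatever is left over. The names TagCloser, close_tags and VOID are made up for this example, the void-element list is deliberately partial, and comments, doctypes and implied end tags (such as a <p> ending before a <ul>) are not handled at all, which is where the real pain starts.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TagCloser(HTMLParser):
    """Hypothetical helper: re-emits HTML and closes tags left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v)}"' for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after `tag`, then `tag` itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def close_tags(html):
    parser = TagCloser()
    parser.feed(html)
    parser.close()  # flushes any text still buffered at end of input
    # Close whatever is still open.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(close_tags("<p><ul><li>Foo"))
```

This copes with the example from the question and with simple misnesting like <b><i>x</b></i>, but every HTML rule it skips is another special case to write, so a real parser is still the better choice.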


Use html5lib, it works great. Like this:

from bs4 import BeautifulSoup  # the html5lib package must be installed too
soup = BeautifulSoup(data, 'html5lib')