How do I fix wrongly nested / unclosed HTML tags?
Using BeautifulSoup:
from BeautifulSoup import BeautifulSoup
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html)
print soup.prettify()
gets you
<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>
As far as I know, you can't stop prettify() from putting the <li> and </li> tags on separate lines from Foo.
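If the extra line breaks are unwanted, one option is to skip prettify() and convert the soup to a string instead. The sketch below is an assumption on my part rather than the setup used above: it targets Python 3 and the newer bs4 package (installed as beautifulsoup4) with the html.parser backend. Other backends such as lxml or html5lib may repair the fragment differently.

```python
# Python 3 sketch; assumes the third-party beautifulsoup4 package is installed.
from bs4 import BeautifulSoup

html = "<p><ul><li>Foo"

# "html.parser" is the parser bundled with Python; naming it explicitly
# keeps the result from depending on which other parsers are installed.
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)        # closing tags added, no extra line breaks
pretty = soup.prettify()   # one tag per line, indented

print(compact)  # <p><ul><li>Foo</li></ul></p>
```

The compact form keeps the <li>, the text and the </li> together on one line, at the cost of having no indentation at all.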
Using Tidy:
import tidy
html = "<p><ul><li>Foo"
print tidy.parseString(html, show_body_only=True)
gets you
<ul><li>Foo</li></ul>
Unfortunately, I know of no way to keep the <p> tag in the example. A paragraph cannot contain a list, so the <ul> start tag implicitly closes the open <p>, and Tidy interprets it as an empty paragraph rather than an unclosed one. So doing
print tidy.parseString(html, show_body_only=True, drop_empty_paras=False)
comes out as
<p></p><ul><li>Foo</li></ul>
Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.
Finally, Tidy can also do indenting:
print tidy.parseString(html, show_body_only=True, indent=True)
becomes
<ul>
  <li>Foo
  </li>
</ul>
All of these have their ups and downs, but hopefully one of them is close enough.
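If neither library is an option, the basic idea of closing unclosed tags can be sketched with only the Python 3 standard library. This is a rough illustration, not what BeautifulSoup or Tidy actually do internally: the names TagCloser, close_tags and VOID_TAGS are made up for this example, it knows nothing about HTML nesting rules (so it keeps the <p> around the <ul>), and it silently drops comments and doctype declarations.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Echo the input and add closing tags for anything left open."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the repaired markup
        self.open = []   # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: echo it, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray closing tag: drop it
        # Close any inner tags that were left open, then the tag itself.
        while self.open:
            name = self.open.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def close_tags(html):
    """Return html with wrongly nested or unclosed tags closed."""
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end, innermost first.
    while parser.open:
        parser.out.append("</%s>" % parser.open.pop())
    return "".join(parser.out)


print(close_tags("<p><ul><li>Foo"))  # <p><ul><li>Foo</li></ul></p>
print(close_tags("<b><i>x</b>"))     # <b><i>x</i></b>
```

For anything beyond simple fragments, one of the real libraries above is the safer choice, since they handle far more malformed input than this stack-based approach.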