Convert XML/HTML Entities into Unicode String in Python [duplicate]
Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.
Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:
import re, htmlentitydefs### Removes HTML or XML character references and entities from a text string.## @param text The HTML (or XML) source text.# @return The plain text, as a Unicode string, if necessary.def unescape(text): def fixup(m): text = m.group(0) if text[:2] == "&#": # character reference try: if text[:3] == "&#x": return unichr(int(text[3:-1], 16)) else: return unichr(int(text[2:-1])) except ValueError: pass else: # named entity try: text = unichr(htmlentitydefs.name2codepoint[text[1:-1]]) except KeyError: pass return text # leave as is return re.sub("&#?\w+;", fixup, text)
The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
up to Python 3.4:
import HTMLParserh = HTMLParser.HTMLParser()h.unescape('© 2010') # u'\xa9 2010'h.unescape('© 2010') # u'\xa9 2010'
Python 3.4+:
import htmlhtml.unescape('© 2010') # u'\xa9 2010'html.unescape('© 2010') # u'\xa9 2010'