Convert XML/HTML Entities into Unicode String in Python [duplicate]


Python 2 has the htmlentitydefs module (renamed html.entities in Python 3), but it doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of ElementTree, among other things) has such a function on his website, which works with decimal, hex and named entities:

    import re, htmlentitydefs

    ##
    # Removes HTML or XML character references and entities from a text string.
    #
    # @param text The HTML (or XML) source text.
    # @return The plain text, as a Unicode string, if necessary.

    def unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text # leave as is
        return re.sub("&#?\w+;", fixup, text)
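
The function above is Python 2 code (unichr, htmlentitydefs). For anyone on Python 3, here is a rough port of the same idea. It is a sketch, not Lundh's code: the name unescape_py3 is made up for illustration, and it assumes chr and html.entities.name2codepoint as the Python 3 counterparts of the Python 2 names.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs


def unescape_py3(text):
    """Replace character references and named entities in text.

    Hypothetical Python 3 port of the function above; references that
    cannot be decoded are left untouched.
    """

    def fixup(match):
        token = match.group(0)
        if token[:2] == "&#":
            # Numeric character reference: hex (&#x...;) or decimal (&#...;).
            try:
                if token[:3] == "&#x":
                    return chr(int(token[3:-1], 16))
                return chr(int(token[2:-1]))
            except (ValueError, OverflowError):
                return token  # malformed or out of range: leave as is
        # Named entity such as &copy;
        try:
            return chr(name2codepoint[token[1:-1]])
        except KeyError:
            return token  # unknown name: leave as is

    return re.sub(r"&#?\w+;", fixup, text)
```

Calling unescape_py3("&copy; 2010") should give the copyright sign followed by " 2010", while an unknown name such as "&bogus;" passes through unchanged.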


The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

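Python 2 (in Python 3.0 to 3.3 the class lives in html.parser instead; the method was deprecated in 3.4 and removed in 3.9):

    import HTMLParser
    h = HTMLParser.HTMLParser()
    h.unescape('&copy; 2010') # u'\xa9 2010'
    h.unescape('&#169; 2010') # u'\xa9 2010'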

Python 3.4+:

    import html
    html.unescape('&copy; 2010') # '\xa9 2010', i.e. '© 2010'
    html.unescape('&#169; 2010') # '\xa9 2010', i.e. '© 2010'
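
If the same code has to run on several Python versions, one way to pick whichever unescape is available is a chain of import fallbacks. This is only a sketch; the fallback order is my assumption about what is importable where (html.unescape on 3.4+, html.parser on 3.0 to 3.3, HTMLParser on Python 2).

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
    # Older versions only offer the undocumented method on a parser instance.
    unescape = HTMLParser().unescape

named = unescape("&copy; 2010")     # named entity
decimal = unescape("&#169; 2010")   # decimal character reference
markup = unescape("&lt;b&gt;")      # escaped angle brackets
```

After this, the rest of the program can call unescape() without caring which branch supplied it.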


Use the builtin unichr (Python 2; it is chr in Python 3). BeautifulSoup isn't necessary:

    >>> entity = '&#x01ce'
    >>> unichr(int(entity[3:],16))
    u'\u01ce'
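
The slice above assumes a hex reference with no trailing semicolon. A slightly more forgiving Python 3 variant might look like the following; decode_char_ref is a hypothetical helper name, and it only handles a single numeric reference, not named entities or whole strings.

```python
def decode_char_ref(entity):
    """Decode one numeric reference such as '&#x01ce;' or '&#462'."""
    body = entity.rstrip(";")  # the trailing semicolon is optional here
    if not body.startswith("&#"):
        raise ValueError("not a numeric character reference: %r" % entity)
    digits = body[2:]
    if digits[:1] in ("x", "X"):
        # Hexadecimal form: &#x01ce;
        return chr(int(digits[1:], 16))
    # Decimal form: &#462;
    return chr(int(digits))
```

Both decode_char_ref('&#x01ce') and decode_char_ref('&#462;') refer to the same code point, U+01CE. Anything that is not a numeric reference raises ValueError instead of being silently mangled.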