How do I unescape HTML entities in a string in Python 3.1? [duplicate]
You could use the function html.unescape:
In Python 3.4+ (thanks to J.F. Sebastian for the update):
    import html
    html.unescape('Suzy &amp; John')  # 'Suzy & John'
    html.unescape('&quot;')  # '"'
In Python 3.3 or older (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):
    import html.parser
    html.parser.HTMLParser().unescape('Suzy &amp; John')
In Python 2:
    import HTMLParser
    HTMLParser.HTMLParser().unescape('Suzy &amp; John')
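If the same code has to run on several of these versions, the three variants can be folded into one import that picks whichever is available. This is only a sketch: binding the result to a module-level name `unescape` is my own choice, and the fallback branches are only reached on the older interpreters.

```python
try:
    # Python 3.4+: the function lives directly in the html module.
    from html import unescape
except ImportError:
    try:
        # Python 3.0 to 3.3: use the (since removed) HTMLParser method.
        from html.parser import HTMLParser
    except ImportError:
        # Python 2: same class, different module name.
        from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

print(unescape('Suzy &amp; John'))     # Suzy & John
print(unescape('&quot;&#39;&#x41;'))   # "'A
```

Named, decimal and hexadecimal references are all handled, so no extra lookup table is needed.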
You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library, and is portable between Python 2.x and Python 3.x. By default it only unescapes &amp;, &lt; and &gt;.
    >>> import xml.sax.saxutils as saxutils
    >>> saxutils.unescape("Suzy &amp; John")
    'Suzy & John'
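Since saxutils.unescape leaves everything except &amp;, &lt; and &gt; alone, quotes need to be passed in through its optional second argument, a dict of extra replacements. A small sketch of that, where the dict name `extra_entities` and the sample text are made up for illustration:

```python
from xml.sax.saxutils import unescape

# Extra replacements on top of the built-in &amp;, &lt; and &gt;.
extra_entities = {'&quot;': '"', '&apos;': "'"}

text = '&quot;Suzy &amp; John&quot; &lt;3'

print(unescape(text))                   # &quot;Suzy & John&quot; <3
print(unescape(text, extra_entities))   # "Suzy & John" <3
```

This still does not cover numeric references such as &#39; or the rest of the named HTML entities, so it is best suited to XML-style input.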
Apparently I don't have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotation marks. The only thing I found that did was this function:
    import re
    from htmlentitydefs import name2codepoint as n2cp

    def decodeHtmlentities(string):
        def substitute_entity(match):
            ent = match.group(2)
            if match.group(1) == "#":
                return unichr(int(ent))
            else:
                cp = n2cp.get(ent)
                if cp:
                    return unichr(cp)
                else:
                    return match.group()
        entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")
        return entity_re.subn(substitute_entity, string)[0]
Which I got from this page.
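That function is Python 2 only (htmlentitydefs and unichr are gone in Python 3). A rough Python 3 port might look like the following. The name `decode_html_entities` and the try/except around the numeric branch are my additions, not part of the original. Like the original, it only knows the HTML 4 names in name2codepoint and leaves hexadecimal references such as &#x41; untouched.

```python
import re
from html.entities import name2codepoint

# Same pattern as the original: optional '#', then digits or a short name.
ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

def decode_html_entities(text):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # Decimal reference; anything non-decimal is left as-is.
            try:
                return chr(int(ent))
            except ValueError:
                return match.group()
        cp = name2codepoint.get(ent)
        # Unknown names are returned unchanged.
        return chr(cp) if cp else match.group()
    return ENTITY_RE.sub(substitute_entity, text)

print(decode_html_entities('&quot;Suzy &amp; John&quot; &bogus;'))
# "Suzy & John" &bogus;
```

On Python 3.4+ html.unescape already covers all of this, so the port is mainly of interest for older 3.x versions.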