Convert HTML entities to Unicode and vice versa
As for the "vice versa" (which is what I needed myself; this question didn't cover it, but another site had the answer):
u'some string'.encode('ascii', 'xmlcharrefreplace')
will return a plain (byte) string with any non-ASCII characters turned into XML (HTML) numeric character references, such as &#233; for é.
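On Python 3 the same call works on any `str`, but the result is `bytes`, so a `.decode('ascii')` is needed to get text back. A minimal sketch, with a made-up sample string that is not from the original answer:

```python
# Python 3: str.encode(..., 'xmlcharrefreplace') returns bytes, not str.
sample = 'café costs €5 & up'  # illustrative input only

as_bytes = sample.encode('ascii', 'xmlcharrefreplace')
as_text = as_bytes.decode('ascii')  # back to str, now pure ASCII

print(as_bytes)
print(as_text)
```

Note that `&`, `<` and `>` are already ASCII, so this call passes them through untouched; it only replaces characters outside the ASCII range.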
You need to have BeautifulSoup installed (the `from BeautifulSoup import ...` form used here is the BeautifulSoup 3 API).
    from BeautifulSoup import BeautifulStoneSoup
    import cgi

    def HTMLEntitiesToUnicode(text):
        """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
        text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
        return text

    def unicodeToHTMLEntities(text):
        """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
        # cgi.escape handles &, < and >; xmlcharrefreplace handles non-ASCII.
        text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
        return text

    text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

    uni = HTMLEntitiesToUnicode(text)
    htmlent = unicodeToHTMLEntities(uni)

    print uni
    print htmlent
    # &, ®, <, >, ¢, £, ¥, €, §, ©
    # &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
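For readers on Python 3, where `unicode` and the `print` statement no longer exist and `cgi.escape` has been removed, the same round trip can be sketched with the standard library's `html` module alone. The snake_case function names are my own renames for illustration, and this is a sketch of an equivalent, not the answer's original method:

```python
import html

def html_entities_to_unicode(text):
    """Turn references such as '&amp;' or '&#174;' into the characters they name."""
    return html.unescape(text)

def unicode_to_html_entities(text):
    """Escape &, <, > and replace non-ASCII characters with numeric references."""
    # quote=False leaves quotes alone, mirroring cgi.escape's default behaviour.
    escaped = html.escape(text, quote=False)
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')

text = '&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;'
uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
```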
Update for Python 2.7 and BeautifulSoup4
Unescape -- Unicode HTML to unicode with htmlparser
(Python 2.7 standard lib):
    >>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
    >>> from HTMLParser import HTMLParser
    >>> htmlparser = HTMLParser()
    >>> unescaped = htmlparser.unescape(escaped)
    >>> unescaped
    u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
    >>> print unescaped
    Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
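`HTMLParser.unescape` is a Python 2 era API. On Python 3.4 and later, the standard library offers the same operation as the function `html.unescape`, which accepts named, decimal and hexadecimal references alike. A short sketch with made-up inputs:

```python
import html

# Three spellings of the same character, é (U+00E9).
named = html.unescape('Cur&eacute;')
decimal = html.unescape('Cur&#233;')
hexadecimal = html.unescape('Cur&#xe9;')

print(named, decimal, hexadecimal)
```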
Unescape -- Unicode HTML to unicode with bs4
(BeautifulSoup4):
    >>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(html)
    >>> soup.text
    u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
    >>> print soup.text
    Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape -- Unicode to unicode HTML with bs4
(BeautifulSoup4):
    >>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
    >>> from bs4.dammit import EntitySubstitution
    >>> escaper = EntitySubstitution()
    >>> escaped = escaper.substitute_html(unescaped)
    >>> escaped
    u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
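If installing bs4 is not an option, named entities can also be produced with the standard library's `html.entities.codepoint2name` table (Python 3). The helper below, `escape_with_named_entities`, is a hypothetical name of mine and only a rough sketch in the same spirit; it is not guaranteed to match `substitute_html` character for character. Characters without an HTML 4 name fall back to numeric references:

```python
from html.entities import codepoint2name

def escape_with_named_entities(text):
    """Replace characters with named entities where an HTML 4 name exists."""
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            # Covers &, <, >, " as well as Latin-1 letters and symbols.
            pieces.append('&' + name + ';')
        elif ord(ch) > 127:
            # No named entity: fall back to a numeric character reference.
            pieces.append('&#%d;' % ord(ch))
        else:
            pieces.append(ch)
    return ''.join(pieces)

escaped = escape_with_named_entities('Curé of the «Grâce» & я')
print(escaped)
```

Because `&` is itself escaped, running this over text that already contains entities will double-escape them, so it should only be applied to unescaped input.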