
Convert HTML entities to Unicode and vice versa


As for the "vice versa" (which I needed myself; it led me to this question, which didn't help, and then to another site that had the answer):

u'some string'.encode('ascii', 'xmlcharrefreplace')

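will return a plain (byte) string in which every non-ASCII character is replaced by a numeric XML (HTML) character reference such as &#233;. ASCII characters that are special in HTML, such as & and <, are left untouched.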
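For readers on Python 3, the same call can be sketched roughly as follows. This is an illustration rather than part of the original answer, and the helper name `to_char_refs` is made up for the example. In Python 3 `encode` returns `bytes`, so the sketch decodes the result back to `str`.

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # encode() yields bytes in Python 3; decode to get a str back.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


result = to_char_refs('Curé & «Grâce»')
print(result)  # Cur&#233; & &#171;Gr&#226;ce&#187;
```

Note that the `&` passes through unchanged: the error handler only touches characters that cannot be encoded as ASCII.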


You need to have BeautifulSoup installed (the code uses the BeautifulSoup 3 module name and Python 2 syntax).

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
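The same round trip can probably be done on Python 3 with the standard library alone. The sketch below is an assumption-laden equivalent, not the answer's own code: it swaps BeautifulSoup for `html.unescape` and `cgi.escape` (removed in Python 3.8) for `html.escape`, and the function names `entities_to_unicode` and `unicode_to_entities` are invented for the example.

```python
import html


def entities_to_unicode(text):
    """Turn named and numeric character references into the characters they denote."""
    return html.unescape(text)


def unicode_to_entities(text):
    """Escape &, <, > and replace non-ASCII characters with numeric references."""
    # quote=False mirrors the default behaviour of the old cgi.escape().
    escaped = html.escape(text, quote=False)
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')


text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = entities_to_unicode(text)
htmlent = unicode_to_entities(uni)

print(uni)      # &, ®, <, >, ¢, £, ¥, €, §, ©
print(htmlent)  # &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
```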


Update for Python 2.7 and BeautifulSoup4

Unescape -- Unicode HTML to unicode with HTMLParser (Python 2.7 standard lib):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
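On Python 3 the `HTMLParser.unescape` method is no longer available (it was removed in 3.9); the module-level `html.unescape` function covers the same ground. A small sketch, with variable names chosen only for illustration, showing that it accepts named, decimal and hexadecimal references alike:

```python
import html

# Three spellings of the same character, U+00E9.
named = html.unescape('Cur&eacute;')
decimal = html.unescape('Cur&#233;')
hexadecimal = html.unescape('Cur&#xe9;')

print(named, decimal, hexadecimal)  # Curé Curé Curé
```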

Unescape -- Unicode HTML to unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape -- Unicode to unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
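If installing bs4 is not an option, named entities can be approximated with the standard library's `html.entities.codepoint2name` table. The following is a rough Python 3 sketch, not bs4's implementation: `to_named_entities` is a hypothetical helper that falls back to numeric references for characters the table does not cover.

```python
import html
from html.entities import codepoint2name


def to_named_entities(text):
    """Escape &, <, > and replace non-ASCII characters with entities."""
    out = []
    for ch in html.escape(text, quote=False):
        code = ord(ch)
        if code < 128:
            out.append(ch)                             # plain ASCII stays as-is
        elif code in codepoint2name:
            out.append('&%s;' % codepoint2name[code])  # named entity, e.g. &eacute;
        else:
            out.append('&#%d;' % code)                 # numeric fallback
    return ''.join(out)


escaped = to_named_entities('Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood')
print(escaped)
# Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood
```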