Decode HTML entities in Python string?

Python 3.4+

Use html.unescape():

    import html
    print(html.unescape('&pound;682m'))
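As a small sketch of what html.unescape() accepts, the snippet below decodes the same character written as a named, a decimal and a hexadecimal reference. The sample strings and variable names are made up for illustration. It also shows that decoding is a single pass, so double-escaped input needs a second call.

```python
import html

# Three spellings of the same pound sign (illustrative inputs).
samples = {
    "named": "&pound;682m",
    "decimal": "&#163;682m",
    "hex": "&#xA3;682m",
}

# Decode each sample with the standard-library helper.
decoded = {kind: html.unescape(text) for kind, text in samples.items()}
for kind, text in decoded.items():
    print(kind, text)

# unescape() makes one pass only: "&amp;pound;" becomes "&pound;",
# and a second call is needed to reach the actual character.
once = html.unescape("&amp;pound;682m")
twice = html.unescape(once)
print(once, twice)
```

If your input may have been escaped more than once, decide explicitly how many passes to apply rather than looping until the string stops changing, since that can turn literal text such as "&amp;lt;" into markup.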

FYI: html.parser.HTMLParser.unescape() has been deprecated since Python 3.4. It was supposed to be removed in 3.5 but was left in by mistake, and it was finally removed in Python 3.9, so prefer html.unescape() on any current version.

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

    >>> try:
    ...     # Python 2.6-2.7
    ...     from HTMLParser import HTMLParser
    ... except ImportError:
    ...     # Python 3
    ...     from html.parser import HTMLParser
    ...
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m

You can also use the six compatibility library to simplify the import:

    >>> from six.moves.html_parser import HTMLParser
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m
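If one code base has to cover both version ranges described above, a possible approach is to prefer html.unescape() and fall back to the HTMLParser method only when it is missing. This is a minimal sketch, not a vetted compatibility layer. The module-level name unescape is simply the alias chosen here.

```python
try:
    # Python 3.4+: the supported, non-deprecated function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bind the method of a parser instance so callers get a plain function.
    unescape = HTMLParser().unescape

print(unescape('&pound;682m'))
```

Trying html.unescape first matters because HTMLParser.unescape() no longer exists on recent Python 3 releases, so the import order in the older snippet alone would fail there.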

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3

    >>> from BeautifulSoup import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>",
    ...               convertEntities=BeautifulSoup.HTML_ENTITIES)
    <p>£682m</p>

Beautiful Soup 4

    >>> from bs4 import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>")
    <html><body><p>£682m</p></body></html>

You can also use replace_entities from the w3lib.html module:

    In [202]: from w3lib.html import replace_entities

    In [203]: replace_entities("&pound;682m")
    Out[203]: u'\xa3682m'

    In [204]: print replace_entities("&pound;682m")
    £682m