Decoding HTML entities with Python

python unicode character-encoding content-type beautifulsoup

>>> from HTMLParser import HTMLParser>>> print HTMLParser().unescape('U.S. Adviser&#8217;s Blunt Memo on Iraq: '...                             'Time &#8216;to Go Home&#8217;')U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’

The function is undocumented in Python 2. It is fixed in Python 3.4+: it is exposed as html.unescape() there.

python unicode character-encoding content-type beautifulsoup

Actually what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example     all mean U+00A0 NO-BREAK SPACE.

  (the type you have) is a "numeric character reference" (decimal).
  is a "numeric character reference" (hexadecimal).
is an entity.

Further reading: http://htmlhelp.com/reference/html40/entities/

Here you will find code for Python2.x that does all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html

python unicode character-encoding content-type beautifulsoup

This does work:

from BeautifulSoup import BeautifulStoneSoups = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)

If you want a string instead of a Unicode object, you'll need to decode it to an encoding that supports the characters being used; ISO-8859-1 doesn't:

result = decoded.encode("UTF-8")

It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)

CodeHunter

Decoding HTML entities with Python

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last