Convert XML/HTML Entities into Unicode String in Python [duplicate]

Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:

import re, htmlentitydefs### Removes HTML or XML character references and entities from a text string.## @param text The HTML (or XML) source text.# @return The plain text, as a Unicode string, if necessary.def unescape(text):    def fixup(m):        text = m.group(0)        if text[:2] == "&#":            # character reference            try:                if text[:3] == "&#x":                    return unichr(int(text[3:-1], 16))                else:                    return unichr(int(text[2:-1]))            except ValueError:                pass        else:            # named entity            try:                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])            except KeyError:                pass        return text # leave as is    return re.sub("&#?\w+;", fixup, text)

python html entities

The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

up to Python 3.4:

import HTMLParserh = HTMLParser.HTMLParser()h.unescape('© 2010') # u'\xa9 2010'h.unescape('&#169; 2010') # u'\xa9 2010'

Python 3.4+:

import htmlhtml.unescape('© 2010') # u'\xa9 2010'html.unescape('&#169; 2010') # u'\xa9 2010'

python html entities

Use the builtin unichr -- BeautifulSoup isn't necessary:

>>> entity = '&#x01ce'>>> unichr(int(entity[3:],16))u'\u01ce'

CodeHunter

Convert XML/HTML Entities into Unicode String in Python [duplicate]

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last