A good way to get the charset/encoding of an HTTP response in Python

python character-encoding httprequest urllib2

To parse an HTTP header, you could use cgi.parse_header():

    import cgi

    _, params = cgi.parse_header('text/html; charset=utf-8')
    print(params['charset'])  # -> utf-8
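On newer Python versions the cgi module is no longer available (it was deprecated in 3.11 and removed in 3.13). Below is a minimal sketch of the same parsing using the standard library's email.message.Message, assuming Python 3. The helper name charset_from_content_type is made up for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['content-type'] = header_value
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type('text/html; charset=UTF-8'))  # -> utf-8
print(charset_from_content_type('text/html'))                 # -> None
```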

Or using the response object:

    import urllib2  # Python 2; in Python 3 use urllib.request

    response = urllib2.urlopen('http://example.com')
    response_encoding = response.headers.getparam('charset')
    # or in Python 3: response.headers.get_content_charset(default)
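For Python 3, the same idea might look like the sketch below. fetch_text is a hypothetical helper name, and the UTF-8 fallback is an assumption rather than something the server guarantees. The data: URL is there only so the example runs without network access; an http:// URL is handled the same way.

```python
from urllib.request import urlopen


def fetch_text(url, fallback='utf-8'):
    """Fetch url and decode the body with the charset from its headers."""
    with urlopen(url) as response:
        # response.headers is an email.message.Message, so the charset
        # parameter of Content-Type can be read directly.
        charset = response.headers.get_content_charset(failobj=fallback)
        body = response.read()
    # errors='replace' keeps the sketch from failing if the server lied.
    return body.decode(charset, errors='replace'), charset


text, charset = fetch_text('data:text/plain;charset=utf-8,caf%C3%A9')
print(charset)  # -> utf-8
```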

In general, the server may lie about the encoding or not report it at all (the default depends on the content type), or the encoding might be specified inside the response body, e.g., in a <meta> element in HTML documents or in the XML declaration for XML documents. As a last resort, the encoding could be guessed from the content itself.
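To make that fallback chain concrete, here is a rough standard-library sketch, assuming Python 3. The function name detect_encoding, the order of the checks (header, then byte order mark, then in-body declaration, then a default), the 1024-byte scan window and the regular expressions are all assumptions for illustration. Real HTML/XML sniffing has more rules than this, and only UTF-8 and UTF-16 byte order marks are checked.

```python
import codecs
import re
from email.message import Message

# Byte order marks and the codec that consumes them.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]
# <meta charset="..."> or <meta http-equiv=... content="...; charset=...">
META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.I)
# <?xml version="1.0" encoding="..."?>
XML_RE = re.compile(rb'^\s*<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)', re.I)


def detect_encoding(content_type, body, default='utf-8'):
    """Best-effort encoding name for a response body given as bytes."""
    # 1. charset parameter of the Content-Type header, if any
    if content_type:
        msg = Message()
        msg['content-type'] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # 2. byte order mark at the start of the body
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 3. declaration inside the first bytes of the document
    head = body[:1024]
    for pattern in (XML_RE, META_RE):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    # 4. nothing found: fall back to the caller's default
    return default


html = b'<html><head><meta charset="windows-1252"></head></html>'
print(detect_encoding('text/html', html))  # -> windows-1252
```

Guessing from the byte statistics of the content (what chardet does) would be a further step after these checks and is not attempted here.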

You could use requests to get Unicode text:

    import requests  # pip install requests

    r = requests.get(url)
    unicode_str = r.text  # may use `chardet` to auto-detect encoding

Or BeautifulSoup to parse HTML (and convert to Unicode as a side effect):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    soup = BeautifulSoup(urllib2.urlopen(url))  # may use `cchardet` for speed
    # ...

Or bs4.UnicodeDammit directly for arbitrary content (not necessarily HTML):

    from bs4 import UnicodeDammit

    dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
    print(dammit.unicode_markup)
    # -> Sacré bleu!
    print(dammit.original_encoding)
    # -> utf-8
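If bs4 is not installed, a much cruder stand-in for "guess from the content" is to try a short list of candidate encodings in order. This is only a sketch, assuming Python 3. It is not how UnicodeDammit or chardet work internally, and both the name guess_decode and the candidate list are assumptions.

```python
def guess_decode(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Decode bytes with the first candidate encoding that succeeds."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit the data')


print(guess_decode(b"Sacr\xc3\xa9 bleu!"))  # -> ('Sacré bleu!', 'utf-8')
print(guess_decode(b"Sacr\xe9 bleu!"))      # -> ('Sacré bleu!', 'cp1252')
```

UTF-8 goes first because arbitrary non-UTF-8 bytes rarely happen to decode as valid UTF-8. latin-1 goes last because it accepts any byte sequence, so it never fails and can silently produce wrong text.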


If you happen to be familiar with the Flask/Werkzeug web development stack, you will be happy to know that the Werkzeug library handles exactly this kind of HTTP header parsing, and it accounts for the case where the content type is not specified at all, as you wanted.

    >>> from werkzeug.http import parse_options_header
    >>> import requests
    >>> url = 'http://some.url.value'
    >>> resp = requests.get(url)
    >>> if resp.status_code == requests.codes.ok:
    ...     content_type_header = resp.headers.get('content-type')
    ...     print(content_type_header)
    ...
    text/html; charset=utf-8
    >>> parse_options_header(content_type_header)
    ('text/html', {'charset': 'utf-8'})

So then you can do:

    >>> parse_options_header(content_type_header)[1].get('charset')
    'utf-8'

Note that if the charset is not supplied, this produces an empty options dict instead:

    >>> parse_options_header('text/html')
    ('text/html', {})
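When the options come back empty, the caller still has to pick an encoding. Here is a sketch of one way to do that, assuming Python 3 and using the standard library's email.message.Message instead of Werkzeug so that it runs without extra packages. The name charset_or_default and the defaults table are assumptions: ISO-8859-1 for text/* follows the old RFC 2616 default, and UTF-8 for JSON follows the JSON specification.

```python
from email.message import Message

# Hypothetical per-type defaults used when no charset parameter is present.
TYPE_DEFAULTS = {'application/json': 'utf-8'}


def charset_or_default(content_type, fallback='utf-8'):
    """Charset from a Content-Type value, else a type-dependent default."""
    if not content_type:
        return fallback
    msg = Message()
    msg['content-type'] = content_type
    charset = msg.get_content_charset()
    if charset:
        return charset
    media_type = msg.get_content_type()
    if media_type in TYPE_DEFAULTS:
        return TYPE_DEFAULTS[media_type]
    if msg.get_content_maintype() == 'text':
        return 'iso-8859-1'  # historical HTTP/1.1 default for text/*
    return fallback


print(charset_or_default('text/html; charset=utf-8'))  # -> utf-8
print(charset_or_default('text/html'))                 # -> iso-8859-1
print(charset_or_default('application/json'))          # -> utf-8
```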

It even works if you supply nothing but an empty string or dict:

    >>> parse_options_header({})
    ('', {})
    >>> parse_options_header('')
    ('', {})

Thus it seems to be EXACTLY what you were looking for! If you look at the source code, you will see they had your purpose in mind: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/http.py#L320-329

    def parse_options_header(value):
        """Parse a ``Content-Type`` like header into a tuple with the content
        type and the options:

        >>> parse_options_header('text/html; charset=utf8')
        ('text/html', {'charset': 'utf8'})

        This should not be used to parse ``Cache-Control`` like headers that use
        a slightly different format.  For these headers use the
        :func:`parse_dict_header` function.
        ...

Hope this helps someone some day! :)


The requests library makes this easy:

    >>> import requests
    >>> r = requests.get('http://some.url.value')
    >>> r.encoding
    'utf-8'  # e.g.
