
How to download any(!) webpage with correct charset in python?


When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')
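
For a complete run, here is a minimal sketch (the URL is a placeholder) that decodes the body with the header charset, falling back to ISO-8859-1, HTTP's historical default, when the server sends none:

import urllib2

fp = urllib2.urlopen("http://example.com/")
# The Content-Type header may carry e.g. "text/html; charset=utf-8"
charset = fp.headers.getparam('charset') or 'iso-8859-1'
text = fp.read().decode(charset, 'replace')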

You can use BeautifulSoup to locate a meta element in the HTML:

soup = BeautifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'})
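
If that turns up a tag, the charset still has to be pulled out of its content attribute; a minimal sketch continuing from the snippet above (meta comes from that snippet):

import re

charset = None
if meta:
    # content typically looks like "text/html; charset=utf-8"
    match = re.search(r'charset=([\w-]+)', meta[0].get('content', ''), re.I)
    if match:
        charset = match.group(1)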

If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.


Use the Universal Encoding Detector:

>>> import urllib2
>>> import chardet
>>> chardet.detect(urllib2.urlopen("http://google.cn/").read())
{'encoding': 'GB2312', 'confidence': 0.99}
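
The result can then drive the decode; a minimal sketch, assuming data holds the raw page bytes and using an arbitrary 0.5 confidence threshold before falling back to UTF-8:

import chardet

result = chardet.detect(data)
# Use the guess only if chardet is reasonably confident.
encoding = result['encoding'] if result['confidence'] > 0.5 else 'utf-8'
text = data.decode(encoding, 'replace')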

The other option would be to just use wget:

import os
h = os.popen('wget -q -O foo1.txt http://foo.html')
h.close()
s = open('foo1.txt').read()


It seems like you need a hybrid of the answers presented; a combined sketch follows the list:

  1. Fetch the page using urllib
  2. Find <meta> tags using BeautifulSoup or another method
  3. If no meta tags exist, check the headers returned by urllib
  4. If that still doesn't give you an answer, use the universal encoding detector.
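
A minimal sketch of that pipeline, assuming Python 2 with BeautifulSoup 3 and chardet installed (fetch_unicode and the details of each step are illustrative, not prescriptive):

import re
import urllib2
import BeautifulSoup
import chardet

def fetch_unicode(url):
    # 1. Fetch the page.
    fp = urllib2.urlopen(url)
    data = fp.read()

    # 2. Look for a <meta http-equiv="Content-Type" content="...; charset=..."> tag.
    charset = None
    soup = BeautifulSoup.BeautifulSoup(data)
    meta = soup.findAll('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'})
    if meta:
        match = re.search(r'charset=([\w-]+)', meta[0].get('content', ''), re.I)
        if match:
            charset = match.group(1)

    # 3. Fall back to the HTTP Content-Type header.
    if not charset:
        charset = fp.headers.getparam('charset')

    # 4. Last resort: let the universal encoding detector guess.
    if not charset:
        charset = chardet.detect(data)['encoding'] or 'iso-8859-1'

    return data.decode(charset, 'replace')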

I honestly don't believe you're going to find anything better than that.

In fact, if you read further into the FAQ you linked to in the comments on the other answer, that's what the author of the detector library advocates.

If you believe the FAQ, this is what browsers do (as requested in your original question), since the detector is a port of the Firefox sniffing code.