How to determine the encoding of text?

python encoding text-files

EDIT: chardet seems to be unmaintained, but most of the answer still applies. Check https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding every time is impossible.

(From chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0aQqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
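A quick illustration of why only a guess is possible: the same bytes can decode without error under several encodings, each yielding different text. This is a minimal sketch; the sample bytes and the list of candidate encodings are arbitrary choices for illustration.

```python
# The same two bytes are valid in many single-byte encodings,
# so decoding without an error proves nothing on its own.
raw = b"\xe9\xe8"

candidates = ["latin-1", "cp1252", "cp1251", "koi8-r"]  # arbitrary sample list
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    print(f"{name:8} -> {text!r}")

# These bytes are not valid UTF-8, which is the one thing we can rule out here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)
```

Ruling encodings out is easy; choosing among the ones that remain is where the statistical guessing comes in.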

The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
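As a rough sketch of how chardet is typically called: chardet.detect takes bytes and returns a dictionary that includes the guessed encoding and a confidence score. The helper name guess_encoding is hypothetical, chardet is a third-party package that has to be installed separately, and the helper simply returns None when it is missing.

```python
def guess_encoding(data: bytes):
    """Return (encoding, confidence) as guessed by chardet, or None if chardet is absent."""
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        return None
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence' keys
    return result["encoding"], result["confidence"]


# A Czech pangram encoded as windows-1250, used here only as sample input.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy.".encode("windows-1250")
print(guess_encoding(sample))  # a guess, not a guarantee; short inputs are unreliable
```

Treat the confidence value as a hint: with short or mostly-ASCII input, the guess can easily be wrong.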

You can also use UnicodeDammit from Beautiful Soup. It tries the following methods, in order:

1. An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
2. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
3. An encoding sniffed by the chardet library, if you have it installed.
4. UTF-8
5. Windows-1252
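
The fallback order above can be sketched with the standard library alone. This is not Beautiful Soup's implementation: it skips the in-document declaration and chardet steps, only sniffs byte order marks, and the function name sniff_and_decode is made up for illustration.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_and_decode(data: bytes):
    """Return (text, encoding_used) following a BOM -> UTF-8 -> Windows-1252 order."""
    # Step 1: look at the first few bytes for a byte order mark.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name), name
    # Step 2: strict UTF-8 rejects most non-UTF-8 input, so try it next.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3: last resort; replace the few bytes Windows-1252 leaves undefined.
    return data.decode("windows-1252", errors="replace"), "windows-1252"


print(sniff_and_decode(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
print(sniff_and_decode("naïve".encode("utf-8")))
print(sniff_and_decode(b"caf\xe9"))
```

Putting Windows-1252 last matters because it accepts almost any byte sequence, so it would mask every encoding tried after it.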


Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. It can determine the encoding of a file by doing:

    import magic

    blob = open('unknown-file', 'rb').read()
    m = magic.open(magic.MAGIC_MIME_ENCODING)
    m.load()
    encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding, by doing:

    import magic

    blob = open('unknown-file', 'rb').read()
    m = magic.Magic(mime_encoding=True)
    encoding = m.from_buffer(blob)


Some encoding strategies; uncomment to taste:

    #!/bin/bash
    #
    tmpfile=$1
    echo '-- info about file ........'
    file -i "$tmpfile"
    enca -g "$tmpfile"
    echo 'recoding ........'
    #iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
    #enca -x utf-8 "$tmpfile"
    #enca -g "$tmpfile"
    recode CP1250..UTF-8 "$tmpfile"

You can also check the encoding by trying to open and read the file with each candidate encoding in a loop, but you might want to check the file size first:

    # Python
    import codecs

    encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
    for e in encodings:
        try:
            # Decoding errors only surface once the data is actually read.
            fh = codecs.open('file.txt', 'r', encoding=e)
            fh.readlines()
            fh.seek(0)
        except UnicodeDecodeError:
            print('got unicode error with %s , trying different encoding' % e)
        else:
            print('opening the file with encoding:  %s ' % e)
            break
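
If the file may be large, one option is to test only a bounded prefix instead of reading everything. The sketch below assumes a hypothetical helper first_working_encoding and an arbitrary 64 KiB sample size; it uses an incremental decoder so that a multi-byte character cut off at the end of the sample is not reported as an error. A match on the prefix does not guarantee that the rest of the file decodes.

```python
import codecs
import os
import tempfile


def first_working_encoding(path, encodings, sample_size=64 * 1024):
    """Return the first encoding that decodes the file's leading bytes, else None."""
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    # If the whole file fit in the sample, there is no truncated tail to tolerate.
    whole_file = os.path.getsize(path) <= sample_size
    for name in encodings:
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            # final=False lets an incomplete trailing sequence pass silently.
            decoder.decode(sample, final=whole_file)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Usage with a throwaway file whose last byte (0x81) rules out some candidates.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc\x81")
print(first_working_encoding(tmp.name, ["utf-8", "windows-1252", "latin-1"]))
os.remove(tmp.name)
```

As with the loop above, the order of the candidate list decides the result whenever more than one encoding accepts the bytes, so put the strictest encodings first.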
