How to determine the encoding of text?
EDIT: chardet seems to be unmaintained, but most of the answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.
Correctly detecting the encoding every time is impossible.
(From chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.
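A minimal sketch of how chardet is typically called. `chardet.detect` takes bytes and returns a dict with `encoding` and `confidence` keys. The `guess_encoding` helper and its `min_confidence` threshold are hypothetical names chosen for illustration, and the helper simply returns `None` when chardet is not installed or the guess is too weak.

```python
def guess_encoding(data, min_confidence=0.5):
    """Return chardet's best guess for `data` (bytes), or None.

    `guess_encoding` and `min_confidence` are illustrative names,
    not part of chardet itself.
    """
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        return None  # assumption: treat a missing library as "no guess"

    result = chardet.detect(data)  # dict with 'encoding' and 'confidence'
    if result["encoding"] is None or result["confidence"] < min_confidence:
        return None
    return result["encoding"]


sample = "Příliš žluťoučký kůň úpěl ďábelské ódy".encode("utf-8")
print(guess_encoding(sample))
```

The result is only a guess. Short inputs in particular give the detector little to work with, so it is worth checking the confidence before trusting the encoding.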
You can also use UnicodeDammit. It will try the following methods:
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
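To illustrate the second step (looking at the first few bytes), here is a small standard-library sketch that checks for a Unicode byte order mark. It is not Beautiful Soup's implementation, it only covers the BOM case (not EBCDIC or ASCII), and `sniff_bom` is a hypothetical helper name.

```python
import codecs

# Check the 4-byte UTF-32 BOMs first: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if `data` (bytes) starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


blob = codecs.BOM_UTF8 + "déjà vu".encode("utf-8")
encoding = sniff_bom(blob)
if encoding is not None:
    # The "utf-8-sig", "utf-16" and "utf-32" codecs consume the BOM on decode.
    print(encoding, blob.decode(encoding))
```

With Beautiful Soup installed, UnicodeDammit runs the whole sequence for you: `UnicodeDammit(blob).original_encoding` gives the encoding it settled on.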
Another option is libmagic, through its Python bindings:

    import magic

    blob = open('unknown-file', 'rb').read()
    m = magic.open(magic.MAGIC_MIME_ENCODING)
    m.load()
    encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc
There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding:

    import magic

    blob = open('unknown-file', 'rb').read()
    m = magic.Magic(mime_encoding=True)
    encoding = m.from_buffer(blob)

Some encoding strategies; uncomment to taste:

    #!/bin/bash
    #
    tmpfile=$1

    # Report what file and enca think the encoding is
    echo '-- info about file ........'
    file -i "$tmpfile"
    enca -g "$tmpfile"

    # Convert to UTF-8 with one of the tools below
    echo 'recoding ........'
    #iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
    #enca -x utf-8 "$tmpfile"
    #enca -g "$tmpfile"
    recode CP1250..UTF-8 "$tmpfile"

You might like to check the encoding by opening and reading the file in a loop, but you might need to check the file size first:

    # Python
    import codecs

    encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
    for e in encodings:
        try:
            fh = codecs.open('file.txt', 'r', encoding=e)
            fh.readlines()  # decoding happens while reading
            fh.seek(0)
        except UnicodeDecodeError:
            print('got unicode error with %s , trying different encoding' % e)
        else:
            print('opening the file with encoding: %s ' % e)
            break
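
The same trial-and-error idea can be wrapped in functions that also apply the file size check mentioned above. This is only a sketch: `pick_encoding`, `pick_file_encoding` and the 10 MiB `max_bytes` default are hypothetical names and values chosen for illustration.

```python
import os

CANDIDATES = ("utf-8", "windows-1250", "windows-1252")  # add more


def pick_encoding(data, encodings=CANDIDATES):
    """Return the first encoding that decodes `data` (bytes) without error, else None."""
    for enc in encodings:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
        return enc
    return None


def pick_file_encoding(path, encodings=CANDIDATES, max_bytes=10 * 1024 * 1024):
    """Like pick_encoding, but reads `path` and refuses files larger than `max_bytes`."""
    if os.path.getsize(path) > max_bytes:
        raise ValueError("file too large to read into memory: %s" % path)
    with open(path, "rb") as fh:
        return pick_encoding(fh.read(), encodings)


print(pick_encoding("žluťoučký kůň".encode("windows-1250")))
```

Order matters here. Single-byte encodings such as windows-1250 and windows-1252 accept almost any byte sequence, so a successful decode only shows that the bytes are valid in that encoding, not that it is the right one. Put stricter encodings such as UTF-8 first.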