How to detect string byte encoding? How to detect string byte encoding? python python

How to detect string byte encoding?


Use chardet library. It is super easy

import chardetthe_encoding = chardet.detect('your string')['encoding']

and that's it!

in python3 you need to provide type bytes or bytearray so:

import chardetthe_encoding = chardet.detect(b'your string')['encoding']


if your files either in cp1252 and utf-8, then there is an easy way.

import loggingdef force_decode(string, codecs=['utf8', 'cp1252']):    for i in codecs:        try:            return string.decode(i)        except UnicodeDecodeError:            pass    logging.warn("cannot decode url %s" % ([string]))for item in os.listdir(rootPath):    #Convert to Unicode    if isinstance(item, str):        item = force_decode(item)    print item

otherwise, there is a charset detect lib.

Python - detect charset and convert to utf-8

https://pypi.python.org/pypi/chardet


You also can use json package to detect encoding.

import jsonjson.detect_encoding(b"Hello")