Decoding double encoded utf8 in Python

python string utf-8 decode

>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'

Yow, that was fun!

>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'

So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. You go via the integer value of each of those characters to get back to a genuine UTF-8 string, which you then decode as normal.
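The session above is Python 2, where `str` holds bytes. As a rough sketch of the same round trip in Python 3, assuming the doubly encoded data arrives as a `bytes` object (the helper name `fix_double_utf8` is hypothetical, not something from the original answer):

```python
def fix_double_utf8(data: bytes) -> str:
    """Undo one extra layer of UTF-8 encoding (hypothetical helper)."""
    # First decode: gives text whose characters are really UTF-8 byte values.
    first_decode = data.decode('utf-8')
    # Go via the integer value of each character to rebuild the real bytes.
    # bytes() raises ValueError if any character is above U+00FF.
    as_bytes = bytes(ord(ch) for ch in first_decode)
    # Second decode: these bytes are genuine UTF-8.
    return as_bytes.decode('utf-8')


original = b'Rafa\xc3\x85\xc2\x82'
result = fix_double_utf8(original)
print(ascii(result))  # 'Rafa\u0142'
```

The `ValueError` case is worth noticing: it signals that the input was not simply encoded twice, so the helper fails loudly rather than producing more mojibake.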

>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'

latin1 is just an abbreviation for Richie's nuts'n'bolts method.
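A small check of that claim, written for Python 3 as a sketch (the variable names are illustrative only): latin1 maps each code point from 0 to 255 to the byte with the same value, which is exactly what the character-by-character conversion does by hand.

```python
# Every character that latin1 can represent: code points 0..255.
all_low = ''.join(chr(i) for i in range(256))

# The nuts-and-bolts route: one byte per character, via its integer value.
by_hand = bytes(ord(ch) for ch in all_low)

# The codec route.
via_codec = all_low.encode('latin1')

same = by_hand == via_codec
print(same)  # True
```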

It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know for sure exactly how the OP's client did the transformation from 'Rafa\xc5\x82' to u'Rafa\xc5\x82' and then to reverse that process exactly -- otherwise we might come unstuck if different data crops up before the double encoding is fixed.
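One way to probe the question is to feed both codecs a character above U+00FF. The sketch below is for Python 3 and uses illustrative variable names. It suggests the two codecs agree only while every character fits in a single byte: past that point raw_unicode_escape emits a literal `\uXXXX` escape, whereas latin1 refuses to encode.

```python
# Characters up to U+00FF: both codecs emit one byte per character.
low = 'Rafa\xc5\x82'
same_for_low = low.encode('latin1') == low.encode('raw_unicode_escape')

# A character above U+00FF: the codecs diverge.
high = 'Rafa\u0142'
escaped = high.encode('raw_unicode_escape')  # backslash-u escape as ASCII bytes

try:
    high.encode('latin1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(same_for_low)   # True
print(escaped)        # b'Rafa\\u0142'
print(latin1_failed)  # True
```

So for data that may not be purely double encoded, latin1 is arguably the safer choice of the two, because it raises an error instead of silently writing escape sequences into the output.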
