Python - email header decoding UTF-8

This type of encoding is known as MIME encoded-word and the email module can decode it:

from email.header import decode_header
print(decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?="""))

This outputs a list of tuples, each containing a decoded string and the charset it was encoded in (None for unencoded parts), because the format allows different encodings within a single header. To merge them into a single string, convert each part to a shared encoding and concatenate the results, which in Python 2 can be done with the unicode type:

from email.header import decode_header
dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
default_charset = 'ASCII'
# unicode() exists only in Python 2; parts without a charset fall back to default_charset
print(''.join([unicode(t[0], t[1] or default_charset) for t in dh]))
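On Python 3 there is no unicode type, and decode_header can return each part as either bytes or str, so the join above needs a small adjustment. What follows is only a minimal sketch: the helper name decode_mime_header is hypothetical, and replacing undecodable bytes (rather than raising) is an assumption, not something the original answer specifies.

```python
from email.header import decode_header


def decode_mime_header(raw, default_charset="ascii"):
    """Join the parts returned by decode_header() into a single str.

    Hypothetical helper for illustration; not part of the email module.
    """
    parts = []
    for text, charset in decode_header(raw):
        if isinstance(text, bytes):
            # Parts without a declared charset fall back to default_charset.
            parts.append(text.decode(charset or default_charset, errors="replace"))
        else:
            # Headers without any encoded-word come back as a plain str.
            parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    print(decode_mime_header(
        "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
    ))
```

An unknown charset name would still raise LookupError here; whether to catch that depends on how tolerant the caller needs to be.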

Update 2:

The problem with this Subject line not decoding:

Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                     ^

is actually the sender's fault. The header violates RFC 2047, section 5, paragraph 1, which requires encoded-words in a header to be separated by whitespace: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

If need be, you can work around this by pre-processing such malformed headers with a regex that inserts a space after the encoded-word part (unless it is at the end of the header), like so:

import re
header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
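Note that the pattern above uses a greedy `.*`, so a header containing several encoded-words is matched as one span from the first `=?` to the last `?=`, and only the last word gets a space appended. A stricter variant can match each encoded-word individually by relying on the `=?charset?encoding?text?=` shape, in which none of the three fields may contain `?` or whitespace. This is a sketch, not a drop-in replacement: the names ENCODED_WORD and pad_encoded_words are hypothetical, and it only adds a space after a word, never before it.

```python
import re

# One RFC 2047 encoded-word, followed by a non-whitespace character.
# The lookahead (?=\S) skips words that are already followed by whitespace
# or that sit at the end of the header.
ENCODED_WORD = re.compile(r"(=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=)(?=\S)")


def pad_encoded_words(header_value):
    """Insert a space after every encoded-word that runs into the next token."""
    return ENCODED_WORD.sub(r"\1 ", header_value)


if __name__ == "__main__":
    broken = "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
    print(pad_encoded_words(broken))
```

Because the encoded text is matched with `[^?\s]*` rather than a non-greedy `.*?`, Q-encoded words whose text starts with `=` (for example `=?utf-8?q?=5B...?=`) are not cut short at the first `?=`.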


I was just testing with encoded headers in Python 3.3, and I found that this is a very convenient way to deal with them:

>>> from email.header import Header, decode_header, make_header
>>> subject = '[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?='
>>> h = make_header(decode_header(subject))
>>> str(h)
'[ 201105161048 ] GewSt:  Wegfall der Vorläufigkeit'

As you can see it automatically adds whitespace around the encoded words.

It internally keeps the encoded and ASCII header parts separate as you can see when it re-encodes the non-ASCII parts:

>>> h.encode()
'[ 201105161048 ] GewSt: =?utf-8?q?_Wegfall_der_Vorl=C3=A4ufigkeit?='

If you want the whole header re-encoded you could convert the header to a string and then back into a header:

>>> h2 = Header(str(h))
>>> str(h2)
'[ 201105161048 ] GewSt:  Wegfall der Vorläufigkeit'
>>> h2.encode()
'=?utf-8?q?=5B_201105161048_=5D_GewSt=3A__Wegfall_der_Vorl=C3=A4ufigkeit?='
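The decode-then-re-encode round trip from the session above can be wrapped in a small function. This is a sketch under assumptions: the name reencode_header is hypothetical, and passing charset="utf-8" explicitly is a choice made here so that the whole value is encoded as a single charset, not something the answer prescribes.

```python
from email.header import Header, decode_header, make_header


def reencode_header(raw, charset="utf-8"):
    """Decode a raw header value and re-encode all of it in one charset.

    Hypothetical helper for illustration; not part of the email module.
    """
    # Collapse the mixed ASCII / encoded-word parts into a single str first.
    text = str(make_header(decode_header(raw)))
    # Then build a fresh Header so the entire text is encoded uniformly.
    return Header(text, charset=charset).encode()


if __name__ == "__main__":
    raw = "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
    print(reencode_header(raw))
```

The exact output (Q or base64 encoding, line folding) is chosen by the email package and may differ between Python versions, so it is safer to compare decoded values than encoded strings.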


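import email.header

def decode_header(value):
    # Python 2: decode each part to unicode, then join the parts as UTF-8 byte strings
    return ' '.join(item[0].decode(item[1] or 'utf-8').encode('utf-8') for item in email.header.decode_header(value))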
