Python – Auto Detect Email Content Encoding

Question:

I am writing a script to process emails, and I have access to the raw string content of the emails.

I am currently looking for the string “Content-Transfer-Encoding:” and scanning the characters that follow immediately after, to determine the encoding. Example encodings: base64 or 7bit or quoted-printable ..

Is there a better way to automatically determine the email encoding(at least a more pythonic way)?

Thank you.

Asked By: sisanared

||

Answers:

Python: Is there a way to determine the encoding of text file? has some good answers. Basically there’s no way to do it perfectly reliably, and the initial approach you’re using is the best (and should be checked first), but if it isn’t there then there are a few options that can work sometimes.

Answered By: Tim Radcliffe

You may use this standard Python package: email.

For example:

import email

raw = """From: John Doe <[email protected]>
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Hi there!
"""

my_email = email.message_from_string(raw)
print my_email["Content-Transfer-Encoding"]

See other examples here.

Answered By: turdus-merula

You can get the decoded message body with
get_payload(decode=True)

import email
import sys

raw = """From: John Doe <[email protected]>
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Hi there!
this is a test =F0=9F=98=80
with accents =C3=A9 =C3=A0
"""

my_email = email.message_from_string(raw)
print(my_email.get_payload(decode=True).decode(sys.stdout.encoding))

get_payload(decode=True) return binary data so you need to decode it with decode(sys.stdout.encoding) or decode("utf-8")

the python doc

Optional decode is a flag indicating whether the payload should be decoded or not, according to the Content-Transfer-Encoding header. When True and the message is not a multipart, the payload will be decoded if this header’s value is quoted-printable or base64. If some other encoding is used, or Content-Transfer-Encoding header is missing, the payload is returned as-is (undecoded). In all cases the returned value is binary data. If the message is a multipart and the decode flag is True, then None is returned. If the payload is base64 and it was not perfectly formed (missing padding, characters outside the base64 alphabet), then an appropriate defect will be added to the message’s defect property (InvalidBase64PaddingDefect or InvalidBase64CharactersDefect, respectively).

Answered By: europrimus
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.