Python – Auto Detect Email Content Encoding
Question:
I am writing a script to process emails, and I have access to the raw string content of the emails.
I am currently looking for the string “Content-Transfer-Encoding:” and scanning the characters that follow immediately after it to determine the encoding. Example encodings: base64, 7bit, or quoted-printable.
Is there a better (or at least more Pythonic) way to determine the email encoding automatically?
Thank you.
Answers:
The question “Python: Is there a way to determine the encoding of text file?” has some good answers. Basically, there is no way to do it perfectly reliably. Reading the header, as you are doing now, is the best approach and should be tried first; if the header isn’t there, there are a few options that work some of the time.
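A minimal sketch of the "check the header first" idea, using the standard email package instead of string scanning. The function name transfer_encoding is made up for this illustration. The fallback relies on RFC 2045, which says a missing Content-Transfer-Encoding header means 7bit. Note that it only looks at the top-level header of the message.

```python
import email


def transfer_encoding(raw):
    """Return the declared Content-Transfer-Encoding, lowercased.

    Falls back to "7bit", the default RFC 2045 assigns when the
    header is absent. (Hypothetical helper, not part of the email package.)
    """
    msg = email.message_from_string(raw)
    value = msg.get("Content-Transfer-Encoding", "7bit")
    # Header values are case-insensitive, so normalise before comparing.
    return value.strip().lower()


with_header = "Content-Type: text/plain\nContent-Transfer-Encoding: Base64\n\nSGk=\n"
without_header = "Content-Type: text/plain\n\nHi\n"

print(transfer_encoding(with_header))
print(transfer_encoding(without_header))
```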
You may use this standard Python package: email.
For example:
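import email

# Raw message: headers, a blank line, then the body.
raw = """From: John Doe <[email protected]>
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Hi there!
"""
# Parse the raw string into a Message object.
my_email = email.message_from_string(raw)
# Header lookup is case-insensitive and returns None if the header is missing.
print(my_email["Content-Transfer-Encoding"])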
The documentation for the email package has other examples.
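One thing the single-part example does not show: in a multipart message each part carries its own Content-Transfer-Encoding, so one header lookup on the top-level message is not enough. Here is a rough sketch that walks the parts. The helper name part_encodings and the sample message are invented for illustration, and the "7bit" default is the one from RFC 2045.

```python
import email


def part_encodings(raw):
    """List (content type, transfer encoding) for every non-container part.

    Hypothetical helper for illustration only.
    """
    msg = email.message_from_string(raw)
    result = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # multipart containers only hold other parts
        cte = part.get("Content-Transfer-Encoding", "7bit").strip().lower()
        result.append((part.get_content_type(), cte))
    return result


raw = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="sep"

--sep
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

caf=C3=A9
--sep
Content-Type: text/html
Content-Transfer-Encoding: base64

PHA+Y2Fmw6k8L3A+
--sep--
"""

for content_type, cte in part_encodings(raw):
    print(content_type, cte)
```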
You can get the decoded message body with get_payload(decode=True):
import email

raw = """From: John Doe <[email protected]>
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Hi there!
this is a test =F0=9F=98=80
with accents =C3=A9 =C3=A0
"""
my_email = email.message_from_string(raw)
# get_payload(decode=True) undoes the quoted-printable encoding and returns bytes.
body = my_email.get_payload(decode=True)
# The escaped bytes in this message are UTF-8, so decode them as UTF-8.
print(body.decode("utf-8"))
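get_payload(decode=True) returns binary data, so you need to decode it to get a string. Use the character set the message was written in, for example decode("utf-8"). decode(sys.stdout.encoding) gives the same result only when your terminal happens to use the same encoding as the message.
From the Python documentation: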
Optional decode is a flag indicating whether the payload should be decoded or not, according to the Content-Transfer-Encoding header. When True and the message is not a multipart, the payload will be decoded if this header’s value is quoted-printable or base64. If some other encoding is used, or Content-Transfer-Encoding header is missing, the payload is returned as-is (undecoded). In all cases the returned value is binary data. If the message is a multipart and the decode flag is True, then None is returned. If the payload is base64 and it was not perfectly formed (missing padding, characters outside the base64 alphabet), then an appropriate defect will be added to the message’s defect property (InvalidBase64PaddingDefect or InvalidBase64CharactersDefect, respectively).
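Putting the two points from the documentation together (the result is always bytes, and a multipart message returns None), a decoding helper could look roughly like this. It is only a sketch: decoded_text_parts and the fallback parameter are names invented here, and falling back to UTF-8 when no charset is declared is an assumption rather than anything the email package prescribes.

```python
import email


def decoded_text_parts(raw, fallback="utf-8"):
    """Return the text of every text/* part as a list of str.

    Hypothetical helper; the UTF-8 fallback is an assumption.
    """
    msg = email.message_from_string(raw)
    texts = []
    for part in msg.walk():
        # Skip containers (get_payload(decode=True) returns None for them)
        # and anything that is not text.
        if part.is_multipart() or part.get_content_maintype() != "text":
            continue
        data = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or fallback
        try:
            text = data.decode(charset, errors="replace")
        except LookupError:  # charset name Python does not know
            text = data.decode(fallback, errors="replace")
        texts.append(text)
    return texts


raw = """\
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

caf=E9
"""

print(decoded_text_parts(raw))
```

The charset comes from the Content-Type header of each part, which is a separate question from the transfer encoding: the first says which characters the bytes represent, the second only says how those bytes were wrapped for transport.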