Decode and access an mbox file with Python's mailbox module
Question:
I need to migrate an email database to a CRM and have two problems:
I can access the mbox file, but the content is not properly decoded.
I want to create a dataframe-like structure with the following columns: "date, from, to, subject, body".
I have tried the following:
for i, message in enumerate(mbox):
    print("from   :", message['from'])
    print("subject:", message['subject'])
    if message.is_multipart():
        content = (part.get_payload(decode=True) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:", content)
    print("**************************************")
    if i == 10:
        break
and get the following output:
from : =?UTF-8?Q?Gonzalo_Gasset_Yba=C3=B1ez?= <[email protected]>
subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=
content: <generator object <genexpr> at 0x7fe025f3a350>
**************************************
from : Mailtrack Reminder <[email protected]>
subject: Re: Presupuesto de Logotipo y =?utf-8?Q?Dise=C3=B1o?= Corporativo
para nuevo proyecto
content: b'<!DOCTYPE html>\r\n<html>\r\n<head>\r\n <meta charset="utf-8">\r\n <meta name="viewport" content="width=device-width">\r\n <title>Reminder</title>\r\n</head>\r\n<style media="screen">\r\n body {\r\n font-family: Helvetica;\r\n }\r\n</style>\r\n<body style="background-color: #f6f6f6; -webkit-font-smoothing: antialiased; font-size: 14px; line-height: 1.4; margin: 0; padding: 0; .....
Answers:
The concrete implementations of mailbox.Mailbox accept a factory argument that is used to build messages. Passing the parse method of a BytesParser initialised with the default policy makes the mailbox produce EmailMessage objects, which decode headers and body text automatically.
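To make the effect of the policy concrete, here is a minimal sketch that parses the same raw bytes twice, once with the legacy compat32 policy and once with the default policy. The message, its names and its example.com address are made up for illustration; only the parser and policy calls come from the standard library.

```python
from email.parser import BytesParser
from email.policy import compat32, default

# Hypothetical raw message with RFC 2047 encoded headers and a
# quoted-printable body (invented for this illustration).
raw = (
    b"From: =?UTF-8?Q?Jos=C3=A9_Garc=C3=ADa?= <jose@example.com>\r\n"
    b"Subject: =?UTF-8?Q?Dise=C3=B1o_corporativo?=\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Marqu=C3=A9s de Vargas\r\n"
)

# compat32 is the parser's default policy: headers stay in encoded form.
legacy = BytesParser(policy=compat32).parsebytes(raw)
# The "default" policy builds EmailMessage objects and decodes headers.
modern = BytesParser(policy=default).parsebytes(raw)

legacy_subject = str(legacy["subject"])
modern_subject = str(modern["subject"])
# get_content() is only available on EmailMessage; it returns decoded text.
modern_body = modern.get_content()

print("compat32:", legacy_subject)
print("default :", modern_subject)
print("body    :", modern_body.strip())
```

The compat32 result still carries the `=?UTF-8?Q?...?=` form seen in the question's output, while the default policy yields readable text.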
Selecting the actual body is trickier, and perhaps depends on your particular requirements. In the code sample below, any "text" type parts are joined together, while non-text parts are rejected. You might wish to apply your own selection criteria.
from email.parser import BytesParser
from email.policy import default
import mailbox

# path_to_mailbox is the filesystem path of the mbox file to read
mbox = mailbox.mbox(path_to_mailbox, factory=BytesParser(policy=default).parse)

for message in mbox:
    print("date   :", message['date'])
    print("to     :", message['to'])
    print("from   :", message['from'])
    print("subject:", message['subject'])
    if message.is_multipart():
        contents = []
        for part in message.walk():
            # Reject containers and non-text types
            if part.get_content_maintype() != 'text':
                continue
            contents.append(part.get_content())
        # Separate the text parts with a blank line
        content = '\n\n'.join(contents)
    else:
        content = message.get_content()
    print("content:", content)
    print("**************************************")
@snakecharmerb, sorry, I am not allowed to comment on your answer.
From your code I used the line starting with mbox =. That works fine for me.
Alas, mypy is not very pleased with that:
error: Argument "factory" to "mbox" has incompatible type "Callable[[BinaryIO, bool], Message]"; expected
"Optional[Callable[[IO[Any]], mboxMessage]]" [arg-type]
mbox = mailbox.mbox(mboxfile, factory=BytesParser(policy=default).parse)
^
Can you help me there? Thanks in advance.
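One possible workaround, sketched here without having been checked against any particular mypy or typeshed version: silence the reported error code on the constructor call and cast each yielded message to EmailMessage, which is what the parser actually returns under the default policy. The function name iter_email_messages is hypothetical.

```python
import mailbox
from email.message import EmailMessage
from email.parser import BytesParser
from email.policy import default
from typing import Iterator, cast


def iter_email_messages(path: str) -> Iterator[EmailMessage]:
    """Yield the messages of an mbox file typed as EmailMessage."""
    # The stubs expect a factory returning mboxMessage, so the error code
    # quoted in the mypy output is silenced on this line only.
    box = mailbox.mbox(
        path, factory=BytesParser(policy=default).parse  # type: ignore[arg-type]
    )
    try:
        for message in box:
            # cast() has no runtime effect; it only tells the type checker
            # what the factory really produced.
            yield cast(EmailMessage, message)
    finally:
        box.close()
```

If the stubs change and the error disappears, mypy's warn_unused_ignores option would flag the now unnecessary comment, so the ignore can be removed at that point.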