Decode and access an mbox file with the Python mailbox module

Question:

I need to migrate an email database to a CRM and have two problems:

I can access the mbox file, but the content is not properly decoded.

I want to create a dataframe-like structure with the following columns: "date, from, to, subject, body"

I have tried the following:

for i, message in enumerate(mbox):
    print("from   :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
        content = (part.get_payload(decode=True) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")

    if i == 10:
        break

and get the following output:

from   : =?UTF-8?Q?Gonzalo_Gasset_Yba=C3=B1ez?= <[email protected]>
subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=
content: <generator object <genexpr> at 0x7fe025f3a350>
**************************************
from   : Mailtrack Reminder <[email protected]>
subject: Re: Presupuesto de Logotipo y =?utf-8?Q?Dise=C3=B1o?= Corporativo
 para nuevo proyecto
content: b'<!DOCTYPE html>\r\n<html>\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width">\r\n    <title>Reminder</title>\r\n</head>\r\n<style media="screen">\r\n    body {\r\n        font-family: Helvetica;\r\n    }\r\n</style>\r\n<body style="background-color: #f6f6f6; -webkit-font-smoothing: antialiased; font-size: 14px; line-height: 1.4; margin: 0; padding: 0; .....
Asked By: Lucas


Answers:

The concrete implementations of mailbox.Mailbox accept a factory argument that can be used to build messages. By passing the parse method of a BytesParser initialised with the default policy we can generate EmailMessages which will decode headers and body text automatically.
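
To illustrate the difference the policy makes, here is a minimal sketch that parses one hand-written message twice, once with the legacy compat32 policy and once with the default policy. The message bytes and the address sender@example.com are made up for the example; only the encoded subject is taken from the question.

```python
from email.parser import BytesParser
from email.policy import compat32, default

# Hypothetical message; the subject reuses the encoded header from the question.
raw = (
    b"From: sender@example.com\r\n"
    b"Subject: =?UTF-8?Q?Marqu=C3=A9s_de_Vargas_=26_Baud?=\r\n"
    b"\r\n"
    b"Hello\r\n"
)

# compat32 is the legacy policy: headers come back as the raw encoded strings.
legacy = BytesParser(policy=compat32).parsebytes(raw)
# The default policy produces EmailMessage objects with decoded headers.
modern = BytesParser(policy=default).parsebytes(raw)

legacy_subject = str(legacy["subject"])
modern_subject = str(modern["subject"])

print("compat32:", legacy_subject)
print("default :", modern_subject)
```

The second line printed should be the readable subject, which is the behaviour the factory argument brings to every message read from the mailbox.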

Selecting the actual body is trickier, and perhaps depends on your particular requirements. In the code sample below, any "text" type parts are joined together, while non-text parts are rejected. You might wish to apply your own selection criteria.

from email.parser import BytesParser
from email.policy import default
import mailbox

mbox = mailbox.mbox(path_to_mailbox, factory=BytesParser(policy=default).parse)

for message in mbox:
    print("date   :", message['date'])
    print("to     :", message['to'])
    print("from   :", message['from'])
    print("subject:", message['subject'])
    if message.is_multipart():
        contents = []
        for part in message.walk():
            maintype = part.get_content_maintype()
            if maintype != 'text':
                # Reject containers (multipart) and other non-text types
                continue
            contents.append(part.get_content())
        content = '\n\n'.join(contents)
    else:
        content = message.get_content()
    print("content:", content)
    print("**************************************")
Answered By: snakecharmerb
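
For the second part of the question, the dataframe-like structure, one possible approach is to collect a dict per message instead of printing. This is only a sketch under the same assumptions as the answer above: the helper names body_text and mbox_to_rows are hypothetical, and the body rule (join every text part, skip everything else) may need adjusting for your data.

```python
import mailbox
from email.parser import BytesParser
from email.policy import default


def body_text(message):
    """Join all text/* parts of a message; non-text parts are skipped."""
    # walk() also yields the message itself, so this covers both
    # multipart and single-part messages.
    parts = [
        part.get_content()
        for part in message.walk()
        if part.get_content_maintype() == "text"
    ]
    return "\n\n".join(parts)


def mbox_to_rows(path):
    """Return one dict per message with date, from, to, subject and body."""
    box = mailbox.mbox(path, factory=BytesParser(policy=default).parse)
    rows = []
    try:
        for message in box:
            rows.append({
                # Missing headers come back as None, so fall back to "".
                "date": str(message["date"] or ""),
                "from": str(message["from"] or ""),
                "to": str(message["to"] or ""),
                "subject": str(message["subject"] or ""),
                "body": body_text(message),
            })
    finally:
        box.close()
    return rows
```

If pandas is available, passing the returned list to pandas.DataFrame should give the requested columns; the list can equally be written out with csv.DictWriter.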

@snakecharmerb, sorry, I am not allowed to comment on your answer.
From your code I used the line starting with mbox =, which works fine for me.
However, mypy is not happy with it:

    error: Argument "factory" to "mbox" has incompatible type "Callable[[BinaryIO, bool], Message]"; expected
"Optional[Callable[[IO[Any]], mboxMessage]]"  [arg-type]
        mbox = mailbox.mbox(mboxfile, factory=BytesParser(policy=default).parse)
                                              ^

Can you help me there? Thanks in advance.

Answered By: Lonerider
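
One possible workaround for the mypy complaint, offered as a sketch rather than a verified fix: the type stubs expect the factory to return an mboxMessage, while BytesParser(policy=default).parse returns an EmailMessage, so the mismatch can be hidden with typing.cast at the two places where the types disagree. The helper name iter_email_messages is hypothetical, and this has not been checked against any particular mypy or typeshed version.

```python
import mailbox
from email.message import EmailMessage
from email.parser import BytesParser
from email.policy import default
from typing import Any, Iterator, cast


def iter_email_messages(path: str) -> Iterator[EmailMessage]:
    """Yield the messages of an mbox file as EmailMessage objects."""
    parser = BytesParser(policy=default)
    # cast(Any, ...) hides the signature mismatch from the type checker;
    # it has no effect at runtime.
    box = mailbox.mbox(path, factory=cast(Any, parser.parse))
    try:
        for message in box:
            # The stubs describe the items as mboxMessage, but the factory
            # actually produced EmailMessage objects.
            yield cast(EmailMessage, message)
    finally:
        box.close()
```

A "# type: ignore[arg-type]" comment on the original line would silence the same error with less code, at the cost of leaving the loop variable typed as mboxMessage.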