Read a big .mbox file with Python

Question:

I’d like to read a big 3GB .mbox file coming from a Gmail backup. This works:

import mailbox
mbox = mailbox.mbox(r"D:\All mail Including Spam and Trash.mbox")
for i, message in enumerate(mbox):
    print("from   :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
        content = ''.join(part.get_payload(decode=True) for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")

    if i == 10:
        break

except it takes more than 40 seconds for the first 10 messages only.

Is there a faster way to access a big .mbox file with Python?

Asked By: Basj

Answers:

Here’s a quick and dirty attempt to implement a generator to read in an mbox file message by message. I have opted to simply ditch the information from the From separator; I’m guessing maybe the real mailbox library might provide more information, and of course, this only supports reading, not searching or writing back to the input file.

#!/usr/bin/env python3

import email
from email.policy import default

class MboxReader:
    def __init__(self, filename):
        self.handle = open(filename, 'rb')
        assert self.handle.readline().startswith(b'From ')

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.handle.close()

    def __iter__(self):
        # Generator: yields one parsed message per "From " separator line.
        lines = []
        while True:
            line = self.handle.readline()
            if line == b'' or line.startswith(b'From '):
                yield email.message_from_bytes(b''.join(lines), policy=default)
                if line == b'':
                    break
                lines = []
                continue
            lines.append(line)

Usage:

with MboxReader(mboxfilename) as mbox:
    for message in mbox:
        print(message.as_string())

The policy=default argument (or any policy instead of default if you prefer, of course) selects the modern EmailMessage API which was introduced in Python 3.3 and became official in 3.6. If you need to support older Python versions from simpler times, you will want to omit it; but really, the new API is better in many ways.
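As a rough illustration of what the new API buys you, here is a minimal sketch that parses a made-up message (the addresses and text are placeholders, not taken from any real mailbox) with policy=default and extracts the plain-text body via get_body() and get_content(). This stands in for the manual is_multipart() / get_payload(decode=True) handling in the question. The helper name body_text is my own invention, not part of the email package.

```python
import email
from email.policy import default

# A made-up multipart message, used only for illustration.
RAW = (
    b"From: Alice <alice@example.com>\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Hello\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"\r\n"
    b"Plain text body\r\n"
    b"--XYZ\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"\r\n"
    b"<p>HTML body</p>\r\n"
    b"--XYZ--\r\n"
)


def body_text(message):
    """Return the plain-text body of an EmailMessage, or '' if there is none."""
    # get_body() walks the MIME structure and picks the best matching part.
    part = message.get_body(preferencelist=("plain",))
    # get_content() decodes the transfer encoding and charset, returning str.
    return part.get_content() if part is not None else ""


message = email.message_from_bytes(RAW, policy=default)
print(type(message).__name__)  # EmailMessage, because of policy=default
print(message["subject"])
print(body_text(message))
```

The same body_text(message) call should work on each message yielded by MboxReader, since those are parsed with the same policy.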

Answered By: tripleee

Using the MboxReader class from the answer above, you can use any of the header keys to get specific information from each message. You can then write it to a text file for further analysis of your mailbox.

from tqdm import tqdm

path = "your_gmail.mbox"

# Open the mailbox and the output file together so both are closed afterwards.
with MboxReader(path) as mbox, open('Output.txt', 'w', encoding="utf-8") as file:
    for idx, message in tqdm(enumerate(mbox)):
        # print(message.keys())  # uncomment to list the headers of each message
        mail_from = f"{message['From']}\n".replace('"', '')
        file.write(mail_from)
        print(idx, message['From'])

The following keys, for example, are available (the exact set depends on the message); they are listed here for reference:

['X-GM-THRID', 'X-Gmail-Labels', 'Delivered-To', 'Received', 'X-Received',
 'ARC-Seal', 'ARC-Message-Signature', 'ARC-Authentication-Results',
 'Return-Path', 'Received', 'Received-SPF', 'Authentication-Results',
 'DKIM-Signature', 'X-Google-DKIM-Signature', 'X-Gm-Message-State',
 'X-Google-Smtp-Source', 'MIME-Version', 'X-Received', 'Date', 'Reply-To',
 'X-Google-Id', 'Precedence', 'List-Unsubscribe', 'Feedback-ID', 'List-Id',
 'X-Notifications', 'X-Notifications-Bounce-Info', 'Message-ID', 'Subject',
 'From', 'To', 'Content-Type']
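
Note that 'Received' and 'X-Received' each appear twice in the list: a header can occur several times in one message. message['Received'] returns only one of them, while message.get_all('Received') returns all of them. The sketch below shows this on a made-up message, and also uses email.utils.parseaddr to reduce a From header to a bare address that can be counted. The sample data and the helper name sender_address are illustrative assumptions, not part of the answer above.

```python
import email
from collections import Counter
from email.policy import default
from email.utils import parseaddr

# A made-up message with a repeated header, used only for illustration.
RAW = (
    b"Received: by relay-2.example.net; Mon, 1 Jan 2024 10:00:02 +0000\r\n"
    b"Received: by relay-1.example.net; Mon, 1 Jan 2024 10:00:01 +0000\r\n"
    b'From: "Alice Example" <Alice@Example.com>\r\n'
    b"Subject: Hello\r\n"
    b"\r\n"
    b"Body\r\n"
)


def sender_address(message):
    """Return the lower-cased address from the From header, or '' if absent."""
    _name, address = parseaddr(str(message["From"] or ""))
    return address.lower()


message = email.message_from_bytes(RAW, policy=default)

# get_all() returns every occurrence of a repeated header as a list.
hops = message.get_all("Received", [])
print(len(hops))

# Tally senders; in real use this would run inside the loop over the mailbox.
counts = Counter()
counts[sender_address(message)] += 1
print(counts.most_common(1))
```

In the loop over MboxReader, updating the Counter once per message would give a per-sender message count for the whole mailbox.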

Hope it was useful 🙂

Answered By: Vinay Verma