Extract information from Gmail with Python

Question:

I have come through solutions to extract useful information from selected received emails in Gmail mailbox.

Aim in this example is to fetch all mails sent from a newsletter providing monthly prices for petroleum. You can freely subscribe to such a newsletter on EIA website. All such newsletter arrive in same folder in my gmail mailbox, and begin with “$”.

Content for emails is like that

enter image description here

and my objective is to write a script that fetch the 10 last such emails (last 10 months) and plot petroleum prices for the different US regions with respect to time.

Asked By: kiriloff

||

Answers:

Python email library will help.

import email, getpass, imaplib, os, re
import matplotlib.pyplot as plt

This directory is where you will save attachments

 detach_dir = "F:OTHERSCSPYTHONPROJECTS"  

Your script then asks user (or yourself) for account features

user = raw_input("Enter your GMail username --> ")
pwd = getpass.getpass("Enter your password --> ")

Connect then to the gmail imap server and login

m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)

Select one folder, you could use the whole INBOX instead

m.select("BUSINESS/PETROLEUM")    

One should use m.list() to get all the mailboxes. Search for all emails coming from specified sender and
select the mail ids:

resp, items = m.search(None, '(FROM "[email protected]")')
items = items[0].split()  

my_msg = [] # store relevant msgs here in please
msg_cnt = 0
break_ = False

I want the last emails, so that I am using items[::-1]

for emailid in items[::-1]:

    resp, data = m.fetch(emailid, "(RFC822)")

    if ( break_ ):
        break

    for response_part in data:

      if isinstance(response_part, tuple):
          msg = email.message_from_string(str(response_part[1]))
          varSubject = msg['subject']
          varDate = msg['date']

I want only the ones beginning with $

          if varSubject[0] == '$':
              r, d = m.fetch(emailid, "(UID BODY[TEXT])")

              ymd = email.utils.parsedate(varDate)[0:3]
              my_msg.append([ email.message_from_string(d[0][1]) , ymd ])

              msg_cnt += 1

I want only the N=100 last messages

              if ( msg_cnt == 100 ):
                  break_ = True

l = len(my_msg)
US, EastCst, NewEng, CenAtl, LwrAtl, Midwst, GulfCst, RkyMt, WCst, CA = 
[0]*l, [0]*l, [0]*l, [0]*l, [0]*l, [0]*l, [0]*l, [0]*l, [0]*l, [0]*l 
absc = [k for k in range(len(my_msg))]
dates = [str(msg[1][2])+'-'+str(msg[1][3])+'-'+str(msg[1][0]) for msg in my_msg]
cnt = -1

for msg in my_msg:

    data = str(msg[0]).split("n")
    cnt+=1
    for c in [k.split("r")[0] for k in data[2:-2]]: 

Use regular expressions to fetch relevant information

        m = re.match( r"(.+)(=3D$)(.+)" , c )  
        if( m == None ):
            continue 

        country, na, price = m.groups()

        if ( country == "US" or country == "USA" ) :
            US[cnt] = float(price)
        elif( country == "NewEng" ) :
            EastCst[cnt] = float(price)    
        elif( country == "EastCst" ) :
            NewEng[cnt] = float(price)  
        elif( country == "EastCst" ) :
            CenAtl[cnt] = float(price) 
        elif( country == "EastCst" ) :
            LwrAtl[cnt] = float(price)
        elif( country == "EastCst" ) :
            Midwst[cnt] = float(price)
        elif( country == "EastCst" ) :
            GulfCst[cnt] = float(price)
        elif( country == "EastCst" ) :
            RkyMt[cnt] = float(price)
        elif( country == "EastCst" ) :
            WCst[cnt] = float(price)
        elif( country == "EastCst" ) :
            CA[cnt] = float(price)

Plot all these curves with US prices

plt.plot( absc, US )

plt.plot( absc, EastCst )    
plt.plot( absc, NewEng, '#251BE0' )    
plt.plot( absc, EastCst, '#1BE0BF' )
plt.plot( absc, CenAtl, '#E0771B' )
plt.plot( absc, LwrAtl, '#CC1BE0' )
plt.plot( absc, Midwst, '#E01B8B' ) 
plt.plot( absc, GulfCst, '#E01B3F' )
plt.plot( absc, RkyMt )
plt.plot( absc, WCst )
plt.plot( absc, CA )

plt.legend( ('US', 'EastCst', 'NewEng' , 'EastCst', 'CenAtl', 'LwrAtl', 'Midwst', 'GulfCst', 'RkyMt', 'WCst', 'CA')  )
plt.title('Diesel price')
locs,labels = plt.xticks(absc, dates)
plt.show()

Some related interesting topics are here

Get only new emails

Fetch mail body

Forward emails with attachment

Fetch body emails in gmail

Results are here for three areas only

us prices

Answered By: kiriloff

Here is an example of how to do it with Red Box (disclaimer, I’m the author).

First, configure Gmail’s application password.

Second, get the messages:

from redbox import gmail

# Set credentials
gmail.username = "[email protected]"
gmail.password = "<PASSWORD>"

# Select an email folder
inbox = gmail["INBOX"]

# Search and process messages
msgs = inbox.search(from_="[email protected]")

Third, form the dataframe:

import pandas as pd

dfs = []
for msg in msgs:
    body = msg.text_body
    df = pd.DataFrame(
        line.split("=")
        for line in body.split("n),
        columns=["Region", "Price"]
    )
    df["Date"] = msg.date
    df["Price"] = df["Price"].str[1:].astype("float")
    dfs.append(df)

df = pd.concat(dfs)

Finally plot the dataframe:

df.pivot_table(
    index="Date", 
    columns="Region", 
    values="Price", 
    aggfunc="sum"
).plot()

To install:

pip install redbox pandas matplotlib

Links to Red Box:

Answered By: miksus
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.