Scraping E-mails from Websites

Question:

I have tried several iterations from other posts and nothing seems to be helping or working for my needs.

I have a list of URLs that I want to loop through, pulling all associated URLs that contain email addresses. I then want to store the URLs and email addresses in a CSV file.

For example, if I went to 10torr.com, the program should find each of the pages under the main URL (e.g. 10torr.com/about) and pull any emails.

Below is a list of 5 example websites that are currently in a data frame format when run through my code. They are saved under the variable small_site.

A helpful answer will use the user-defined function listed below, get_info(). Hard-coding the websites into the spider itself is not a feasible option, as this will be used by many other people with different website list lengths.

    Website
    http://10torr.com/
    https://www.10000drops.com/
    https://www.11wells.com/
    https://117westspirits.com/
    https://www.onpointdistillery.com/
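
For reference, small_site ends up looking roughly like the following when built by hand (URLs taken from the list above; in my actual code it comes from site.head()):

import pandas as pd

# Rough stand-in for small_site, built by hand from the list above
small_site = pd.DataFrame({
    'Website': [
        'http://10torr.com/',
        'https://www.10000drops.com/',
        'https://www.11wells.com/',
        'https://117westspirits.com/',
        'https://www.onpointdistillery.com/',
    ]
})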

Below is the code that I am running. The spider seems to run, but there is no output in my CSV file.


import os
import pandas as pd
import re
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# site is the full DataFrame of websites loaded earlier (not shown); head() keeps the first five rows
small_site = site.head()


#%% Start Spider
class MailSpider(scrapy.Spider):

    name = 'email'

    def parse(self, response):

        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 

    def parse_link(self, response):

        for word in self.reject:
            if word in str(response.url):
                return

        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)

        df.to_csv(self.path, mode='a', header=False)


#%% Preps a CSV File
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False


def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return 

    with open(path, 'wb') as file: 
        file.close()


#%% Defines function that will extract emails and enter it into CSV
def get_info(url_list, path, reject=[]):

    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)


    print('Collecting Google urls...')
    google_urls = url_list


    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.start() 

    for i in small_site.Website.iteritems():
        print('Searching for emails...')
        process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
        ##process.start()

        print('Cleaning emails...')
        df = pd.read_csv(path, index_col=0)
        df.columns = ['email', 'link']
        df = df.drop_duplicates(subset='email')
        df = df.reset_index(drop=True)
        df.to_csv(path, mode='w', header=True)


    return df


url_list = small_site
path = 'email.csv'

df = get_info(url_list, path)

I am not certain where I am going wrong, as I am not getting any error messages. If you need additional information, please just ask. I have been trying to get this working for almost a month now, and I feel like I am banging my head against the wall at this point.

The majority of this code came from the article Web scraping to extract contact information— Part 1: Mailing Lists, which I found after a few weeks of searching. However, I have not been successful in expanding it to my needs. It worked without a problem for one-offs while using their Google search function to get the base URLs.

Thank you in advance for any assistance you are able to provide.

Asked By: Chris


Answers:

I modified the script and ran the following spider via the shell, and it works. Maybe it will provide you with a starting point.

I advise you to use the shell, as it surfaces errors and other messages during the scraping process.


import re
import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class MailSpider(scrapy.Spider):

    name = 'email'
    start_urls = [
        'http://10torr.com/',
        'https://www.10000drops.com/',
        'https://www.11wells.com/',
        'https://117westspirits.com/',
        'https://www.onpointdistillery.com/',
    ]

    def parse(self, response):
        self.log('A response from %s just arrived!' % response.url)
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))

        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 

    def parse_link(self, response):

        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)

        dic = {'email': mail_list, 'link': str(response.url)}

        for key in dic.keys():
            yield {
                'email' : dic['email'],
                'link': dic['link'],
            }

This yields the following output when the spider is run from the Anaconda shell with scrapy crawl email -o test.jl:

{"email": ["[email protected]"], "link": "https://117westspirits.com/"}
{"email": ["[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]"], "link": "https://www.11wells.com"}
{"email": ["[email protected]"], "link": "https://117westspirits.com/shop?olsPage=search&keywords="}
{"email": ["[email protected]"], "link": "https://117westspirits.com/shop?olsPage=search&keywords="}
{"email": ["[email protected]"], "link": "https://117westspirits.com/shop"}
{"email": ["[email protected]"], "link": "https://117westspirits.com/shop?olsPage=cart"}
{"email": ["[email protected]"], "link": "https://117westspirits.com/home"}
{"email": ["[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]", "[email protected]"], "link": "https://www.11wells.com"}
{"email": ["[email protected]"], "link": "https://117westspirits.com/home"}
{"email": ["[email protected]"], "link": "https://117westspirits.com/117%C2%B0-west-spirits-1"}
...
...
...

Refer to the Scrapy docs for more information.
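
If you would rather launch the same spider from a Python script instead of the shell, a sketch along the following lines should write the same test.jl feed. This assumes Scrapy 2.1+ for the FEEDS setting (older releases use FEED_URI and FEED_FORMAT instead):

from scrapy.crawler import CrawlerProcess

# Sketch: run MailSpider from a script and export the yielded items as JSON Lines.
# Assumes Scrapy 2.1+ for the FEEDS setting.
process = CrawlerProcess(settings={
    'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'test.jl': {'format': 'jsonlines'}},
})
process.crawl(MailSpider)   # start_urls are defined on the spider class above
process.start()             # blocks until the crawl finishes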

Answered By: Nava Bogatee

It took a while, but the answer finally came to me. The following is how the final solution came together. It works with a changing website list, as the original question required.

The change ended up being very minor. I needed to add the following user defined function.

def get_urls(io, sheet_name):
    data = pd.read_excel(io, sheet_name)
    urls = data['Website'].to_list()
    return urls
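
If the sites are already sitting in a DataFrame (as small_site was in the question) rather than an Excel workbook, a small variant of this helper can skip the read_excel step. This is only a sketch; the column name 'Website' is taken from the question:

def get_urls_from_df(data, column='Website'):
    # Pull the URL column out of an in-memory DataFrame instead of an Excel file
    return data[column].to_list()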

From there, it was a simple change to the get_info() user-defined function. We needed to set google_urls inside that function by calling the new get_urls() function and using the list it returns. The full code for this function is below.

def get_info(io, sheet_name, path, reject=[]):
    
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)
    
    print('Collecting Google urls...')
    google_urls = get_urls(io, sheet_name)
    
    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    # crawl() is registered first; start() then blocks until the crawl finishes
    process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
    process.start()
    
    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    
    return df
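
As a usage example, assuming the website list lives in an Excel workbook (the file name sites.xlsx and sheet name Sheet1 below are placeholders for your own workbook):

# Placeholder file and sheet names; substitute your own workbook
io = 'sites.xlsx'
sheet_name = 'Sheet1'
path = 'email.csv'

df = get_info(io, sheet_name, path)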

No other changes were needed to get this to run. Hopefully this helps.

Answered By: Chris