Using threading to make web requests/scrape data; it seems the list storing results is being overwritten somewhere

Question:

I’m trying to scrape data from yellowpages.com. It keeps a list of the cities in a given state that start with a given letter at the url https://www.yellowpages.com/state-<state-abbreviation>?page=<letter>, so all cities in New York starting with the letter ‘c’ would be at https://www.yellowpages.com/state-ny?page=c, for example.

Ultimately, I’m trying to write every city, state combo to a variable, locations, and then to a file. When I initially went to do this, I just built the list of urls, looped over it, and sent one request at a time. This was taking forever, so I discovered threading and am trying to implement that.

When I run this program, the logging code I added shows it making a request to all 1,326 pages (51 entries in my states dict, including DC, times 26 letters), but only the last state in my states variable, Wyoming, gets written to the file. It will write cities A-Z for the state of Wyoming to a file, but nothing else.

My code:

import concurrent.futures
import logging
import threading
import time

import requests
from bs4 import BeautifulSoup

thread_local = threading.local()

def get_session():
    """Give each thread its own requests.Session (sessions are not thread-safe)."""
    if not hasattr(thread_local, 'session'):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(url):
    """ Make request to url and scrape data using bs4"""
    session = get_session()
    with session.get(url) as response:
        logging.info(f"Read {len(response.content)} from {url}")
        scrape_data(response)

def download_all_sites(urls):
    """ call download_site() on list of urls"""
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        executor.map(download_site, urls)


def scrape_data(response):
    """uses bs4 to get city, state combo from yellowpages html and appends to global locations list"""
    soup = BeautifulSoup(response.text, 'html.parser')
    ul_elements = soup.find_all('ul')
    for ul_element in ul_elements:
        anchor_elements = ul_element.find_all('a')
        for element in anchor_elements:
            locations.append(element.text + ',' + state_abbrieviated)

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

    urls = [] # will hold yellowpages urls
    locations = [] # will hold scraped 'city, state' combinations,  modified by scrape_data() function 

    states = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
    }
    letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o',
               'p','q','r','s','t','u','v','w','x','y','z']

    # build list of urls that need to be scraped
    for state_abbrieviated, state_full in states.items():
        for letter in letters:
            url = f'https://www.yellowpages.com/state-{state_abbrieviated}?page={letter}'
            urls.append(url)

    # scrape data
    start = time.time()
    download_all_sites(urls)
    duration = time.time() - start
    logging.info(f"\tSent/Retrieved {len(urls)} requests/responses in {duration} seconds")

    # write data to file
    with open('locations.txt', 'w') as file:
        for location in locations:
            file.write(location + '\n')

So given that only the last state gets written to the file, it seems my locations list variable is being overwritten every time the code moves to scrape data for a new state?

The title to this question is vague because I’ve stared at and thought about this for a while now and I’m still not sure where the problem is; I don’t know what I don’t know. I’m not sure if this is an issue with threading or if I messed up somewhere else. Anyway, if anyone looks at this and can spot the problem, thank you very much!

Asked By: Justin


Answers:

Opening a file in write mode deletes its contents. Change the flag to ‘a’ or ‘a+’. https://stackoverflow.com/a/1466036/9256726
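
For illustration, a minimal sketch of the mode difference the linked answer describes (hypothetical file name):

with open('locations.txt', 'w') as f:   # 'w' truncates: any existing content is deleted
    f.write('first run\n')

with open('locations.txt', 'a') as f:   # 'a' appends: existing content is preserved
    f.write('second run\n')

# locations.txt now holds both lines; had the second open also used 'w',
# only 'second run' would remain.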

My opinion: try saving each state to a separate file and merging them after all pages are successfully scraped.
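
A rough sketch of what that suggestion might look like (hypothetical helper names; assumes each state’s results are collected separately rather than in one shared list):

import glob

def write_state_file(state_abbr, cities):
    """Write one state's scraped cities to its own file."""
    with open(f'locations_{state_abbr}.txt', 'w') as f:
        for city in cities:
            f.write(f'{city},{state_abbr}\n')

def merge_state_files(out_path='locations.txt'):
    """Concatenate the per-state files into the final output."""
    with open(out_path, 'w') as out:
        for path in sorted(glob.glob('locations_*.txt')):
            with open(path) as f:
                out.write(f.read())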

Also, try multiprocessing instead of threading, because threading doesn’t solve performance issues here (threads share the global interpreter lock). There is a good video on why, and how you can do it: https://www.youtube.com/watch?v=X7vBbelRXn0
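
A rough sketch of the multiprocessing route with concurrent.futures.ProcessPoolExecutor (hypothetical fetch_and_parse worker; note that each task has to carry its own (url, state) pair, because worker processes do not see the main process’s globals):

import concurrent.futures
import requests
from bs4 import BeautifulSoup

def fetch_and_parse(task):
    """Download one page and return its 'city,state' strings.
    Must be a top-level function so it can be pickled for worker processes."""
    url, state_abbr = task
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [f'{a.text},{state_abbr}'
            for ul in soup.find_all('ul')
            for a in ul.find_all('a')]

def download_all_sites(tasks):
    """Fan the (url, state) tasks out to worker processes and collect results."""
    locations = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for page in executor.map(fetch_and_parse, tasks):
            locations.extend(page)
    return locations

Returning results from each worker and collecting them in the parent, rather than appending to a shared global, also sidesteps the shared-state problem entirely.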

Answered By: kpostekk