Scraping all the reviews of a firm from Glassdoor

Question:

My goal is to scrape all the reviews of this firm. I tried adapting @Driftr95's code:

import requests
from bs4 import BeautifulSoup

def extract(pg): 
    headers = {'user-agent': 'Mozilla/5.0'}
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    # f'https://www.glassdoor.com/Reviews/Google-Engineering-Reviews-EI_IE9079.0,6_DEPT1007_IP{pg}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'

    r = requests.get(url, headers=headers)  # headers must be passed as a keyword argument
    soup = BeautifulSoup(r.content, 'html.parser')  # returns a soup of the whole page html
    return soup

# subRatSel, getDECstars, refDict and empRevs are from @Driftr95's code
for j in range(1,21,10):
    for i in range(j+1,j+11,1): # 3M: 4251 reviews
        soup = extract(i)  # extract() builds the url from the page number
        print(f' page {i}')
        for r in soup.select('li[id^="empReview_"]'):
            rDet = {'reviewId': r.get('id')}
            for sr in r.select(subRatSel):
                k = sr.select_one('div:first-of-type').get_text(' ').strip()
                sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
                rDet[f'[rating] {k}'] = sval
    
            for k, sel in refDict.items():
                sval = r.select_one(sel)
                if sval: sval = sval.get_text(' ').strip()
                rDet[k] = sval
    
            empRevs.append(rDet) 

In cases where not all of the subratings are available, all four subratings turn out to be N.A.

Asked By: Jaevapple

||

Answers:

"All four subratings will turn out to be N.A."

There were some things that I didn't account for because I hadn't encountered them before, but the updated version of getDECstars shouldn't have that issue. (If you use the longer version with the argument isv=True, it's easier to debug and figure out what's missing from the code…)
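To make the idea concrete, the defensive shape is roughly like the sketch below. This is not the actual getDECstars from the pastebin (which works out the decimal star value from the page markup); it's just an illustration of the pattern that keeps one missing subrating from turning all four into N.A.:

def get_subrating_stars(cell, soup, isv=False):
    # hypothetical stand-in for getDECstars; 'soup' is kept only to mirror its call signature
    if cell is None:                       # this particular subrating isn't present
        if isv: print('no subrating value cell found')
        return 'N.A.'
    txt = cell.get_text(' ').strip()
    try:
        return float(txt)                  # e.g. when the value is plain text like "4.0"
    except ValueError:
        if isv: print(f'could not parse a star value from {txt!r}')
        return 'N.A.'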


"I scraped 200 reviews in this case, and it turned out that only 170 of them were unique reviews."

Duplicates are fairly easy to avoid by maintaining a list of reviewIds that have already been added and checking against it before adding a new review to empRevs. Spliced into the loops from your code, that looks like this:

scrapedIds, empRevs = [], []
# subRatSel, getDECstars and refDict as before
for j in range(1,21,10):
    for i in range(j+1,j+11,1):
        soup = extract(i)
        print(f' page {i}')
        for r in soup.select('li[id^="empReview_"]'):
            if r.get('id') in scrapedIds: continue # skip duplicates
            rDet = {'reviewId': r.get('id')}
            for sr in r.select(subRatSel):
                k = sr.select_one('div:first-of-type').get_text(' ').strip()
                rDet[f'[rating] {k}'] = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
            for k, sel in refDict.items():
                sval = r.select_one(sel)
                rDet[k] = sval.get_text(' ').strip() if sval else sval
            empRevs.append(rDet)
            scrapedIds.append(rDet['reviewId']) # add to the list of ids to check against

"Https tends to time out after 100 rounds…"

You could try adding breaks and switching out user-agents every 50 [or 5 or 10 or…] requests, but I'm quick to resort to selenium at times like this. My suggested solution is below; just call it like this, passing a url to start with:

## PASTE [OR DOWNLOAD&IMPORT] from https://pastebin.com/RsFHWNnt ##

startUrl = 'https://www.glassdoor.com/Reviews/3M-Reviews-E446.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
scrape_gdRevs(startUrl, 'empRevs_3M.csv', maxScrapes=1000, constBreak=False)

[last 3 lines of] printed output:

 total reviews:  4252
total reviews scraped this run: 4252
total reviews scraped over all time: 4252

It clicks through the pages until it reaches the last page (or maxes out maxScrapes). You do have to log in at the beginning, though, so either fill out login_to_gd with your username and password, or log in manually by replacing the login_to_gd(driverG) line with the input(...) line that waits for you to log in [then press ENTER in the terminal] before continuing.
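(To make that swap concrete, it amounts to something like the two lines below; the exact prompt text in the pastebin may differ.)

# login_to_gd(driverG)  ## comment out the scripted login
input('Log in to Glassdoor in the selenium browser window, then press ENTER here to continue...')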

I think cookies could also be used instead (with requests), but I'm not good at handling them. If you figure that out, you can use some version of linkToSoup or your extract(pg); you'll then have to comment out or remove the lines ending in ## for selenium and uncomment [or follow the instructions on] the lines that end with ## without selenium. [But please note that I've only fully tested the selenium version.]
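If you do go the requests route, a rough, untested sketch of what I mean by breaks, rotating user-agents and sending cookies could look like this (the cookie name/value pairs are placeholders; copy the real ones from your browser's devtools after logging in):

import random
import time
import requests
from bs4 import BeautifulSoup

USER_AGENTS = [  # a few user-agent strings to rotate through
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]
COOKIES = {'example_cookie_name': 'VALUE_FROM_YOUR_BROWSER'}  # placeholder, not a real Glassdoor cookie

reqCount = 0
def extract(pg):
    global reqCount
    reqCount += 1
    if reqCount % 50 == 0:                        # take a break every 50 requests
        time.sleep(random.uniform(10, 30))
    headers = {'user-agent': random.choice(USER_AGENTS)}
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    r = requests.get(url, headers=headers, cookies=COOKIES, timeout=30)
    return BeautifulSoup(r.content, 'html.parser')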

The CSVs [like "empRevs_3M.csv" and "scrapeLogs_empRevs_3M.csv" in this example] are updated after every page scrape, so even if the program crashes [or you decide to interrupt it], everything up to the previous scrape will have been saved. Since it also tries to load from the CSVs at the beginning, you can just continue later: set startUrl to the url of the page you want to continue from. Even if that's page 1, duplicates will be ignored, so it's okay; it'll just waste some time.
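If you're adapting the requests version and want the same resume behavior, the gist of "loads from the CSVs at the beginning" is just to seed scrapedIds (and empRevs) from the saved file before scraping, and to rewrite the file after every page. A rough sketch with pandas (assuming the CSV has the reviewId column that rDet uses):

import os
import pandas as pd

csvPath = 'empRevs_3M.csv'
scrapedIds, empRevs = [], []
if os.path.isfile(csvPath):                          # pick up where the last run left off
    prev = pd.read_csv(csvPath)
    empRevs = prev.to_dict('records')                # previously scraped reviews
    scrapedIds = prev['reviewId'].astype(str).tolist()

# ... scrape as above, skipping ids already in scrapedIds ...

pd.DataFrame(empRevs).to_csv(csvPath, index=False)   # rewrite after each page so a crash loses at most one page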

Answered By: Driftr95