cannot scrape ratings

Question:

My issue is that I cannot use bs4 to scrape the sub-ratings shown in Glassdoor reviews.
Below is an example:

So far, I have found where the stars are in the HTML, but their markup is identical regardless of the color (green or grey). I need to identify the color to work out the rating, not just scrape the stars. Below is my code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.glassdoor.com/Reviews/Walmart-Reviews-E715_P2.htm?filter.iso3Language=eng'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
com = soup.find(class_="ratingNumber mr-xsm")  # numeric overall rating of a review
com1 = soup.find(class_="gdReview")  # first review container
com1_1 = com1.find(class_="content")  # content inside that review
Asked By: Jaevapple


Answers:

The star-rating breakdown has no numeric display or meta value, so I don’t think there is a simple, direct way to read it: the fill is set by CSS in a style tag, and that style tag is linked to the stars’ container element through one of the container’s classes.
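To make that concrete, here is a minimal sketch that uses only the standard library. The markup in sample_html is a made-up, simplified stand-in: the grey color #dee0e3, the star characters and the helper name percent_for_class are assumptions for illustration, not taken from the real page. Only the data-emotion-css attribute and the "90deg,#0caa41 " prefix match what the code in this answer looks for.

```python
import re

# Hypothetical, simplified markup; the real page's rules are longer.
sample_html = '''
<style data-emotion-css="1nuumx7">.css-1nuumx7{background:linear-gradient(90deg,#0caa41 80%,#dee0e3 80%);}</style>
<div class="css-1nuumx7"><span>★</span><span>★</span><span>★</span><span>★</span><span>★</span></div>
'''

def percent_for_class(html, css_class):
    """Return the green-fill percentage declared for css_class, or None."""
    suffix = css_class.replace('css-', '', 1)
    # Stay inside the one style tag ([^<]) and capture the number before '%'
    pattern = (r'<style data-emotion-css="' + re.escape(suffix) + r'">'
               r'[^<]*?90deg,#0caa41 ([\d.]+)%')
    m = re.search(pattern, html)
    return float(m.group(1)) if m else None

pct = percent_for_class(sample_html, 'css-1nuumx7')
stars = pct / 100 * 5  # 80% of 5 stars
```

The star elements themselves carry no information here; the rating lives entirely in the percentage of the gradient, which is why the stars look identical in the HTML.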

You could use something like soup.select('style:-soup-contains(".css-1nuumx7")') (the css-1nuumx7 part is specific to the rating mentioned above), but :-soup-contains needs the html5lib parser and can be a bit slow, so it’s better to look up the style tag by its data-emotion-css attribute instead:

def getDECstars(starCont, mSoup, outOf=5, isv=False):
    # Pick the generated "css-..." class (e.g. "css-1nuumx7") off the stars container
    classList = starCont.get('class', [])
    if not isinstance(classList, list): classList = [classList]
    classList = [str(c) for c in classList if str(c).startswith('css-')]
    if not classList:
        if isv: print('Stars container has no "css-" class')
        return None

    # The matching style tag carries the class suffix in data-emotion-css
    demc = classList[0].replace('css-', '', 1)
    demc_sel = f'style[data-emotion-css="{demc}"]'
    cssStyle = mSoup.select_one(demc_sel)
    if not cssStyle:
        if isv: print(f'Nothing found with selector {demc_sel}')
        return None

    # The green (#0caa41) gradient stop holds the filled percentage
    cssStyle = cssStyle.get_text()
    errMsg = ''
    if '90deg,#0caa41 ' not in cssStyle: errMsg += 'No #0caa41'
    if '%' not in cssStyle.split('90deg,#0caa41 ', 1)[-1][:20]:
        errMsg += ' No %'
    if not errMsg:
        rPerc = cssStyle.split('90deg,#0caa41 ', 1)[-1]
        rPerc = rPerc.split('%')[0]
        try:
            rPerc = float(rPerc)
            if 0 <= rPerc <= 100:
                # Scale the percentage to stars and keep 3 significant digits
                if type(outOf) == int and outOf > 0: rPerc = (rPerc/100)*outOf
                return float(f'{float(rPerc):.3}')
            errMsg = f'{demc_sel} --> "{rPerc}" is out of range'
        except ValueError: errMsg = f'{demc_sel} --> cannot convert to float "{rPerc}"'
    if isv: print(f'{demc_sel} --> unexpected format {errMsg}')
    return None

OR, if you don’t care so much about why there’s a missing rating:

def getDECstars(starCont, mSoup, outOf=5, isv=False):
    try:
        demc = [c for c in starCont.get('class', []) if c[:4]=='css-'][0].replace('css-', '', 1)
        demc_sel = f'style[data-emotion-css="{demc}"]'
        rPerc = float(mSoup.select_one(demc_sel).get_text().split('90deg,#0caa41 ', 1)[1].split('%')[0])
        return float(f'{(rPerc/100)*outOf if type(outOf) == int and outOf > 0 else rPerc:.3}')
    except Exception: return None  # any lookup or parsing failure means "no rating"
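The last step in both versions is plain arithmetic: the percentage is scaled to outOf stars and then trimmed to 3 significant digits by the ':.3' format. A small standalone sketch of just that step (percent_to_stars is a hypothetical helper name used for illustration, not part of the functions above):

```python
def percent_to_stars(rPerc, outOf=5):
    # Same arithmetic as in getDECstars: scale, then keep 3 significant digits
    rPerc = float(rPerc)
    scaled = (rPerc / 100) * outOf if type(outOf) == int and outOf > 0 else rPerc
    return float(f'{scaled:.3}')

print(percent_to_stars(80))              # 4.0
print(percent_to_stars(66.6667))         # 3.33
print(percent_to_stars(50, outOf=None))  # 50.0 (raw percentage, no scaling)
```

Passing anything other than a positive int as outOf returns the percentage itself, which matches the check in the functions above.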

Here’s an example of how you might use it:

pcCon = 'div.px-std:has(h2 > a.reviewLink) + div.px-std'
pcDiv = f'{pcCon} div.v2__EIReviewDetailsV2__fullWidth'
refDict = {
    'rating_num': 'span.ratingNumber',
    'emp_status': 'div:has(> div > span.ratingNumber) + span',
    'header': 'h2 > a.reviewLink',
    'subheader': 'h2:has(> a.reviewLink) + span',
    'pros': f'{pcDiv}:first-of-type > p.pb',
    'cons': f'{pcDiv}:nth-of-type(2) > p.pb'
}

subRatSel = 'div:has(> .ratingNumber) ~ aside ul > li:has(div ~ div)'
empRevs = []
for r in soup.select('li[id^="empReview_"]'):
    rDet = {'reviewId': r.get('id')}
    for sr in r.select(subRatSel):
        k = sr.select_one('div:first-of-type').get_text(' ').strip()
        sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
        rDet[f'[rating] {k}'] = sval
    
    for k, sel in refDict.items():
        sval = r.select_one(sel)
        if sval: sval = sval.get_text(' ').strip()
        rDet[k] = sval
    
    empRevs.append(rDet)

If empRevs is viewed as a table:

| reviewId | [rating] Work/Life Balance | [rating] Culture & Values | [rating] Diversity & Inclusion | [rating] Career Opportunities | [rating] Compensation and Benefits | [rating] Senior Management | rating_num | emp_status | header | subheader | pros | cons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| empReview_71400593 | 5 | 4 | 4 | 4 | 5 | 3 | 3 | | great pay but bit of obnoxious enviornment | Nov 26, 2022 – Sales Associate/Cashier in Bensalem, PA | -Walmart’s fair pay policy is … | -some locations wont build emp… |
| empReview_70963705 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | Former Employee | Walmart Employees Trained Thrown to the Wolves | Nov 10, 2022 – Data Entry | Getting a snack at break was e… | I worked at Walmart for a very… |
| empReview_71415031 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | Current Employee, more than 1 year | Work | Nov 27, 2022 – Warehouse Associate in Springfield, GA | The money there is good during… | It can get stressful at times … |
| empReview_69136451 | nan | nan | nan | nan | nan | nan | 4 | Current Employee | Walmart | Sep 16, 2022 – Sales Associate/Cashier | I’m a EXPERIENCED WORKER. I ✨… | In my opinion I believe that W… |
| empReview_71398525 | 4 | 3 | 4 | 3 | 4 | 3 | 4 | Current Employee | Depends heavily on your team | Nov 26, 2022 – Personal Digital Shopper | I have a generally excellent t… | Generally, departments are sho… |
| empReview_71227029 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | Former Employee, less than 1 year | Managers are treated like a slave. | Nov 19, 2022 – Auto Care Center Manager (ACCM) in Cottonwood, AZ | Great if you like working with… | you only get to work in your a… |
| empReview_71329467 | 1 | 3 | 3 | 3 | 4 | 1 | 1 | Current Employee, more than 3 years | No more values | Nov 23, 2022 – GM Coach in Houston, TX | Pay compare to other retails a… | Walmart is not a bad company t… |
| empReview_71512609 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | Former Employee | Walmart midnight stocker | Nov 30, 2022 – Midnight Stocker in Taylor, MI | 2 paid 15 min breaks and 1 hou… | Honestly nothing that I can th… |
| empReview_70585957 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | Former Employee | Lots of Opportunity | Oct 28, 2022 – Human Resources People Lead | Plenty of opportunities if one… | As with any job, management is… |
| empReview_71519435 | 3 | 4 | 4 | 5 | 4 | 4 | 5 | Current Employee, more than 3 years | Lot of work but worth it | Nov 30, 2022 – People Lead | I enjoy making associates live… | Sometimes an overwhelming amou… |

Markdown for the table above was printed with pandas:

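import pandas  # to_markdown() also requires the tabulate package

erdf = pandas.DataFrame(empRevs).set_index('reviewId')
# Shorten long pros/cons so the table stays readable
erdf['pros'] = [p[:30] + '...' if len(p) > 33 else p for p in erdf['pros']]
erdf['cons'] = [p[:30] + '...' if len(p) > 33 else p for p in erdf['cons']]
print(erdf.to_markdown())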
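One caveat with the truncation above: len(p) raises a TypeError if a review has no pros or cons, because the loop stores None when select_one finds nothing. A possible guard, sketched with a hypothetical helper name (shorten) and assuming you want missing values passed through unchanged:

```python
def shorten(text, keep=30, slack=3):
    """Cut text to `keep` characters plus '...' when it is longer than keep + slack."""
    if not isinstance(text, str):
        return text  # leave None / NaN untouched
    return text[:keep] + '...' if len(text) > keep + slack else text

print(shorten('x' * 40))  # 30 x's followed by '...'
print(shorten(None))      # None
```

It could then replace the inline expression, for example [shorten(p) for p in erdf['pros']].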
Answered By: Driftr95