cannot scrape ratings
Question:
My issue is that I cannot use bs4 to scrape the sub-ratings in Glassdoor reviews.
Below is an example:
So far, I have discovered where these stars are, but their markup is the same regardless of the color (i.e., green or grey). I need to be able to identify the color to work out the ratings, not just scrape the stars. Below is my code:
import requests
from bs4 import BeautifulSoup

url='https://www.glassdoor.com/Reviews/Walmart-Reviews-E715_P2.htm?filter.iso3Language=eng'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
com = soup.find(class_ = "ratingNumber mr-xsm")
com1 = soup.find(class_ = "gdReview")
com1_1 = com1.find(class_ = "content")
Answers:
For getting the star rating breakdown (which seems to have no numeric display or meta value), I don’t think there’s any simple, straightforward short method, since it’s done by CSS in a style tag connected by a class of the container element.
You could use something like soup.select('style:-soup-contains(".css-1nuumx7")') [the css-1nuumx7 part is specific to the rating mentioned above], but :-soup-contains needs the html5lib parser and can be a bit slow, so it’s better to figure out the data-emotion-css attribute of the style tag instead:
def getDECstars(starCont, mSoup, outOf=5, isv=False):
    # find the "css-..." class on the stars container
    classList = starCont.get('class', [])
    if type(classList) != list: classList = [classList]
    classList = [str(c) for c in classList if str(c).startswith('css-')]
    if not classList:
        if isv: print('Stars container has no "css-" class')
        return None

    # look up the style tag whose data-emotion-css matches that class
    demc = classList[0].replace('css-', '', 1)
    demc_sel = f'style[data-emotion-css="{demc}"]'
    cssStyle = mSoup.select_one(demc_sel)
    if not cssStyle:
        if isv: print(f'Nothing found with selector {demc_sel}')
        return None

    # the green (#0caa41) stop of the gradient holds the rating as a percentage
    cssStyle = cssStyle.get_text()
    errMsg = ''
    if '90deg,#0caa41 ' not in cssStyle: errMsg += 'No #0caa41'
    if '%' not in cssStyle.split('90deg,#0caa41 ', 1)[-1][:20]:
        errMsg += ' No %'
    if not errMsg:
        rPerc = cssStyle.split('90deg,#0caa41 ', 1)[-1]
        rPerc = rPerc.split('%')[0]
        try:
            rPerc = float(rPerc)
            if 0 <= rPerc <= 100:
                # scale the percentage to stars unless outOf is disabled
                if type(outOf) == int and outOf > 0: rPerc = (rPerc/100)*outOf
                return float(f'{float(rPerc):.3}')
            errMsg = f'{demc_sel} --> "{rPerc}" is out of range'
        except: errMsg = f'{demc_sel} --> cannot convert to float "{rPerc}"'
    if isv: print(f'{demc_sel} --> unexpected format {errMsg}')
    return None
OR, if you don’t care so much about why there’s a missing rating:
def getDECstars(starCont, mSoup, outOf=5, isv=False):
    try:
        demc = [c for c in starCont.get('class', []) if c[:4]=='css-'][0].replace('css-', '', 1)
        demc_sel = f'style[data-emotion-css="{demc}"]'
        rPerc = float(mSoup.select_one(demc_sel).get_text().split('90deg,#0caa41 ', 1)[1].split('%')[0])
        return float(f'{(rPerc/100)*outOf if type(outOf) == int and outOf > 0 else rPerc:.3}')
    except: return None
Here’s an example of how you might use it:
pcCon = 'div.px-std:has(h2 > a.reviewLink) + div.px-std'
pcDiv = f'{pcCon} div.v2__EIReviewDetailsV2__fullWidth'
refDict = {
    'rating_num': 'span.ratingNumber',
    'emp_status': 'div:has(> div > span.ratingNumber) + span',
    'header': 'h2 > a.reviewLink',
    'subheader': 'h2:has(> a.reviewLink) + span',
    'pros': f'{pcDiv}:first-of-type > p.pb',
    'cons': f'{pcDiv}:nth-of-type(2) > p.pb'
}

subRatSel = 'div:has(> .ratingNumber) ~ aside ul > li:has(div ~ div)'
empRevs = []
for r in soup.select('li[id^="empReview_"]'):
    rDet = {'reviewId': r.get('id')}
    for sr in r.select(subRatSel):
        k = sr.select_one('div:first-of-type').get_text(' ').strip()
        sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
        rDet[f'[rating] {k}'] = sval

    for k, sel in refDict.items():
        sval = r.select_one(sel)
        if sval: sval = sval.get_text(' ').strip()
        rDet[k] = sval

    empRevs.append(rDet)
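To make the core step of getDECstars easier to follow, here is a minimal standalone sketch of it: turning a "css-..." class into the style selector and reading the green stop of the gradient as a rating. It uses only the standard library. SAMPLE_CSS, the extra class name e1hd5jg10, and the helper names selector_for and stars_from_css are made up for illustration, and the exact shape of the CSS rule is an assumption based on the '90deg,#0caa41 ' marker used above, so the real page may differ.

```python
import re

# Assumed shape of the rule inside the matching style tag (hypothetical sample).
SAMPLE_CSS = ".css-1nuumx7{background:linear-gradient(90deg,#0caa41 80%,#dee0e3 80%);}"

# Captures the number right after the green colour stop, e.g. "80" in "#0caa41 80%".
GREEN_STOP = re.compile(r"90deg,\s*#0caa41\s+(\d+(?:\.\d+)?)%")


def selector_for(class_list):
    """Build the style-tag selector from the first 'css-' class, or return None."""
    for name in class_list:
        if name.startswith("css-"):
            return f'style[data-emotion-css="{name[4:]}"]'
    return None


def stars_from_css(css_text, out_of=5):
    """Convert the green percentage in css_text to a rating out of out_of."""
    match = GREEN_STOP.search(css_text)
    if match is None:
        return None  # no green stop found, so no rating can be read
    percent = float(match.group(1))
    if not 0 <= percent <= 100:
        return None  # out-of-range values are treated as missing
    return round(percent / 100 * out_of, 2)


print(selector_for(["css-1nuumx7", "e1hd5jg10"]))  # style[data-emotion-css="1nuumx7"]
print(stars_from_css(SAMPLE_CSS))                  # 4.0
```

A regex is used here instead of the chained split calls only because it fails more quietly on unexpected input; the split version above does the same job.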
If empRevs is viewed as a table:
reviewId | [rating] Work/Life Balance | [rating] Culture & Values | [rating] Diversity & Inclusion | [rating] Career Opportunities | [rating] Compensation and Benefits | [rating] Senior Management | rating_num | emp_status | header | subheader | pros | cons |
---|---|---|---|---|---|---|---|---|---|---|---|---|
empReview_71400593 | 5 | 4 | 4 | 4 | 5 | 3 | 3 |  | great pay but bit of obnoxious enviornment | Nov 26, 2022 – Sales Associate/Cashier in Bensalem, PA | -Walmart’s fair pay policy is … | -some locations wont build emp… |
empReview_70963705 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | Former Employee | Walmart Employees Trained Thrown to the Wolves | Nov 10, 2022 – Data Entry | Getting a snack at break was e… | I worked at Walmart for a very… |
empReview_71415031 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | Current Employee, more than 1 year | Work | Nov 27, 2022 – Warehouse Associate in Springfield, GA | The money there is good during… | It can get stressful at times … |
empReview_69136451 | nan | nan | nan | nan | nan | nan | 4 | Current Employee | Walmart | Sep 16, 2022 – Sales Associate/Cashier | I’m a EXPERIENCED WORKER. I ✨… | In my opinion I believe that W… |
empReview_71398525 | 4 | 3 | 4 | 3 | 4 | 3 | 4 | Current Employee | Depends heavily on your team | Nov 26, 2022 – Personal Digital Shopper | I have a generally excellent t… | Generally, departments are sho… |
empReview_71227029 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | Former Employee, less than 1 year | Managers are treated like a slave. | Nov 19, 2022 – Auto Care Center Manager (ACCM) in Cottonwood, AZ | Great if you like working with… | you only get to work in your a… |
empReview_71329467 | 1 | 3 | 3 | 3 | 4 | 1 | 1 | Current Employee, more than 3 years | No more values | Nov 23, 2022 – GM Coach in Houston, TX | Pay compare to other retails a… | Walmart is not a bad company t… |
empReview_71512609 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | Former Employee | Walmart midnight stocker | Nov 30, 2022 – Midnight Stocker in Taylor, MI | 2 paid 15 min breaks and 1 hou… | Honestly nothing that I can th… |
empReview_70585957 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | Former Employee | Lots of Opportunity | Oct 28, 2022 – Human Resources People Lead | Plenty of opportunities if one… | As with any job, management is… |
empReview_71519435 | 3 | 4 | 4 | 5 | 4 | 4 | 5 | Current Employee, more than 3 years | Lot of work but worth it | Nov 30, 2022 – People Lead | I enjoy making associates live… | Sometimes an overwhelming amou… |
Markdown for the table above was printed with pandas:
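import pandas  # to_markdown() also needs the tabulate package installed

erdf = pandas.DataFrame(empRevs).set_index('reviewId')
# shorten long pros/cons so the printed table stays readable
erdf['pros'] = [p[:30] + '...' if len(p) > 33 else p for p in erdf['pros']]
erdf['cons'] = [p[:30] + '...' if len(p) > 33 else p for p in erdf['cons']]
print(erdf.to_markdown())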
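One caveat with the truncation step: refDict lookups can leave None in pros or cons when a selector matches nothing, and len(None) raises a TypeError. A small helper along these lines should sidestep that. The name shorten and its parameters are made up for this sketch; the 30/33 cut-offs simply mirror the list comprehensions above.

```python
def shorten(text, keep=30, limit=33):
    """Trim text to keep characters plus '...' when it is longer than limit.

    Anything that is not a string (None, NaN, ...) is returned unchanged.
    """
    if not isinstance(text, str):
        return text
    return text[:keep] + '...' if len(text) > limit else text


print(shorten('Getting a snack at break was easy and quick'))
print(shorten(None))
```

With that in place, the two assignments could be written as erdf['pros'] = [shorten(p) for p in erdf['pros']], and likewise for cons.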