Trouble scraping data rows – beautifulSoup
Question:
I'm a beginner working with Python and Beautiful Soup, attempting to scrape election results from a state elections page. I worked through the book ‘Learning to Code with Baseball’ to learn the basics, including the fifth chapter, which covers scraping.
I am working on scraping one table from the site, which looks like this:
Candidate | Total Votes | Pct |
---|---|---|
Abraham Lincoln | 53990 | 42.1% |
George Washington | 37326 | 29.1% |
After using BeautifulSoup to read the entire site and identify the tables, I was able to isolate this table from the rest of the tables on the site and identify the header row using:
gov_table = tables[3]
rows = gov_table.find_all('tr')
header_row = rows[0]
The trouble I ran into was with the data rows. I cannot seem to pick up the candidates’ names, only their ‘total votes’ and ‘pct’.
I try:
first_data_row = rows[1]
first_data_row.find_all('td')
which gives the HTML:
[<td class="candidate" data-title="Candidate" scope="row">ABRAHAM LINCOLN <span class="smalltext">(DEM)</span> </td>,
<td class="number mail-in" width="25%">
<ul class="mailinbreakout">
<li>Polling place: 51771</li>
<li>Mail ballots: 2219</li>
</ul>
</td>,
<td class="number total votes" data-title="Total votes">53990</td>,
<td class="number total percent" data-title="Pct">42.1%</td>]
I then attempt to run a comprehension over all the td tags to collect them in a list, which I will use as the rows of a DataFrame. But the trouble is that I cannot seem to pick up the candidate’s name:
In [82]: [str(x.string) for x in first_data_row.find_all('td')]
Out[82]: ['None', 'None', '53990', '42.1%']
I’m really stumped by the ‘None’ strings, as they don’t appear anywhere in the table rows themselves. I have tried narrowing it down further using
In [83]: [str(x.string) for x in first_data_row.find_all('td', {'scope': 'row'})]
Out[83]: ['None']
or
In[87]: first_candidate_name = first_data_row.find_all('td')[0]
...first_candidate_name
...str(first_candidate_name.string)
Out[87]: 'None'
With similar results.
I am sure I am missing something relatively minor but my beginning eyes can’t narrow it down.
Answers:
This should help you resolve the conundrum. Let me know if something is not clear.
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.ri.gov/election/results/2014/statewide_primary"
r = requests.get(url)
soup = bs(r.content, 'html.parser')
table_5 = soup.find_all('table')[5]
trs = table_5.find_all('tr')
tds = trs[1].find_all('td')
print(tds[0].text)
print(tds[2].text)
print(tds[3].text)
Output
Nellie M. GORBEA (DEM)
58444
51.4%
You’re using .string to access the content within the cells, and some of these cells have multiple children, which means .string will return None.
On the other hand, .get_text() returns all the strings of the children concatenated into one string.
> [str(x.string) for x in first_data_row.find_all('td')]
> ['None', 'None', '53990', '42.1%']
> [str(x.get_text()) for x in first_data_row.find_all('td')]
> ['Gina M. RAIMONDO (DEM) ', '\n\nPolling\xa0place:\xa051771\nMail\xa0ballots:\xa02219\n\n', '53990', '42.1%']
From the documentation:
.string
- If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
- If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:
- If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:
.get_text()
If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
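The three .string rules quoted above can be checked on a small, self-contained snippet. This is a minimal sketch: the first cell is taken from the question, while the `wrapped` and `plain` cells are made up for illustration and do not come from the real page.

```python
from bs4 import BeautifulSoup

# First cell mirrors the question; the other two are hypothetical.
html = """
<table><tr>
<td class="candidate">ABRAHAM LINCOLN <span class="smalltext">(DEM)</span> </td>
<td class="wrapped"><b>53990</b></td>
<td class="plain">42.1%</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
candidate, wrapped, plain = soup.find_all("td")

# Rule 1: a single NavigableString child is exposed as .string.
plain_string = plain.string
# Rule 2: the only child is a tag with a .string, so the parent shares it.
wrapped_string = wrapped.string
# Rule 3: text, a <span>, and trailing whitespace are three children,
# so .string is None.
candidate_string = candidate.string
# get_text() concatenates every string beneath the tag instead.
candidate_text = candidate.get_text()

print(plain_string, wrapped_string, candidate_string, repr(candidate_text))
```

The candidate cell is the only one that falls under the third rule, which is why str(x.string) produced 'None' for it in the question.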
To pull Candidate, Total votes and Pct, you can iterate over each row’s stripped_strings generator and then pick the pieces you need by list indexing.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.ri.gov/election/results/2014/statewide_primary/#"
res = requests.get(url)  # a plain GET is enough to fetch the page
soup = BeautifulSoup(res.text, 'lxml')  # requires the lxml package

data = []
for tr in soup.select('div.raceResults table tbody tr'):
    # All non-empty text fragments in the row, whitespace stripped
    t = list(tr.stripped_strings)
    data.append({
        'Candidate': t[0] + t[1],  # name + party, e.g. "John F. REED(DEM)"
        'Total votes': t[-2],
        'Pct': t[-1]
    })

df = pd.DataFrame(data)
print(df)
Output:
Candidate Total votes Pct
0 John F. REED(DEM) 98610 100.0%
1 David N. CICILLINE(DEM) 38186 63.0%
2 Matthew J. FECTEAU(DEM) 22447 37.0%
3 James R. LANGEVIN(DEM) 44512 100.0%
4 Gina M. RAIMONDO(DEM) 53990 42.1%
5 Angel TAVERAS(DEM) 37326 29.1%
6 H. Claiborne PELL(DEM) 34515 26.9%
7 Todd GIROUX(DEM) 2264 1.8%
8 Daniel J. McKEE(DEM) 50229 43.0%
9 A. Ralph MOLLIS(DEM) 42525 36.4%
10 Frank G. FERRI(DEM) 23970 20.5%
11 Nellie M. GORBEA(DEM) 58444 51.4%
12 Guillaume DE RAMEL(DEM) 55237 48.6%
13 Peter F. KILMARTIN(DEM) 91021 100.0%
14 Seth MAGAZINER(DEM) 80378 66.5%
15 Frank T. CAPRIO(DEM) 40402 33.5%
16 Mark S. ZACCARIA(REP) 23780 100.0%
17 Cormick Brendan LYNCH(REP) 6527 72.4%
18 Stanford TRAN(REP) 2483 27.6%
19 Rhue R. REIS(REP) 14143 100.0%
20 Allan W. FUNG(REP) 17530 54.9%
21 Kenneth J. BLOCK(REP) 14399 45.1%
22 Catherine Terry TAYLOR(REP) 17722 66.7%
23 Kara D. YOUNG(REP) 8831 33.3%
24 John M. CARLEVALE, SR.(REP) 23232 100.0%
25 Dawson Tucker HODGSON(REP) 23795 100.0%
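To see what stripped_strings yields for a single row without hitting the live site, here is a minimal offline sketch that reuses the row HTML from the question (wrapped in a `<table>` so it parses on its own). Joining the name and party with a space is a small deviation from the answer above, which concatenates them directly.

```python
from bs4 import BeautifulSoup

# Row HTML copied from the question, wrapped so it is a complete table.
row_html = """
<table><tr>
<td class="candidate" data-title="Candidate" scope="row">ABRAHAM LINCOLN <span class="smalltext">(DEM)</span> </td>
<td class="number mail-in" width="25%">
<ul class="mailinbreakout">
<li>Polling place: 51771</li>
<li>Mail ballots: 2219</li>
</ul>
</td>
<td class="number total votes" data-title="Total votes">53990</td>
<td class="number total percent" data-title="Pct">42.1%</td>
</tr></table>
"""
row = BeautifulSoup(row_html, "html.parser").find("tr")

# stripped_strings skips whitespace-only strings and strips the rest.
parts = list(row.stripped_strings)

record = {
    "Candidate": f"{parts[0]} {parts[1]}",  # name, then party
    "Total votes": parts[-2],               # second-to-last fragment
    "Pct": parts[-1],                       # last fragment
}
print(parts)
print(record)
```

Indexing from the end (`parts[-2]`, `parts[-1]`) keeps the lookup stable even if the number of fragments in the ballot breakout cell varies.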
You could simply loop the h3 within the results to get the positions/titles and use find_next to get the adjacent tables. Add the position to each table, read into a df using pandas, add each table to a list to turn into a single dataframe at the end with pd.concat. You can also extract, for example, the party from the candidate.
from io import StringIO
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.ri.gov/election/results/2014/statewide_primary/#"
page = requests.get(url, headers={'user-agent': 'mozilla/5.0'})
soup = BeautifulSoup(page.content, "html.parser")

dfs = []
for position in soup.select('.results h3'):
    title = position.a.text
    # Newer pandas versions expect a file-like object rather than a raw HTML string
    df = pd.read_html(StringIO(str(position.find_next('table'))))[0]
    df['Title'] = title
    dfs.append(df)

final = pd.concat(dfs, axis=0)
# "John F. REED (DEM)" -> ["John F. REED", "DEM", ""]; drop the empty last piece
final[['Candidate', 'Party']] = final['Candidate'].str.split(pat=r' \(|\)', n=2, regex=True, expand=True).iloc[:, :-1]
# "Polling place: 94157 Mail ballots: 4453" -> the numbers are tokens 2 and 5
final[['Polling place', 'Mail ballots']] = final['Ballot breakout'].str.split(pat=r'[\W\s]+', regex=True, expand=True).iloc[:, [2, 5]].astype(int)
final = final[['Title', 'Party', 'Candidate', 'Total votes', 'Pct', 'Polling place', 'Mail ballots']].sort_values(['Title', 'Party', 'Total votes'], ascending=[True, True, False])
final.reset_index(drop=True, inplace=True)
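The two split patterns are easier to follow on a single value. The sketch below uses the standard-library `re` module rather than pandas, and assumes the patterns are meant to be ` \(|\)` and `[\W\s]+` (that is, with the backslashes in place); the sample strings are taken from the output shown elsewhere in this thread.

```python
import re

candidate = "Gina M. RAIMONDO (DEM)"
breakout = "Polling place: 51771 Mail ballots: 2219"

# Split on " (" or ")": yields the name, the party, and an empty tail.
name, party, _ = re.split(r" \(|\)", candidate, maxsplit=2)

# Split on runs of non-word characters; the counts are tokens 2 and 5.
tokens = re.split(r"[\W\s]+", breakout)
polling_place, mail_ballots = int(tokens[2]), int(tokens[5])

print(name, party, polling_place, mail_ballots)
```

The empty third piece from the first split is why the pandas version drops the last column with `.iloc[:, :-1]`.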
In addition, the simplest approach would be to use pandas.read_html(), which parses the page with lxml or BeautifulSoup under the hood and is a convenient way to scrape table data.
import pandas as pd
pd.concat(pd.read_html('https://www.ri.gov/election/results/2014/statewide_primary/#'), ignore_index=True)
Output:
 | Candidate | Ballot breakout | Total votes | Pct |
---|---|---|---|---|
0 | John F. REED (DEM) | Polling place: 94157 Mail ballots: 4453 | 98610 | 100.0% |
1 | David N. CICILLINE (DEM) | Polling place: 36220 Mail ballots: 1966 | 38186 | 63.0% |
2 | Matthew J. FECTEAU (DEM) | Polling place: 21637 Mail ballots: 810 | 22447 | 37.0% |
3 | James R. LANGEVIN (DEM) | Polling place: 42740 Mail ballots: 1772 | 44512 | 100.0% |
4 | Gina M. RAIMONDO (DEM) | Polling place: 51771 Mail ballots: 2219 | 53990 | 42.1% |
…
Or using get_text():
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.ri.gov/election/results/2014/statewide_primary/#"
res = requests.get(url)  # a plain GET is enough to fetch the page
soup = BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly

data = []
for tr in soup.select('div.raceResults table tbody tr'):
    # One cleaned-up string per cell, inner fragments joined by a space
    d = [t.get_text(' ', strip=True) for t in tr.select('td')]
    del d[1]  # drop the "Ballot breakout" cell
    data.append(dict(zip(['Candidate', 'Total votes', 'Pct'], d)))

pd.DataFrame(data)
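As a quick offline check of what get_text(' ', strip=True) returns per cell, here is a minimal sketch on the row from the question. The HTML is abbreviated (attributes dropped) and wrapped in a `<table>` so it parses on its own; it is not fetched from the live page.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the row in the question.
row_html = """
<table><tr>
<td scope="row">ABRAHAM LINCOLN <span class="smalltext">(DEM)</span> </td>
<td><ul><li>Polling place: 51771</li><li>Mail ballots: 2219</li></ul></td>
<td>53990</td>
<td>42.1%</td>
</tr></table>
"""
row = BeautifulSoup(row_html, "html.parser").find("tr")

# Strip each fragment inside a cell, then join the fragments with a space.
cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
breakout = cells.pop(1)  # set the ballot breakout cell aside

record = dict(zip(["Candidate", "Total votes", "Pct"], cells))
print(record)
print(breakout)
```

Passing a separator together with strip=True avoids both the trailing space and the run-together "LINCOLN(DEM)" form.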