Trouble scraping data rows – BeautifulSoup

Question:

I'm a beginner working with Python and Beautiful Soup, attempting to scrape election results data from a state elections page. I learned my basics from the book ‘Learning to Code with Baseball’, including the fifth chapter, which covers scraping.

I am working on scraping one table from the site, which looks like this:

Candidate          Total Votes  Pct
Abraham Lincoln    53990        42.1%
George Washington  37326        29.1%

After using BeautifulSoup to read the entire site and identify the tables, I was able to isolate this table from the rest of the tables on the site and identify the header row using:

gov_table = tables[3]
rows = gov_table.find_all('tr')
header_row = rows[0]

The trouble I ran into was with the data rows: I cannot seem to pick up the candidates’ names, only their ‘total votes’ and ‘pct’.

I try:

first_data_row = rows[1]
first_data_row.find_all('td')

which gives the HTML:

[<td class="candidate" data-title="Candidate" scope="row">ABRAHAM LINCOLN <span class="smalltext">(DEM)</span> </td>,
 <td class="number mail-in" width="25%">
 <ul class="mailinbreakout">
 <li>Polling place: 51771</li>
 <li>Mail ballots: 2219</li>
 </ul>
 </td>,
 <td class="number total votes" data-title="Total votes">53990</td>,
 <td class="number total percent" data-title="Pct">42.1%</td>]

I then attempt to run a comprehension over all the td tags to collect them into a list, which I will use as the rows of a DataFrame. But the trouble is, I cannot seem to pick up the candidate’s name:

In [82]: [str(x.string) for x in first_data_row.find_all('td')]
Out[82]: ['None', 'None', '53990', '42.1%']

I’m really stumped by the ‘None’ strings, as they don’t appear anywhere in the table rows themselves. I have tried narrowing in on it further using

In [83]: [str(x.string) for x in first_data_row.find_all('td', {'scope': 'row'})]
Out[83]: ['None']

or

In [87]: first_candidate_name = first_data_row.find_all('td')[0]
    ...: first_candidate_name
    ...: str(first_candidate_name.string)
Out[87]: 'None'

With similar results.

I am sure I am missing something relatively minor, but my beginner’s eyes can’t narrow it down.

Asked By: b_shumate052


Answers:

This should help you resolve the conundrum. Let me know if something is not clear.

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.ri.gov/election/results/2014/statewide_primary"
r = requests.get(url)
soup = bs(r.content, 'html.parser')

table_5 = soup.find_all('table')[5]   # the relevant race is the sixth table on the page (zero-indexed)
trs = table_5.find_all('tr')
tds = trs[1].find_all('td')           # first data row
print(tds[0].text)                    # .text gathers all nested strings, so the candidate name comes through
print(tds[2].text)
print(tds[3].text)

Output

Nellie M. GORBEA (DEM)
58444
51.4%
Answered By: C. Pappy

You’re using .string to access the content of the cells, and some of those cells have multiple children, which means .string will return None.

On the other hand, .get_text() returns all the strings of the children concatenated into one string.

> [str(x.string) for x in first_data_row.find_all('td')]
> ['None', 'None', '53990', '42.1%']
> [str(x.get_text()) for x in first_data_row.find_all('td')]
> ['Gina M. RAIMONDO (DEM) ', '\n\nPolling\xa0place:\xa051771\nMail\xa0ballots:\xa02219\n\n', '53990', '42.1%']

From the documentation:

.string

  1. If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
  2. If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:
  3. If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

.get_text()
If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
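
A minimal sketch of the difference, using a made-up snippet rather than the actual election page:

from bs4 import BeautifulSoup

snippet = '<td>ABRAHAM LINCOLN <span>(DEM)</span></td>'
td = BeautifulSoup(snippet, 'html.parser').td

print(td.string)      # None - the td has two children (a text node and a span), so .string is undefined
print(td.get_text())  # ABRAHAM LINCOLN (DEM)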

Answered By: Water Man

To pull Candidate, Total votes and Pct, you can use the stripped_strings generator and then list indexing.
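
As a quick illustration (a minimal sketch using the row markup shown in the question rather than the live page), the name and party come back as separate pieces, which is why they are recombined with t[0] + t[1] in the code below, while the votes and percentage are the last two items:

from bs4 import BeautifulSoup

row_html = '''<tr>
<td class="candidate" scope="row">ABRAHAM LINCOLN <span class="smalltext">(DEM)</span> </td>
<td class="number mail-in"><ul><li>Polling place: 51771</li><li>Mail ballots: 2219</li></ul></td>
<td class="number total votes">53990</td>
<td class="number total percent">42.1%</td>
</tr>'''

tr = BeautifulSoup(row_html, 'html.parser').tr
print(list(tr.stripped_strings))
# ['ABRAHAM LINCOLN', '(DEM)', 'Polling place: 51771', 'Mail ballots: 2219', '53990', '42.1%']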

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.ri.gov/election/results/2014/statewide_primary/#"

res = requests.post(url)


soup = BeautifulSoup(res.text,'lxml')

data = []
for tr in soup.select('div.raceResults table tbody tr'):
    t = list(tr.stripped_strings)

    data.append({
        'Candidate': t[0] + t[1],
        'Total votes': t[-2],
        'Pct':t[-1]
        })
df = pd.DataFrame(data)
print(df)

Output:

                      Candidate Total votes     Pct
0             John F. REED(DEM)       98610  100.0%
1       David N. CICILLINE(DEM)       38186   63.0%
2       Matthew J. FECTEAU(DEM)       22447   37.0%
3        James R. LANGEVIN(DEM)       44512  100.0%
4         Gina M. RAIMONDO(DEM)       53990   42.1%
5            Angel TAVERAS(DEM)       37326   29.1%
6        H. Claiborne PELL(DEM)       34515   26.9%
7              Todd GIROUX(DEM)        2264    1.8%
8          Daniel J. McKEE(DEM)       50229   43.0%
9          A. Ralph MOLLIS(DEM)       42525   36.4%
10          Frank G. FERRI(DEM)       23970   20.5%
11        Nellie M. GORBEA(DEM)       58444   51.4%
12      Guillaume DE RAMEL(DEM)       55237   48.6%
13      Peter F. KILMARTIN(DEM)       91021  100.0%
14          Seth MAGAZINER(DEM)       80378   66.5%
15         Frank T. CAPRIO(DEM)       40402   33.5%
16        Mark S. ZACCARIA(REP)       23780  100.0%
17   Cormick Brendan LYNCH(REP)        6527   72.4%
18           Stanford TRAN(REP)        2483   27.6%
19            Rhue R. REIS(REP)       14143  100.0%
20           Allan W. FUNG(REP)       17530   54.9%
21        Kenneth J. BLOCK(REP)       14399   45.1%
22  Catherine Terry TAYLOR(REP)       17722   66.7%
23           Kara D. YOUNG(REP)        8831   33.3%
24  John M. CARLEVALE, SR.(REP)       23232  100.0%
25   Dawson Tucker HODGSON(REP)       23795  100.0%


    
Answered By: F.Hoque

You could simply loop over the h3 tags within the results to get the positions/titles and use find_next to get the adjacent table for each one. Read each table into a df with pandas, add the position to it, and append it to a list, then combine the list into a single DataFrame at the end with pd.concat. You can also extract, for example, the party from the candidate.

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.ri.gov/election/results/2014/statewide_primary/#"
page = requests.get(url, headers = {'user-agent':'mozilla/5.0'})
soup = BeautifulSoup(page.content, "html.parser")
dfs = []

for position in soup.select('.results h3'):
    title = position.a.text
    df = pd.read_html(str(position.find_next('table')))[0]
    df.loc[:, 'Title'] = title
    dfs.append(df)
    
final = pd.concat(dfs, axis = 0)
final[['Candidate','Party']] = final['Candidate'].str.split(pat=' \(|\)', n=2, regex=True, expand=True).iloc[:, :-1]
final[['Polling place', 'Mail ballots']] = final['Ballot breakout'].str.split(pat='[\W\s]+', regex=True, expand=True).iloc[:, [2,5]].astype(int)
final = final[['Title', 'Party', 'Candidate', 'Total votes', 'Pct', 'Polling place', 'Mail ballots']].sort_values(['Title', 'Party', 'Total votes'], ascending = [True, True, False])
final.reset_index(drop = True, inplace = True)

Sample result rows:


Answered By: QHarr

In addition, the simplest approach would be to use pandas.read_html(), which can use BeautifulSoup under the hood and is well suited to scraping table data.

import pandas as pd
pd.concat(pd.read_html('https://www.ri.gov/election/results/2014/statewide_primary/#'), ignore_index=True)

Output:

                  Candidate                          Ballot breakout  Total votes     Pct
0        John F. REED (DEM)  Polling place: 94157 Mail ballots: 4453        98610  100.0%
1  David N. CICILLINE (DEM)  Polling place: 36220 Mail ballots: 1966        38186   63.0%
2  Matthew J. FECTEAU (DEM)   Polling place: 21637 Mail ballots: 810        22447   37.0%
3   James R. LANGEVIN (DEM)  Polling place: 42740 Mail ballots: 1772        44512  100.0%
4    Gina M. RAIMONDO (DEM)  Polling place: 51771 Mail ballots: 2219        53990   42.1%


Or using get_text():

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.ri.gov/election/results/2014/statewide_primary/#"
res = requests.post(url)

soup = BeautifulSoup(res.text, 'html.parser')

data = []
for tr in soup.select('div.raceResults table tbody tr'):
    d = [t.get_text(' ',strip=True) for t in tr.select('td')]
    del d[1]  # drop the 'Ballot breakout' cell
    data.append(dict(zip(['Candidate','Total votes','Pct'],d)))

pd.DataFrame(data)
Answered By: HedgeHog