How to scrape multiple tables with the same name?
Question:
I am trying to scrape a site where the table classes all have the same name.
There are 3 types of tables, and I want to get the headers just once, then get the information from all three tables into an .xlsx file.
Website = https://wiki.warthunder.com/List_of_vehicle_battle_ratings
Running the code with vehical = soup.find('table') works, but I only get the first table's information.
I've tried changing it to vehical = soup.find_all('table'), but that gives me this error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Here is my full code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

def updatebr():
    url = 'https://wiki.warthunder.com/List_of_vehicle_battle_ratings'
    headers = []
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    vehical = soup.find('table')
    for i in vehical.find_all('th'):
        title = i.text
        headers.append(title)
    df = pd.DataFrame(columns=headers)
    for row in vehical.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = [td.text for td in data]
        length = len(df)
        df.loc[length] = row_data
    df.to_excel('brlist.xlsx')
Full Error Code:
Traceback (most recent call last):
  File "c:\Python\WTBR\test.py", line 35, in <module>
    updatebr()
  File "c:\Python\WTBR\test.py", line 24, in updatebr
    test = vehical.find_all('tr')
  File "C:\lib\site-packages\bs4\element.py", line 2289, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Answers:
Make it simpler: since you already use pandas, pd.read_html() will read all the tables into a list, and pd.concat() joins them into a single DataFrame:
pd.concat(
    pd.read_html(
        'https://wiki.warthunder.com/List_of_vehicle_battle_ratings',
        attrs={'class': 'wikitable'}
    ),
    ignore_index=True
).to_excel('brlist.xlsx')
| | country | type | name | ab | rb | sb |
|---|---|---|---|---|---|---|
| 0 | Italy | Utility helicopter | A.109EOA-2 | 8.7 | 9 | 9.3 |
| 1 | Italy | Attack helicopter | A-129 International (p) | 9.7 | 10 | 9.7 |
| … | … | … | … | … | … | … |
| 1945 | USSR | Frigate | Rosomacha | 4 | 4 | 4 |
| 1946 | USSR | Motor gun boat | Ya-5M | 1.3 | 1.3 | 1.3 |
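The list-then-concat behaviour is easy to check offline: pd.read_html() returns one DataFrame per matching `<table>`, and pd.concat() stacks them. A minimal sketch with an inline HTML string (the two tiny tables and their values are made up for illustration; read_html needs lxml or html5lib installed as its parser backend):

```python
from io import StringIO
import pandas as pd

# Two small made-up tables with identical headers, mimicking the wiki layout
html = """
<table class="wikitable">
  <tr><th>name</th><th>ab</th></tr>
  <tr><td>A.109EOA-2</td><td>8.7</td></tr>
</table>
<table class="wikitable">
  <tr><th>name</th><th>ab</th></tr>
  <tr><td>Ya-5M</td><td>1.3</td></tr>
</table>
"""

# One DataFrame per matching table; the <th> row becomes the header
tables = pd.read_html(StringIO(html), attrs={'class': 'wikitable'})
df = pd.concat(tables, ignore_index=True)
```

Because every table repeats the same header row, concatenating with ignore_index=True gives one continuously numbered frame, exactly like the output above.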
However, to answer your question: vehical = soup.find_all('table') returns a ResultSet, so you have to perform an additional loop to iterate over it. stripped_strings is used here to simplify extracting the cell text.
...
url = 'https://wiki.warthunder.com/List_of_vehicle_battle_ratings'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
vehical = soup.select('table.wikitable')

pd.DataFrame(
    [list(row.stripped_strings)
     for t in vehical
     for row in t.select('tr:has(td)')],
    columns=list(soup.table.tr.stripped_strings)
).to_excel('brlist.xlsx')
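For completeness, the original updatebr() needs only a small change to work with find_all(): loop over the ResultSet and collect rows from every table, taking the headers from the first one. A sketch of that fix (tables_to_df is a hypothetical helper name, split out so the parsing can be exercised without a network call):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def tables_to_df(soup):
    """Collect rows from every table; headers come from the first table only."""
    tables = soup.find_all('table')           # ResultSet: iterate it, don't call find_all on it
    headers = [th.text.strip() for th in tables[0].find_all('th')]
    rows = []
    for table in tables:                      # each element is a Tag and supports find_all()
        for row in table.find_all('tr')[1:]:  # skip each table's own header row
            cells = [td.text.strip() for td in row.find_all('td')]
            if cells:                         # guard against any other cell-less rows
                rows.append(cells)
    return pd.DataFrame(rows, columns=headers)

def updatebr():
    url = 'https://wiki.warthunder.com/List_of_vehicle_battle_ratings'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    tables_to_df(soup).to_excel('brlist.xlsx')
```

Building the DataFrame from a list of rows in one call is also much faster than appending with df.loc[len(df)] inside the loop.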