Extracting values from HTML in Python
Question:
My project involves web scraping with Python. I need to get data about a vehicle given its registration. I have managed to fetch the HTML from the site into Python, but I am struggling to extract the values.
I am using this website: https://www.carcheck.co.uk/audi/N18CTN
from bs4 import BeautifulSoup
import requests
url = "https://www.carcheck.co.uk/audi/N18CTN"
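r = requests.get(url)
# Name the parser explicitly; otherwise BeautifulSoup picks one and emits a warning
soup = BeautifulSoup(r.text, "html.parser")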
print(soup)
I need to get this information about the vehicle:
<tr>
<th>Make</th>
<td>AUDI</td>
</tr>
<tr>
<th>Model</th>
<td>A3</td>
</tr>
<tr>
<th>Colour</th>
<td>Red</td>
</tr>
<tr>
<th>Year of manufacture</th>
<td>2017</td>
</tr>
<tr>
<th>Top speed</th>
<td>147 mph</td>
</tr>
<tr>
<th>Gearbox</th>
<td>6 speed automatic</td>
</tr>
How would I go about doing this?
Answers:
You can use this example as a starting point for extracting the information from this page:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.carcheck.co.uk/audi/N18CTN'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
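# Match rows that contain both a <th> and a <td>, skipping rows that wrap a nested table
for row in soup.select('tr:has(th):has(td):not(:has(table))'):
    # The nearest preceding <h1> is the section the row belongs to
    header = row.find_previous('h1').text.strip()
    title = row.th.text.strip()
    text = row.td.text.strip()
    all_data.append((header, title, text))
df = pd.DataFrame(all_data, columns=['Header', 'Title', 'Value'])
# to_markdown() needs the optional "tabulate" package installed
print(df.head(20).to_markdown(index=False))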
Prints:
Header | Title | Value |
---|---|---|
General information | Make | AUDI |
General information | Model | A3 |
General information | Colour | Red |
General information | Year of manufacture | 2017 |
General information | Top speed | 147 mph |
General information | Gearbox | 6 speed automatic |
Engine & fuel consumption | Power | 135 kW / 184 HP |
Engine & fuel consumption | Engine capacity | 1.968 cc |
Engine & fuel consumption | Cylinders | 4 |
Engine & fuel consumption | Fuel type | Diesel |
Engine & fuel consumption | Consumption city | 42.0 mpg |
Engine & fuel consumption | Consumption extra urban | 52.3 mpg |
Engine & fuel consumption | Consumption combined | 48.0 mpg |
Engine & fuel consumption | CO2 emission | 129 g/km |
Engine & fuel consumption | CO2 label | D |
MOT history | MOT expiry date | 2023-10-27 |
MOT history | MOT pass rate | 83 % |
MOT history | MOT passed | 5 |
MOT history | Failed MOT tests | 1 |
MOT history | Total advice items | 11 |
If you are new to BeautifulSoup, a simple approach is to match the table containing the car information with a CSS selector, then extract the header and data cells and combine them into a dictionary:
import requests
from bs4 import BeautifulSoup
url = "https://www.carcheck.co.uk/audi/N18CTN"
soup = BeautifulSoup(requests.get(url).text, "lxml")
# Select the table containing the car information using CSS selector
table = soup.select_one("div.page:nth-child(2) > div:nth-child(4) > div:nth-child(1) > table:nth-child(1)")
# Extract the text of every header cell (<th>) in the table
headers = [th.text for th in table.select("th")]
# Extract the text of every data cell (<td>) in the table
data = [td.text for td in table.select("td")]
# Pair each header with its data cell; this assumes one <th> and one <td> per row
car_info = dict(zip(headers, data))
print(car_info)
Output:
{'Make': 'AUDI', 'Model': 'A3', 'Colour': 'Red', 'Year of manufacture': '2017', 'Top speed': '147 mph', 'Gearbox': '6 speed automatic'}
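Note that zipping all `<th>` cells against all `<td>` cells only lines up if every row has exactly one of each. As a hedged alternative, the sketch below pairs the cells row by row, so a row with a missing cell is skipped instead of shifting the later values. `SAMPLE_HTML` and `table_to_dict` are hypothetical names used for illustration, and the markup is invented, not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the "Notes" row deliberately has no <td>.
SAMPLE_HTML = """
<table>
  <tr><th>Make</th><td>AUDI</td></tr>
  <tr><th>Model</th><td>A3</td></tr>
  <tr><th>Notes</th></tr>
  <tr><th>Colour</th><td> Red </td></tr>
</table>
"""


def table_to_dict(table):
    """Map each row's <th> text to its <td> text, skipping incomplete rows."""
    info = {}
    for row in table.select("tr"):
        th = row.find("th")
        td = row.find("td")
        if th is None or td is None:
            continue  # incomplete row: nothing to pair
        info[th.get_text(strip=True)] = td.get_text(strip=True)
    return info


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
car_info = table_to_dict(soup.find("table"))
print(car_info)
```

With this sample, a plain `zip` over all headers and all data cells would have assigned "Red" to "Notes"; the row-wise version leaves "Notes" out and keeps "Colour" correct.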
To obtain the CSS selector for the table, you can use your browser's devtools: inspect the table element and copy its selector.
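A selector copied from devtools is positional (`nth-child`), so it can stop matching if the page layout changes. One possible alternative, sketched here under the assumption that the target table is the one containing a "Make" header cell, is to locate that cell by its text and walk up to the enclosing table. `SAMPLE_HTML` and `find_table_by_header` are hypothetical names for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two tables; only the second one has a "Make" header.
SAMPLE_HTML = """
<div><table><tr><th>Power</th><td>135 kW</td></tr></table></div>
<div><table><tr><th>Make</th><td>AUDI</td></tr></table></div>
"""


def find_table_by_header(soup, header_text):
    """Return the <table> containing a <th> whose text is header_text, or None."""
    th = soup.find("th", string=header_text)
    if th is None:
        return None
    return th.find_parent("table")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_by_header(soup, "Make")
print(table.td.get_text(strip=True))
```

The exact-text match assumes the header cell contains nothing but the label; if the real page adds whitespace or child tags inside `<th>`, the lookup would need to be loosened.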