Extracting values from HTML in Python

Question:

My project involves web scraping using Python. I need to get data about a vehicle given its registration. I have managed to get the HTML from the site into Python, but I am struggling to extract the values.
I am using this website: https://www.carcheck.co.uk/audi/N18CTN

from bs4 import BeautifulSoup
import requests

url = "https://www.carcheck.co.uk/audi/N18CTN"

r = requests.get(url)

# Name the parser explicitly; omitting it makes BeautifulSoup emit a warning
soup = BeautifulSoup(r.text, "html.parser")

print(soup)

I need to get this information about the vehicle:

<td>AUDI</td>
</tr>
<tr>
<th>Model</th>
<td>A3</td>
</tr>
<tr>
<th>Colour</th>
<td>Red</td>
</tr>
<tr>
<th>Year of manufacture</th>
<td>2017</td>
</tr>
<tr>
<th>Top speed</th>
<td>147 mph</td>
</tr>
<tr>
<th>Gearbox</th>
<td>6 speed automatic</td>

How would I go about doing this?

Asked By: charlie s

||

Answers:

You can use this example as a starting point for extracting the information from this page:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.carcheck.co.uk/audi/N18CTN'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

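all_data = []
# Match rows that hold both a <th> label and a <td> value, skipping rows that wrap a nested table
for row in soup.select('tr:has(th):has(td):not(:has(table))'):
    header = row.find_previous('h1').text.strip()  # nearest preceding section heading
    title = row.th.text.strip()
    text = row.td.text.strip()
    all_data.append((header, title, text))

df = pd.DataFrame(all_data, columns=['Header', 'Title', 'Value'])
# to_markdown() needs the optional tabulate package
print(df.head(20).to_markdown(index=False))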

Prints:

| Header                    | Title                   | Value             |
|:--------------------------|:------------------------|:------------------|
| General information       | Make                    | AUDI              |
| General information       | Model                   | A3                |
| General information       | Colour                  | Red               |
| General information       | Year of manufacture     | 2017              |
| General information       | Top speed               | 147 mph           |
| General information       | Gearbox                 | 6 speed automatic |
| Engine & fuel consumption | Power                   | 135 kW / 184 HP   |
| Engine & fuel consumption | Engine capacity         | 1.968 cc          |
| Engine & fuel consumption | Cylinders               | 4                 |
| Engine & fuel consumption | Fuel type               | Diesel            |
| Engine & fuel consumption | Consumption city        | 42.0 mpg          |
| Engine & fuel consumption | Consumption extra urban | 52.3 mpg          |
| Engine & fuel consumption | Consumption combined    | 48.0 mpg          |
| Engine & fuel consumption | CO2 emission            | 129 g/km          |
| Engine & fuel consumption | CO2 label               | D                 |
| MOT history               | MOT expiry date         | 2023-10-27        |
| MOT history               | MOT pass rate           | 83 %              |
| MOT history               | MOT passed              | 5                 |
| MOT history               | Failed MOT tests        | 1                 |
| MOT history               | Total advice items      | 11                |

Answered By: Andrej Kesely
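
If only a label-to-value lookup is needed, the same row selector can feed a plain dictionary instead of a DataFrame. The following is a minimal sketch that runs offline: `SAMPLE_HTML` is a hypothetical fragment shaped like the table in the question (not the live page), and `rows_to_dict` is an illustrative helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the table in the question; the real page
# may differ, so treat this as a stand-in for requests.get(url).text.
SAMPLE_HTML = """
<h1>General information</h1>
<table>
  <tr><th>Make</th><td>AUDI</td></tr>
  <tr><th>Model</th><td>A3</td></tr>
  <tr><th>Colour</th><td>Red</td></tr>
  <tr><th>Year of manufacture</th><td>2017</td></tr>
  <tr><th>Top speed</th><td>147 mph</td></tr>
  <tr><th>Gearbox</th><td>6 speed automatic</td></tr>
</table>
"""


def rows_to_dict(html):
    """Map each row's <th> text to its <td> text (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    info = {}
    # Only rows that contain both a label cell and a value cell are used
    for row in soup.select("tr:has(th):has(td)"):
        label = row.th.get_text(strip=True)
        value = row.td.get_text(strip=True)
        info[label] = value
    return info


car_info = rows_to_dict(SAMPLE_HTML)
print(car_info["Colour"])
```

Note that if two sections of a page reuse the same label, later rows overwrite earlier ones in the dictionary, which is one reason the DataFrame above keeps the section header as a separate column.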

If you don't have much experience with BeautifulSoup, a simple approach is to select the table containing the car information with a CSS selector, then extract the header cells and data cells and combine them into a dictionary:

import requests
from bs4 import BeautifulSoup

url = "https://www.carcheck.co.uk/audi/N18CTN"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# Select the table containing the car information with a CSS selector
table = soup.select_one("div.page:nth-child(2) > div:nth-child(4) > div:nth-child(1) > table:nth-child(1)")
# Collect the text of the header cells (<th>)
headers = [th.text for th in table.select("th")]
# Collect the text of the data cells (<td>)
data = [td.text for td in table.select("td")]
# Pair each header with its value
car_info = dict(zip(headers, data))

print(car_info)

Output:

{'Make': 'AUDI', 'Model': 'A3', 'Colour': 'Red', 'Year of manufacture': '2017', 'Top speed': '147 mph', 'Gearbox': '6 speed automatic'}

To obtain the CSS selector for the table, inspect the table element in your browser's devtools and copy its selector from the element's context menu.
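
A selector copied from devtools is positional, so it may stop matching if the page layout changes. One alternative, sketched here under assumptions, is to locate the table by a label it is known to contain. `SAMPLE_HTML` is a hypothetical fragment rather than the live page, and `table_with_header` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two tables; only the second holds the car data.
SAMPLE_HTML = """
<div class="page">
  <div><table><tr><th>Unrelated</th><td>ignore me</td></tr></table></div>
  <div>
    <table>
      <tr><th>Make</th><td>AUDI</td></tr>
      <tr><th>Model</th><td>A3</td></tr>
      <tr><th>Gearbox</th><td>6 speed automatic</td></tr>
    </table>
  </div>
</div>
"""


def table_with_header(soup, label):
    """Return the first table containing a <th> whose text equals label, or None."""
    for th in soup.find_all("th"):
        if th.get_text(strip=True) == label:
            return th.find_parent("table")
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = table_with_header(soup, "Make")

# Build the dictionary row by row so each label stays paired with its own value
car_info = {
    row.th.get_text(strip=True): row.td.get_text(strip=True)
    for row in table.find_all("tr")
    if row.th is not None and row.td is not None
}
print(car_info)
```

Pairing cells per row, rather than zipping two separate lists, also avoids misaligned keys and values if a row happens to be missing a cell.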

Answered By: Andreas Violaris