Extracting values from HTML in Python

Question

My project involves web scraping using Python. I need to get data about a vehicle given its registration. I have managed to fetch the HTML from the site in Python, but I am struggling to extract the values.
I am using this website: https://www.carcheck.co.uk/audi/N18CTN

from bs4 import BeautifulSoup
import requests

url = "https://www.carcheck.co.uk/audi/N18CTN"

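r = requests.get(url)

# Name the parser explicitly; omitting it makes BeautifulSoup guess and emit a warning
soup = BeautifulSoup(r.text, "html.parser")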

print(soup)

I need to get this information about the vehicle

<td>AUDI</td>
</tr>
<tr>
<th>Model</th>
<td>A3</td>
</tr>
<tr>
<th>Colour</th>
<td>Red</td>
</tr>
<tr>
<th>Year of manufacture</th>
<td>2017</td>
</tr>
<tr>
<th>Top speed</th>
<td>147 mph</td>
</tr>
<tr>
<th>Gearbox</th>
<td>6 speed automatic</td>

How would I go about doing this?

Asked By: charlie s

Answer 1

You can use this example as a starting point for extracting the information from this page:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.carcheck.co.uk/audi/N18CTN'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('tr:has(th):has(td):not(:has(table))'):
    header = row.find_previous('h1').text.strip()
    title = row.th.text.strip()
    text = row.td.text.strip()
    all_data.append((header, title, text))

df = pd.DataFrame(all_data, columns = ['Header', 'Title', 'Value'])
# to_markdown() requires the optional "tabulate" package
print(df.head(20).to_markdown(index=False))

Prints:

Header	Title	Value
General information	Make	AUDI
General information	Model	A3
General information	Colour	Red
General information	Year of manufacture	2017
General information	Top speed	147 mph
General information	Gearbox	6 speed automatic
Engine & fuel consumption	Power	135 kW / 184 HP
Engine & fuel consumption	Engine capacity	1.968 cc
Engine & fuel consumption	Cylinders	4
Engine & fuel consumption	Fuel type	Diesel
Engine & fuel consumption	Consumption city	42.0 mpg
Engine & fuel consumption	Consumption extra urban	52.3 mpg
Engine & fuel consumption	Consumption combined	48.0 mpg
Engine & fuel consumption	CO2 emission	129 g/km
Engine & fuel consumption	CO2 label	D
MOT history	MOT expiry date	2023-10-27
MOT history	MOT pass rate	83 %
MOT history	MOT passed	5
MOT history	Failed MOT tests	1
MOT history	Total advice items	11
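
If only a few fields are needed, the same row-by-row idea can build a dictionary keyed by the text of each th cell. The following is a minimal sketch, not part of the original answer. It runs against an inline HTML sample modelled on the snippet in the question rather than the live page, so both the sample markup and the helper name extract_fields are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet in the question (an assumption;
# the live page may be structured differently).
SAMPLE_HTML = """
<table>
<tr><th>Make</th><td>AUDI</td></tr>
<tr><th>Model</th><td>A3</td></tr>
<tr><th>Colour</th><td>Red</td></tr>
<tr><th>Year of manufacture</th><td>2017</td></tr>
<tr><th>Top speed</th><td>147 mph</td></tr>
<tr><th>Gearbox</th><td>6 speed automatic</td></tr>
</table>
"""


def extract_fields(html, wanted):
    """Return {label: value} for rows whose <th> text is in `wanted`.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.find_all("tr"):
        th, td = row.find("th"), row.find("td")
        # Skip rows that lack either a label cell or a value cell
        if th is None or td is None:
            continue
        label = th.get_text(strip=True)
        if label in wanted:
            result[label] = td.get_text(strip=True)
    return result


fields = extract_fields(SAMPLE_HTML, {"Make", "Model", "Colour"})
print(fields)
```

With the real page, the html argument would be the response body fetched with requests, and missing labels simply do not appear in the returned dictionary.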

Answered By: Andrej Kesely

Answer 2

If you don’t have much experience with BeautifulSoup, a simple approach is to select the table containing the car information with a CSS selector, then extract the header cells and data cells and combine them into a dictionary:

import requests
from bs4 import BeautifulSoup

url = "https://www.carcheck.co.uk/audi/N18CTN"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# Select the table containing the car information using CSS selector
table = soup.select_one("div.page:nth-child(2) > div:nth-child(4) > div:nth-child(1) > table:nth-child(1)")
# Extract header rows from the table and store them in a list
headers = [th.text for th in table.select("th")]
# Extract data rows from the table and store them in a list
data = [td.text for td in table.select("td")]
# Combine header rows and data rows into a dictionary using a dict comprehension
car_info = {key: value for key, value in zip(headers, data)}

print(car_info)

Output:

{'Make': 'AUDI', 'Model': 'A3', 'Colour': 'Red', 'Year of manufacture': '2017', 'Top speed': '147 mph', 'Gearbox': '6 speed automatic'}
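
A positional selector such as the one above depends on the exact layout of the page and may stop matching if the site changes. One possible alternative, sketched here under stated assumptions, is to locate the table by its content: find the th cell whose text is "Make" and walk up to its enclosing table. The inline sample markup and the helper name table_containing_label are assumptions for illustration, not taken from the live page.

```python
from bs4 import BeautifulSoup

# Two small tables standing in for the real page (assumed markup).
SAMPLE_HTML = """
<div>
<table>
<tr><th>Power</th><td>135 kW / 184 HP</td></tr>
</table>
<table>
<tr><th>Make</th><td>AUDI</td></tr>
<tr><th>Model</th><td>A3</td></tr>
</table>
</div>
"""


def table_containing_label(soup, label):
    """Return the first <table> that has a <th> with exactly this text.

    Hypothetical helper for illustration only.
    """
    for th in soup.find_all("th"):
        if th.get_text(strip=True) == label:
            return th.find_parent("table")
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = table_containing_label(soup, "Make")

car_info = {}
if table is not None:
    # Pair cells row by row so a row missing a cell cannot shift the pairing
    for row in table.find_all("tr"):
        th, td = row.find("th"), row.find("td")
        if th is not None and td is not None:
            car_info[th.get_text(strip=True)] = td.get_text(strip=True)

print(car_info)
```

Pairing th and td within each row, instead of zipping two flat lists, keeps labels and values aligned even when a row lacks one of the cells.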

To obtain the CSS selector for the table, you can use your browser's developer tools: inspect the table element and copy its selector from the element's context menu.

Answered By: Andreas Violaris
