How to remove whitespace/tab from an entry when scraping a web table? (python)

Question:

I’ve cobbled together the following code that scrapes a website table using Beautiful Soup.
The script is working as intended except for the first two entries.
Q1: The first entry is an empty list ([])… how do I omit it?
Q2: The second entry has a hidden tab creating whitespace in the second element that I can’t get rid of. How do I remove it?

Code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
testlink = "https://www.crutchfield.com/p_13692194/JL-Audio-12TW3-D8.html?tp=64077"

r = requests.get(testlink, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', class_='table table-striped')

df = pd.DataFrame(columns=['col1', 'col2'])
rows = []
for i, row in enumerate(table.find_all('tr')):
    rows.append([el.text.strip() for el in row.find_all('td')])                
    
for row in rows:
    print(row)

Results:

[]
['Size', '12        -inch']
['Impedance (Ohms)', '4, 16']
['Cone Material', 'Mica-Filled IMPP']
['Surround Material', 'Rubber']
['Ideal Sealed Box Volume (cubic feet)', '1']
['Ideal Ported Box Volume (cubic feet)', '1.3']
['Port diameter (inches)', 'N/A']
['Port length (inches)', 'N/A']
['Free-Air', 'No']
['Dual Voice Coil', 'Yes']
['Sensitivity', '84.23 dB at 1 watt']
['Frequency Response', '24 - 200 Hz']
['Max RMS Power Handling', '400']
['Peak Power Handling (Watts)', '800']
['Top Mount Depth (inches)', '3 1/2']
['Bottom Mount Depth (inches)', 'N/A']
['Cutout Diameter or Length (inches)', '11 5/8']
['Vas (liters)', '34.12']
['Fs (Hz)', '32.66']
['Qts', '0.668']
['Xmax (millimeters)', '15.2']
['Parts Warranty', '1 Year']
['Labor Warranty', '1 Year']
Asked By: Uberverbosity

||

Answers:

You can clean the results like this if you want.

rows = []
for row in table.find_all('tr'):
    cells = [
        el.text.strip().replace("\t", "")  # remove tabs
        for el in row.find_all('td')
    ]

    # don't add a row with no tds
    if cells:
        rows.append(cells)

I think you can simplify this further with the walrus operator (:=, available from Python 3.8):

rows = [
    [cell.text.strip().replace("\t", "") for cell in cells]
    for row in table.find_all('tr')
    if (cells := row.find_all('td'))
]
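
Removing tabs only helps if the gap really consists of tab characters. As a minimal sketch, assuming the gap may be any mix of tabs, spaces, and newlines, splitting on whitespace and rejoining collapses every run to a single space. The helper names `normalize_ws` and `clean_rows` and the sample data are hypothetical, not taken from the page:

```python
def normalize_ws(text):
    """Collapse every run of whitespace (tabs, spaces, newlines) to one space."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return " ".join(text.split())


def clean_rows(raw_rows):
    """Normalize each cell and skip rows that contain no cells at all."""
    return [[normalize_ws(cell) for cell in row] for row in raw_rows if row]


# Sample data standing in for the scraped cell texts (made up for illustration)
raw_rows = [[], ["Size", "12\t\t\n   -inch"], ["Impedance (Ohms)", " 4, 16 "]]
cleaned = clean_rows(raw_rows)
print(cleaned)
```

With the scraper above, the same idea would be `" ".join(el.text.split())` in place of the `strip().replace(...)` chain.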
Answered By: JonSG

Let’s simplify, shall we?

import pandas as pd

df = pd.read_html('https://www.crutchfield.com/S-f7IbEJ40aHd/p_13692194/JL-Audio-12TW3-D8.html?tp=64077')[0]
df.columns = ['Property', 'Value', 'Not Needed']
print(df[['Property', 'Value']])

Result in terminal:

    Property                              Value
0   Size                                  12 -inch
1   Impedance (Ohms)                      4, 16
2   Cone Material                         Mica-Filled IMPP
3   Surround Material                     Rubber
4   Ideal Sealed Box Volume (cubic feet)  1
5   Ideal Ported Box Volume (cubic feet)  1.3
6   Port diameter (inches)                NaN
7   Port length (inches)                  NaN
8   Free-Air                              No
9   Dual Voice Coil                       Yes
10  Sensitivity                           84.23 dB at 1 watt
11  Frequency Response                    24 - 200 Hz
12  Max RMS Power Handling                400
13  Peak Power Handling (Watts)           800
14  Top Mount Depth (inches)              3 1/2
15  Bottom Mount Depth (inches)           NaN
16  Cutout Diameter or Length (inches)    11 5/8
17  Vas (liters)                          34.12
18  Fs (Hz)                               32.66
19  Qts                                   0.668
20  Xmax (millimeters)                    15.2
21  Parts Warranty                        1 Year
22  Labor Warranty                        1 Year

See the pandas documentation for pd.read_html for details.

Answered By: Barry the Platipus

Let pandas do it all

Nothing else is needed: pandas can read tables directly from HTML.

import pandas as pd

url = 'https://www.crutchfield.com/p_13692194/JL-Audio-12TW3-D8.html?tp=64077'
df = pd.read_html(url, attrs={'class': 'table table-striped'})[0]
df.columns = ['Features', 'Specs', 'Blank']
df.drop('Blank', axis=1, inplace=True)  # get rid of the hidden column

That's it. The result looks clean to me, with no stray spaces.

If you still find leading or trailing spaces in a column, you can strip them:

df['Features'] = df['Features'].str.strip()  # not needed here
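
Stripping only removes leading and trailing whitespace. A rough sketch, assuming you also want to collapse whitespace inside a value, uses the `.str` accessor with a regular expression. The sample frame below is made up for illustration and reuses the column names from above:

```python
import pandas as pd

# Made-up sample standing in for the scraped table
df = pd.DataFrame({
    "Features": [" Size ", "Impedance (Ohms)"],
    "Specs": ["12\t\t  -inch", "4, 16"],
})

for col in ["Features", "Specs"]:
    # Replace each run of whitespace with one space, then trim both ends
    df[col] = df[col].str.replace(r"\s+", " ", regex=True).str.strip()

print(df)
```

Columns that contain non-string values would need to be filtered out or converted first, since the `.str` accessor only works on string data.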

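If you need to pass headers with the request, fetch the page with requests and pass the response to pd.read_html.

PS: it works without headers for the given URL.

from io import StringIO

# newer pandas versions expect literal HTML wrapped in a file-like object
html = requests.get(url, headers=headers).text
df = pd.read_html(StringIO(html), attrs={'class': 'table table-striped'})[0]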
Answered By: geekay