Python Pandas Web Scraping

Question

I’m trying to turn a list of tables on this page into a Pandas DataFrame:
https://intermediaries.hsbc.co.uk/products/product-finder/

I want to select the customer type box only and select one of the elements (from first to last) and then click find product to display the table for each one before concatenating all the DataFrames into 1 DataFrame.

So far I have managed to select the first element and print the table but I can’t seem to turn it into a pandas DataFrame as I get a value error: Must pass 2-d input. shape=(1, 38, 12)

This is my code:

def product_type_button(self):
    select = Select(self.driver.find_element_by_id('Availability'))
    try:
        select.select_by_visible_text('First time buyer')
    except NoSuchElementException:
        print('The item does not exist')
    time.sleep(5)
    
    self.driver.find_element_by_xpath('//button[@type="button" and (contains(text(),"Find product"))]').click()
    time.sleep(5)

def create_dataframe(self):

    data1 = pd.read_html(self.driver.page_source)
    print(data1)

    data2 = pd.DataFrame(data1)
    time.sleep(5)

    data2.to_csv('Data1.csv')

I would like to find a way to print the table for each element, maybe selecting by index instead? and then concatenating into one DataFrame. Any help would be appreciated.

Asked By: user19795989

||

Source

Answer 1

All data for the table is located inside javascript file. You can use re/json to parse it and then construct the dataframe:

import re
import json
import requests
import pandas as pd


js_src = "https://intermediaries.hsbc.co.uk/component---src-pages-products-product-finder-js-9c7004fb8446c3fe0a07.js"
data = requests.get(js_src).text

data = re.search(r"JSON.parse('(.*)')", data).group(1)
data = json.loads(data)

df = pd.DataFrame(data)
print(df.head().to_markdown(index=False))

df.to_csv("data.csv", index=False)

Prints:

Changed	NewProductCode	ProductType	Deal Period (Fixed until)	ProductDescription1	ProductTerm	Availability	Repayment Basis	Min LTV %	MaxLTV	Minimum Loan ?	Fees Payable Per	Rate	Reversionary Rate %	APR	Cashback	ERC – Payable	Unlimited lump sum payments Premitted (without fees)	Unlimited overpayment permitted (without fees)	Overpayments	ERC	Portable	Completionfee	Free Legals for Remortgage	FreeStandardValuation	Loading
Continued	4071976	Fixed	31.03.25	Fee Saver*	2 Year Fixed	FTB	CR and IO	0%	60%	10,000	5,000,000	5.99%	5.04%	5.4	No	Yes	No	No	Yes	Please refer to HSBC policy guide	Yes	17	Yes	Yes	nan
Continued	4071977	Fixed	31.03.25	Fee Saver*	2 Year Fixed	FTB	CR and IO	0%	70%	10,000	2,000,000	6.04%	5.04%	5.4	No	Yes	No	No	Yes	Please refer to HSBC policy guide	Yes	17	Yes	Yes	nan
Continued	4071978	Fixed	31.03.25	Fee Saver*	2 Year Fixed	FTB	CR and IO	0%	75%	10,000	2,000,000	6.04%	5.04%	5.4	No	Yes	No	No	Yes	Please refer to HSBC policy guide	Yes	17	Yes	Yes	nan
Continued	4071979	Fixed	31.03.25	Fee Saver*	2 Year Fixed	FTB	CR	0%	80%	10,000	1,000,000	6.14%	5.04%	5.4	No	Yes	No	No	Yes	Please refer to HSBC policy guide	Yes	17	Yes	Yes	nan
Continued	4071980	Fixed	31.03.25	Fee Saver*	2 Year Fixed	FTB	CR	0%	85%	10,000	750,000	6.19%	5.04%	5.4	No	Yes	No	No	Yes	Please refer to HSBC policy guide	Yes	17	Yes	Yes	nan

and saves data.csv (screenshot from LibreOffice):

Answered By: Andrej Kesely

Answer 2

A minimal change to your code: pd.read_html returns a list of dataframes for all tables found on the webpage.

Since there is only one table on your page, data1 is a list of 1 dataframe. This is where the error Must pass 2-d input. shape=(1, 38, 12) comes from – data1 contains 1 dataframe of shape (38, 12).

You probably want to do just:

data2 = data1[0]
data2.to_csv(...)

(Also, no need to sleep after reading from the webpage)

Answered By: ozacha

Python Pandas Web Scraping

Question:

Answers: