Python Pandas Web Scraping
Question:
I’m trying to turn a list of tables on this page into a Pandas DataFrame:
https://intermediaries.hsbc.co.uk/products/product-finder/
I want to select the customer type box only and select one of the elements (from first to last) and then click find product to display the table for each one before concatenating all the DataFrames into 1 DataFrame.
So far I have managed to select the first element and print the table but I can’t seem to turn it into a pandas DataFrame as I get a value error: Must pass 2-d input. shape=(1, 38, 12)
This is my code:
def product_type_button(self):
select = Select(self.driver.find_element_by_id('Availability'))
try:
select.select_by_visible_text('First time buyer')
except NoSuchElementException:
print('The item does not exist')
time.sleep(5)
self.driver.find_element_by_xpath('//button[@type="button" and (contains(text(),"Find product"))]').click()
time.sleep(5)
def create_dataframe(self):
data1 = pd.read_html(self.driver.page_source)
print(data1)
data2 = pd.DataFrame(data1)
time.sleep(5)
data2.to_csv('Data1.csv')
I would like to find a way to print the table for each element, maybe selecting by index instead? and then concatenating into one DataFrame. Any help would be appreciated.
Answers:
All data for the table is located inside javascript file. You can use re
/json
to parse it and then construct the dataframe:
import re
import json
import requests
import pandas as pd
js_src = "https://intermediaries.hsbc.co.uk/component---src-pages-products-product-finder-js-9c7004fb8446c3fe0a07.js"
data = requests.get(js_src).text
data = re.search(r"JSON.parse('(.*)')", data).group(1)
data = json.loads(data)
df = pd.DataFrame(data)
print(df.head().to_markdown(index=False))
df.to_csv("data.csv", index=False)
Prints:
Changed
NewProductCode
PreviousProductCode
ProductType
Deal Period (Fixed until)
ProductDescription1
ProductTerm
Availability
Repayment Basis
Min LTV %
MaxLTV
Minimum Loan ?
Fees Payable Per
Rate
Reversionary Rate %
APR
BookingFee
Cashback
CashbackValue
ERC – Payable
Unlimited lump sum payments Premitted (without fees)
Unlimited overpayment permitted (without fees)
Overpayments
ERC
Portable
Completionfee
Free Legals for Remortgage
FreeStandardValuation
Loading
Continued
4071976
Fixed
31.03.25
Fee Saver*
2 Year Fixed
FTB
CR and IO
0%
60%
10,000
5,000,000
5.99%
5.04%
5.4
0
No
0
Yes
No
No
Yes
Please refer to HSBC policy guide
Yes
17
Yes
Yes
nan
Continued
4071977
Fixed
31.03.25
Fee Saver*
2 Year Fixed
FTB
CR and IO
0%
70%
10,000
2,000,000
6.04%
5.04%
5.4
0
No
0
Yes
No
No
Yes
Please refer to HSBC policy guide
Yes
17
Yes
Yes
nan
Continued
4071978
Fixed
31.03.25
Fee Saver*
2 Year Fixed
FTB
CR and IO
0%
75%
10,000
2,000,000
6.04%
5.04%
5.4
0
No
0
Yes
No
No
Yes
Please refer to HSBC policy guide
Yes
17
Yes
Yes
nan
Continued
4071979
Fixed
31.03.25
Fee Saver*
2 Year Fixed
FTB
CR
0%
80%
10,000
1,000,000
6.14%
5.04%
5.4
0
No
0
Yes
No
No
Yes
Please refer to HSBC policy guide
Yes
17
Yes
Yes
nan
Continued
4071980
Fixed
31.03.25
Fee Saver*
2 Year Fixed
FTB
CR
0%
85%
10,000
750,000
6.19%
5.04%
5.4
0
No
0
Yes
No
No
Yes
Please refer to HSBC policy guide
Yes
17
Yes
Yes
nan
and saves data.csv
(screenshot from LibreOffice):
A minimal change to your code: pd.read_html returns a list of dataframes for all tables found on the webpage.
Since there is only one table on your page, data1
is a list of 1 dataframe. This is where the error Must pass 2-d input. shape=(1, 38, 12)
comes from – data1
contains 1 dataframe of shape (38, 12).
You probably want to do just:
data2 = data1[0]
data2.to_csv(...)
(Also, no need to sleep after reading from the webpage)
I’m trying to turn a list of tables on this page into a Pandas DataFrame:
https://intermediaries.hsbc.co.uk/products/product-finder/
I want to select the customer type box only and select one of the elements (from first to last) and then click find product to display the table for each one before concatenating all the DataFrames into 1 DataFrame.
So far I have managed to select the first element and print the table but I can’t seem to turn it into a pandas DataFrame as I get a value error: Must pass 2-d input. shape=(1, 38, 12)
This is my code:
def product_type_button(self):
select = Select(self.driver.find_element_by_id('Availability'))
try:
select.select_by_visible_text('First time buyer')
except NoSuchElementException:
print('The item does not exist')
time.sleep(5)
self.driver.find_element_by_xpath('//button[@type="button" and (contains(text(),"Find product"))]').click()
time.sleep(5)
def create_dataframe(self):
data1 = pd.read_html(self.driver.page_source)
print(data1)
data2 = pd.DataFrame(data1)
time.sleep(5)
data2.to_csv('Data1.csv')
I would like to find a way to print the table for each element, maybe selecting by index instead? and then concatenating into one DataFrame. Any help would be appreciated.
All data for the table is located inside javascript file. You can use re
/json
to parse it and then construct the dataframe:
import re
import json
import requests
import pandas as pd
js_src = "https://intermediaries.hsbc.co.uk/component---src-pages-products-product-finder-js-9c7004fb8446c3fe0a07.js"
data = requests.get(js_src).text
data = re.search(r"JSON.parse('(.*)')", data).group(1)
data = json.loads(data)
df = pd.DataFrame(data)
print(df.head().to_markdown(index=False))
df.to_csv("data.csv", index=False)
Prints:
Changed | NewProductCode | PreviousProductCode | ProductType | Deal Period (Fixed until) | ProductDescription1 | ProductTerm | Availability | Repayment Basis | Min LTV % | MaxLTV | Minimum Loan ? | Fees Payable Per | Rate | Reversionary Rate % | APR | BookingFee | Cashback | CashbackValue | ERC – Payable | Unlimited lump sum payments Premitted (without fees) | Unlimited overpayment permitted (without fees) | Overpayments | ERC | Portable | Completionfee | Free Legals for Remortgage | FreeStandardValuation | Loading |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Continued | 4071976 | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 60% | 10,000 | 5,000,000 | 5.99% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan | |
Continued | 4071977 | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 70% | 10,000 | 2,000,000 | 6.04% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan | |
Continued | 4071978 | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR and IO | 0% | 75% | 10,000 | 2,000,000 | 6.04% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan | |
Continued | 4071979 | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR | 0% | 80% | 10,000 | 1,000,000 | 6.14% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan | |
Continued | 4071980 | Fixed | 31.03.25 | Fee Saver* | 2 Year Fixed | FTB | CR | 0% | 85% | 10,000 | 750,000 | 6.19% | 5.04% | 5.4 | 0 | No | 0 | Yes | No | No | Yes | Please refer to HSBC policy guide | Yes | 17 | Yes | Yes | nan |
and saves data.csv
(screenshot from LibreOffice):
A minimal change to your code: pd.read_html returns a list of dataframes for all tables found on the webpage.
Since there is only one table on your page, data1
is a list of 1 dataframe. This is where the error Must pass 2-d input. shape=(1, 38, 12)
comes from – data1
contains 1 dataframe of shape (38, 12).
You probably want to do just:
data2 = data1[0]
data2.to_csv(...)
(Also, no need to sleep after reading from the webpage)