Turn table-element into Pandas DataFrame
Question:
I would like to turn a table into a pandas.DataFrame.
URL = 'https://ladieseuropeantour.com/reports-page/?tourn=1202&tclass=rnk&report=tmscores~season=2015~params=P*4ESC04~#/profile'
The element in question is located with:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # start a browser session
driver.get(URL)
ranking = driver.find_element(By.XPATH, ".//*[@id='maintablelive']")
I tried the following:
import pandas as pd
pd.read_html(ranking.get_attribute('outerHTML'))[0]
I am also using the dropdown menu to select different rounds. When a different round is selected, driver.current_url doesn't change, so I don't think these new tables can be loaded with requests or anything similar.
Please advise!
Answers:
Instead of using selenium, you want to access the URL's API endpoint.
Finding the API endpoint
You can trace it as follows:
- Open the URL in Chrome.
- Use Ctrl + Shift + J to open DevTools, navigate to the Network tab, select Fetch/XHR from the sub navbar, and refresh the URL.
- This reloads the network connections; when you click on one of the lines that appear, you can select Response from the second sub navbar to see if it returns data.
- Going through them, we can locate the connection responsible for returning the data for the table, namely https://ladieseuropeantour.com/api/let/cache/let/2015/2015-1202-scores-P*4ESC04.json (you can leave out the query part ?randomadd=1673086921319 at the end).
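Stripping that cache-busting query part can also be done programmatically before reusing the endpoint; here is a small sketch using only the standard library (the helper name strip_query is mine):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query string (and fragment) from a URL, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

raw = ('https://ladieseuropeantour.com/api/let/cache/let/2015/'
       '2015-1202-scores-P*4ESC04.json?randomadd=1673086921319')
print(strip_query(raw))
# https://ladieseuropeantour.com/api/let/cache/let/2015/2015-1202-scores-P*4ESC04.json
```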
Now, with this knowledge, we can simply load all the data with requests, investigate the information contained in the JSON, and dump the required subpart into a df.
For example, let’s recreate the table from your URL. This one:
Code
import requests
import pandas as pd

url = 'https://ladieseuropeantour.com/api/let/cache/let/2015/2015-1202-scores-P*4ESC04.json'
data = requests.get(url).json()

# The leaderboard rows live under data['scores']['scores_entry']
df = pd.DataFrame(data['scores']['scores_entry'])

cols = ['pos', 'name', 'nationality', 'vspar',
        'score_R1', 'score_R2', 'score_R3', 'score_R4', 'score']
df_table = df.loc[:, cols]
df_table.head()
pos name nationality vspar score_R1 score_R2 score_R3
0 1 Lydia Ko (NZL) NZL -9 70 70 72
1 2 Amy Yang (KOR) KOR -7 73 70 70
2 3 Ariya Jutanugarn (THA) THA -4 69 71 72
3 4 Jenny Shin (KOR) KOR -2 76 71 74
4 4 Ilhee Lee (KOR) KOR -2 68 82 69
score_R4 score
0 71 283
1 72 285
2 76 288
3 69 290
4 71 290
If you check print(df.columns), you'll see that the main df contains all the data behind all of the tables you can select on the URL. So, have a go at that to select whatever you are looking for.