How to scrape data from new (2023) PGA Tour website in Python
Question:
The PGA Tour updated their website (as of Feb 7, 2023) in a way that completely broke how I was scraping it for data. It used to have a "hidden" URL that you could uncover by looking at the Network tab in Developer Tools; I could then use that "hidden" URL with Requests in Python to pull the data tables.
For background on how it used to work, see the response from this previous post of mine: What to do when Python requests.get gets a browser error from the website?.
Now it seems like all the data is hidden away from that kind of URL access. I'm hoping someone more fluent in web-scraping tricks can point me in the right direction to do what that previous link did:
- For any tournament, be able to pull tournament history from any year/season. (Example from the new site: https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results)
- For any statistic, be able to pull stats from any year/season. (Example from the new site: https://www.pgatour.com/stats/detail/02674)
My initial attempt (below) can pull the table from the current page, but not from previous years, and some of the pulled data is not text but formatting code.
import requests
import pandas as pd
tournament_url = 'https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results'
headers = {'User-Agent': 'Mozilla/5.0'}
t = pd.read_html(requests.get(tournament_url, headers=headers).text)[0]
t
EDIT: I see from a response below that this is using GraphQL. I discovered that if you click on the graphql line in the Network tab and then look at the Payload tab, you'll see these variables: { "tournamentPastResultsId": "R2023464", "year": 2022 }.
These seem to give the tournament ID and year in question, so in theory you can simply update these values in a query and pick any tournament and any year. Integrating them into the scraping would mimic how it was done before. I'm not sure how to do that, though. I'll do some more research on Selenium; hopefully it can pass these variables through somehow.
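(For reference, those variables can be sent without Selenium: requests can POST them as a JSON body. Below is a minimal sketch, where build_payload is a hypothetical helper; the query placeholder must be replaced with the real query string copied from the Payload tab, and the endpoint and x-api-key value come from the same Network tab.)

```python
import json

def build_payload(tournament_id: str, year: int) -> dict:
    # package the two Network-tab variables into a GraphQL POST body;
    # the query string itself is copied from DevTools (placeholder here)
    return {
        "operationName": "TournamentPastResults",
        "variables": {
            "tournamentPastResultsId": tournament_id,  # e.g. "R2023464"
            "year": year,                              # e.g. 2022
        },
        "query": "query TournamentPastResults(...) { ... }",  # paste the real query here
    }

payload = build_payload("R2023464", 2022)
print(json.dumps(payload["variables"]))

# sending it is one POST with the JSON body and the x-api-key header:
# requests.post("https://orchestrator.pgatour.com/graphql",
#               json=payload, headers={"x-api-key": "<key from the request headers>"})
```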
EDIT 2:
The answer was given below for how to do this for the tournament data. For Stats data (e.g. https://www.pgatour.com/stats/detail/02567), I was able to modify the code to get the appropriate table (see below).
Posted below for reference (thanks to @Jurakin!)
import pandas as pd
from numpy import NaN
import requests
# the request headers seem to include a constant token ('x-api-key') that is needed
X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"
YEAR = 2022 # Stats Season
STAT_ID = "02567" # Stat ID
# prepare the payload
payload = {
    "operationName": "StatDetails",
    "variables": {
        "tourCode": "R",
        "statId": STAT_ID,
        "year": YEAR,
        "eventQuery": None
    },
    "query": "query StatDetails($tourCode: TourCode!, $statId: String!, $year: Int, $eventQuery: StatDetailEventQuery) {\n statDetails(\n tourCode: $tourCode\n statId: $statId\n year: $year\n eventQuery: $eventQuery\n ) {\n tourCode\n year\n displaySeason\n statId\n statType\n tournamentPills {\n tournamentId\n displayName\n }\n yearPills {\n year\n displaySeason\n }\n statTitle\n statDescription\n tourAvg\n lastProcessed\n statHeaders\n statCategories {\n category\n displayName\n subCategories {\n displayName\n stats {\n statId\n statTitle\n }\n }\n }\n rows {\n ... on StatDetailsPlayer {\n __typename\n playerId\n playerName\n country\n countryFlag\n rank\n rankDiff\n rankChangeTendency\n stats {\n statName\n statValue\n color\n }\n }\n ... on StatDetailTourAvg {\n __typename\n displayName\n value\n }\n }\n }\n}"
}
# post the request
page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})
# check for status code
page.raise_for_status()
# get the data
data = page.json()["data"]["statDetails"]["rows"]
# print(data)
# format into the table shown on the webpage
table = map(lambda item: {
    "rank": item["rank"],
    "player": item["playerName"],
    "average": item["stats"][0]["statValue"],
}, data)
# convert to a dataframe
s = pd.DataFrame(table)
s
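One limitation of the mapping above is that it hard-codes stats[0], so stats with multiple columns lose data. Since each row's stats list names its columns via statName, the mapping can be generalized. A sketch on a hypothetical sample (field names taken from the query above; the player data is invented, and the real input would come from page.json()["data"]["statDetails"]):

```python
import pandas as pd

# hypothetical sample shaped like the statDetails response above
details = {
    "statHeaders": ["RANK", "PLAYER", "AVG"],
    "rows": [
        {"__typename": "StatDetailsPlayer", "rank": "1",
         "playerName": "Jon Rahm",
         "stats": [{"statName": "AVG", "statValue": "69.3"}]},
    ],
}

# expand every entry of the stats list into its own column
# instead of hard-coding stats[0]
records = [
    {"rank": r["rank"], "player": r["playerName"],
     **{s["statName"]: s["statValue"] for s in r["stats"]}}
    for r in details["rows"]
]
df = pd.DataFrame(records)
print(df)
```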
EDIT 3 – FOLLOW-UP QUESTION:
The answer above works for stats whose Stat ID has 5 characters. But there are others with 3-character IDs (e.g. https://www.pgatour.com/stats/detail/156) that fetch the data correctly but fail in the table-mapping step, despite what looks to me like an identical response format, so I am at a loss as to why one works and the other does not.
import pandas as pd
from numpy import NaN
import requests
# the request headers seem to include a constant token ('x-api-key') that is needed
X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"
YEAR = 2022 # Stats Season
# STAT_ID = "02567" # Stat ID SGOTT
STAT_ID = "156" # Birdie Average - doesn't work for stats that only have three numbers and I can't figure out why
# prepare the payload
payload = {
    "operationName": "StatDetails",
    "variables": {
        "tourCode": "R",
        "statId": STAT_ID,
        "year": YEAR,
        "eventQuery": None
    },
    "query": "query StatDetails($tourCode: TourCode!, $statId: String!, $year: Int, $eventQuery: StatDetailEventQuery) {\n statDetails(\n tourCode: $tourCode\n statId: $statId\n year: $year\n eventQuery: $eventQuery\n ) {\n tourCode\n year\n displaySeason\n statId\n statType\n tournamentPills {\n tournamentId\n displayName\n }\n yearPills {\n year\n displaySeason\n }\n statTitle\n statDescription\n tourAvg\n lastProcessed\n statHeaders\n statCategories {\n category\n displayName\n subCategories {\n displayName\n stats {\n statId\n statTitle\n }\n }\n }\n rows {\n ... on StatDetailsPlayer {\n __typename\n playerId\n playerName\n country\n countryFlag\n rank\n rankDiff\n rankChangeTendency\n stats {\n statName\n statValue\n color\n }\n }\n ... on StatDetailTourAvg {\n __typename\n displayName\n value\n }\n }\n }\n}"
}
# post the request
page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})
# check for status code
page.raise_for_status()
# get the data
data = page.json()["data"]["statDetails"]["rows"]
print(data)
# format into the table shown on the webpage
table = map(lambda item: {
    "RANK": item["rank"],
    "PLAYER": item["playerName"],
    "AVERAGE": item["stats"][0]["statValue"],
}, data)
# convert to a dataframe
s = pd.DataFrame(table)
s
Answers:
As you can see in DevTools, the page uses GraphQL. GraphQL is a bit complicated for me and it would take a long time to deobfuscate and understand the code, so I used Selenium 4 to run the JavaScript and build the table.
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
driver = webdriver.Chrome()
# load page
driver.get("https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results")
# get table
table = driver.find_element(By.CSS_SELECTOR, "table.chakra-table")
assert table, "table not found"
# remove empty rows
driver.execute_script("""arguments[0].querySelectorAll("td.css-1au52ex").forEach((e) => e.parentElement.remove())""", table)
# get html of the table
table_html = table.get_attribute("outerHTML")
# quit selenium
driver.quit()
df = pd.read_html(table_html)[0]
print(df)
Outputs:
Pos Player R1 R2 R3 R4 To Par FedExCup Pts Official Money
0 1 Max Homa -7 -5 E -4 -16 500.0 $1,440,000
1 2 Danny Willett -4 -8 E -3 -15 300.0 $872,000
2 3 Taylor Montgomery -4 -1 E -8 -13 190.0 $552,000
3 T4 Justin Lower -9 -1 -3 +1 -12 122.5 $360,000
4 T4 Byeong Hun An -6 -4 -1 -1 -12 122.5 $360,000
.. ... ... .. ... ... ... ... ... ...
151 CUT Doc Redman +2 +6 NaN NaN +8 0.0 $0
152 CUT Kyle Stanley +6 +2 NaN NaN +8 0.0 $0
153 CUT Jim Herman -1 +10 NaN NaN +9 0.0 $0
154 CUT Taylor Lowe +9 +8 NaN NaN +17 0.0 $0
155 W/D Brandon Matthews - NaN NaN NaN E 0.0 $0
[156 rows x 9 columns]
EDIT:
I created a script that uses the GraphQL API to fetch the data, as suggested in the comments.
import pandas as pd
from numpy import NaN
import requests
# the request headers seem to include a constant token ('x-api-key') that is needed
X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"
YEAR = 2023
PAST_RESULTS_ID = "R2023464"
# prepare the payload
payload = {
    "operationName": "TournamentPastResults",
    "variables": {
        "tournamentPastResultsId": PAST_RESULTS_ID,
        "year": YEAR
    },
    "query": "query TournamentPastResults($tournamentPastResultsId: ID!, $year: Int) {\n tournamentPastResults(id: $tournamentPastResultsId, year: $year) {\n id\n players {\n id\n position\n player {\n id\n firstName\n lastName\n shortName\n displayName\n abbreviations\n abbreviationsAccessibilityText\n amateur\n country\n countryFlag\n lineColor\n }\n rounds {\n score\n parRelativeScore\n }\n additionalData\n total\n parRelativeScore\n }\n rounds\n additionalDataHeaders\n availableSeasons {\n year\n displaySeason\n }\n winner {\n id\n firstName\n lastName\n totalStrokes\n totalScore\n countryFlag\n countryName\n purse\n points\n }\n }\n}"
}
# post the request
page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})
# check for status code
page.raise_for_status()
# get the data
data = page.json()["data"]["tournamentPastResults"]["players"]
# format into the table shown on the webpage
table = map(lambda item: {
    "pos": item["position"],
    "player": item["player"]["displayName"],
    "r1": item["rounds"][0]["parRelativeScore"] if len(item["rounds"]) > 0 else NaN,
    "r2": item["rounds"][1]["parRelativeScore"] if len(item["rounds"]) > 1 else NaN,
    "r3": item["rounds"][2]["parRelativeScore"] if len(item["rounds"]) > 2 else NaN,
    "r4": item["rounds"][3]["parRelativeScore"] if len(item["rounds"]) > 3 else NaN,
    "to par": item["parRelativeScore"],
    "fedexcup pts": item["additionalData"][0],
    "official money": item["additionalData"][1],
}, data)
# convert to a dataframe
df = pd.DataFrame(table)
print(df)
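To pull any tournament and any year, the two constants can become function parameters; the availableSeasons field already requested in the query lists which years are valid. A sketch under the same endpoint and key as above (QUERY stands for the full query string from the payload and must be filled in; build_payload and fetch_past_results are hypothetical helpers):

```python
import requests

X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"
URL = "https://orchestrator.pgatour.com/graphql"
QUERY = "query TournamentPastResults($tournamentPastResultsId: ID!, $year: Int) { ... }"  # full query string as in the payload above

def build_payload(past_results_id: str, year: int) -> dict:
    # same payload as above, with the two identifiers as parameters
    return {
        "operationName": "TournamentPastResults",
        "variables": {"tournamentPastResultsId": past_results_id, "year": year},
        "query": QUERY,
    }

def fetch_past_results(past_results_id: str, year: int) -> dict:
    # network call; same x-api-key header as above
    page = requests.post(URL, json=build_payload(past_results_id, year),
                         headers={"x-api-key": X_API_KEY})
    page.raise_for_status()
    return page.json()["data"]["tournamentPastResults"]

# the response's availableSeasons field lists every year you can request, e.g.:
# for season in fetch_past_results("R2023464", 2023)["availableSeasons"]:
#     print(season["year"], season["displaySeason"])
```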
EDIT 3:
The code raises a KeyError: 'rank' because one item does not have a rank attribute. I used the following code to find the invalid item:
# get the data
data = page.json()["data"]["statDetails"]["rows"]
for item in data:
    if "rank" not in item:
        print(item)
# Outputs:
# {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"},
As you can see, its __typename is different from all the others. I found two solutions:
Solution A
Filter out items whose __typename is not equal to StatDetailsPlayer:
...
# get the data
data = page.json()["data"]["statDetails"]["rows"]
# print(data)
# filter out items whose __typename is not "StatDetailsPlayer", like
# {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"}
data = filter(lambda item: item.get("__typename", NaN) == "StatDetailsPlayer", data)
# format into the table shown on the webpage
table = map(lambda item: {
    "RANK": item["rank"],
    "PLAYER": item["playerName"],
    "AVERAGE": item["stats"][0]["statValue"],
}, data)
# convert to a dataframe
s = pd.DataFrame(table)
print(s)
Solution B
Attempt to retrieve attributes from the object where possible; otherwise return NaN.
...
def get(obj: object, keys: list, default=NaN):
    """
    obj = {"a": {"b": {"c": [0, 1, 2, 3]}}}
    keys = ["a", "b", "c", 0]
    # returns 0
    out = get(obj, keys, default=NaN)
    # returns NaN
    out = get(obj, ["a", "c"])
    """
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj
# format into the table shown on the webpage
table = map(lambda item: {
    "RANK": item.get("rank", NaN),  # NaN is the default (using the built-in dict.get)
    "PLAYER": item.get("playerName", NaN),
    "AVERAGE": get(item, ["stats", 0, "statValue"], default=NaN),  # my function (dict.get does not support nested keys)
}, data)
# convert to a dataframe
s = pd.DataFrame(table)
print(s)
Difference
Solution A's output does not contain the invalid Tour Average row; Solution B's does (with NaN in the missing fields).
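To see the two solutions side by side without hitting the API, here is a toy comparison on hypothetical data shaped like the rows printed in EDIT 3 (the player values are invented):

```python
import pandas as pd

nan = float("nan")

# hypothetical sample: one player row plus the "Tour Average" row that
# broke the original mapping
rows = [
    {"__typename": "StatDetailsPlayer", "rank": "1", "playerName": "Jon Rahm",
     "stats": [{"statName": "AVG", "statValue": "4.38"}]},
    {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"},
]

# Solution A: keep only the player rows
a = pd.DataFrame(
    {"RANK": r["rank"], "PLAYER": r["playerName"],
     "AVERAGE": r["stats"][0]["statValue"]}
    for r in rows if r["__typename"] == "StatDetailsPlayer"
)

# Solution B: keep every row, defaulting missing fields to NaN
b = pd.DataFrame(
    {"RANK": r.get("rank", nan), "PLAYER": r.get("playerName", nan),
     "AVERAGE": r["stats"][0]["statValue"] if "stats" in r else nan}
    for r in rows
)
print(len(a), len(b))
```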