How to scrape data from new (2023) PGA Tour website in Python

Question:

The PGA Tour updated their website (as of Feb 7, 2023) in a way that completely broke how I was scraping it for data. It used to have a "hidden" URL that you could uncover by looking at the Network tab in Developer Tools. I could then use that "hidden" URL with Requests in Python to pull the data tables.

For background on how it used to work, see the answer to this previous post of mine: What to do when Python requests.get gets a browser error from the website?

Now it seems all the data is hidden and can no longer be accessed through a URL as before. I’m hoping someone more fluent in web-scraping tricks can point me in the right direction to do what the approach in that previous post did:

  1. For any tournament, be able to pull tournament history from any year/season. (Example from the new site: https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results)
  2. For any statistic, be able to pull stats from any year/season. (Example from the new site: https://www.pgatour.com/stats/detail/02674)

My initial attempt can pull the table from the current page (but not previous years), and some of the data it pulls is not text but formatting code.

import requests
import pandas as pd

tournament_url = 'https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results'
headers = {'User-Agent': 'Mozilla/5.0'}
t = pd.read_html(requests.get(tournament_url, headers=headers).text)[0]
t

EDIT: I see from a response below that this is using GraphQL. I discovered that if you click on the graphql line in the Network tab and then look at the Payload tab, you’ll see these variables: { "tournamentPastResultsId": "R2023464", "year": 2022 }.

These seem to give the tournament ID and year in question so that in theory you can simply update these values in a query and pick any tournament, any year. Integrating these into the scraping would mimic how it was done prior. I’m not sure how to do that though. I’ll do some more research on Selenium. Hopefully it is able to pass through these variables somehow.

EDIT 2:
The answer was given below for how to do this for the tournament data. For Stats data (e.g. https://www.pgatour.com/stats/detail/02567), I was able to modify the code to get the appropriate table (see below).

Posted below for reference (thanks to @Jurakin!)

import pandas as pd
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0
import requests

# in the requests header seems to be a constant token ('x-api-key') that is needed
X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"

YEAR = 2022  # Stats Season
STAT_ID = "02567"  # Stat ID

# prepare the payload
payload = {
    "operationName": "StatDetails",
    "variables": {
        "tourCode": "R",
        "statId": STAT_ID,
        "year": YEAR,
        "eventQuery": None
    },
    "query": "query StatDetails($tourCode: TourCode!, $statId: String!, $year: Int, $eventQuery: StatDetailEventQuery) {\n  statDetails(\n    tourCode: $tourCode\n    statId: $statId\n    year: $year\n    eventQuery: $eventQuery\n  ) {\n    tourCode\n    year\n    displaySeason\n    statId\n    statType\n    tournamentPills {\n      tournamentId\n      displayName\n    }\n    yearPills {\n      year\n      displaySeason\n    }\n    statTitle\n    statDescription\n    tourAvg\n    lastProcessed\n    statHeaders\n    statCategories {\n      category\n      displayName\n      subCategories {\n        displayName\n        stats {\n          statId\n          statTitle\n        }\n      }\n    }\n    rows {\n      ... on StatDetailsPlayer {\n        __typename\n        playerId\n        playerName\n        country\n        countryFlag\n        rank\n        rankDiff\n        rankChangeTendency\n        stats {\n          statName\n          statValue\n          color\n        }\n      }\n      ... on StatDetailTourAvg {\n        __typename\n        displayName\n        value\n      }\n    }\n  }\n}"
}

# post the request
page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})

# check for status code
page.raise_for_status()

# get the data
data = page.json()["data"]["statDetails"]["rows"]

# print(data)

# format to a table that is in the webpage
table = map(lambda item: {
    "rank": item["rank"],
    "player": item["playerName"],
    "average": item["stats"][0]["statValue"],
}, data)

# convert to a dataframe
s = pd.DataFrame(table)

s

EDIT 3 – FOLLOW UP QUESTION:
The answer above for stats works for stats with 5 characters in the Stat ID. But for others with 3 characters (e.g. https://www.pgatour.com/stats/detail/156), the request grabs the data correctly but the table-mapping portion fails, even though as far as I can tell the Response formats are identical, so I am at a loss as to why this one does not work and the other does.

import pandas as pd
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0
import requests

# in the requests header seems to be a constant token ('x-api-key') that is needed
X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"

YEAR = 2022  # Stats Season
# STAT_ID = "02567"  # Stat ID SGOTT
STAT_ID = "156"  # Birdie Average - doesn't work for stats that only have three numbers and I can't figure out why

# prepare the payload
payload = {
    "operationName": "StatDetails",
    "variables": {
        "tourCode": "R",
        "statId": STAT_ID,
        "year": YEAR,
        "eventQuery": None
    },
    "query": "query StatDetails($tourCode: TourCode!, $statId: String!, $year: Int, $eventQuery: StatDetailEventQuery) {\n  statDetails(\n    tourCode: $tourCode\n    statId: $statId\n    year: $year\n    eventQuery: $eventQuery\n  ) {\n    tourCode\n    year\n    displaySeason\n    statId\n    statType\n    tournamentPills {\n      tournamentId\n      displayName\n    }\n    yearPills {\n      year\n      displaySeason\n    }\n    statTitle\n    statDescription\n    tourAvg\n    lastProcessed\n    statHeaders\n    statCategories {\n      category\n      displayName\n      subCategories {\n        displayName\n        stats {\n          statId\n          statTitle\n        }\n      }\n    }\n    rows {\n      ... on StatDetailsPlayer {\n        __typename\n        playerId\n        playerName\n        country\n        countryFlag\n        rank\n        rankDiff\n        rankChangeTendency\n        stats {\n          statName\n          statValue\n          color\n        }\n      }\n      ... on StatDetailTourAvg {\n        __typename\n        displayName\n        value\n      }\n    }\n  }\n}"
}

# post the request
page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})

# check for status code
page.raise_for_status()

# get the data
data = page.json()["data"]["statDetails"]["rows"]

print(data)

# format to a table that is in the webpage
table = map(lambda item: {
    "RANK": item["rank"],
    "PLAYER": item["playerName"],
    "AVERAGE": item["stats"][0]["statValue"],
}, data)

# convert to a dataframe
s = pd.DataFrame(table)

s
Asked By: Ryan Miller

||

Answers:

As you can see in devtools, the page uses GraphQL. GraphQL is a bit complicated for me, and deobfuscating and understanding the code would take a long time, so I used Selenium 4 to run the JavaScript and build the table.

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()

# load page
driver.get("https://www.pgatour.com/tournaments/2023/fortinet-championship/R2023464/past-results")

# get table
table = driver.find_element(By.CSS_SELECTOR, "table.chakra-table")
assert table, "table not found"

# remove empty rows
driver.execute_script("""arguments[0].querySelectorAll("td.css-1au52ex").forEach((e) => e.parentElement.remove())""", table)

# get html of the table
table_html = table.get_attribute("outerHTML")

# quit selenium
driver.quit()

df = pd.read_html(table_html)[0]

print(df)

Outputs:

     Pos             Player  R1   R2   R3   R4 To Par  FedExCup Pts Official Money
0      1           Max Homa  -7   -5    E   -4    -16         500.0     $1,440,000
1      2      Danny Willett  -4   -8    E   -3    -15         300.0       $872,000
2      3  Taylor Montgomery  -4   -1    E   -8    -13         190.0       $552,000
3     T4       Justin Lower  -9   -1   -3   +1    -12         122.5       $360,000
4     T4      Byeong Hun An  -6   -4   -1   -1    -12         122.5       $360,000
..   ...                ...  ..  ...  ...  ...    ...           ...            ...
151  CUT         Doc Redman  +2   +6  NaN  NaN     +8           0.0             $0
152  CUT       Kyle Stanley  +6   +2  NaN  NaN     +8           0.0             $0
153  CUT         Jim Herman  -1  +10  NaN  NaN     +9           0.0             $0
154  CUT        Taylor Lowe  +9   +8  NaN  NaN    +17           0.0             $0
155  W/D   Brandon Matthews   -  NaN  NaN  NaN      E           0.0             $0

[156 rows x 9 columns]

EDIT:

I created a script that uses the GraphQL API to fetch the data, as you suggested in the comments.

import pandas as pd
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0
import requests

# in the requests header seems to be a constant token
X_API_KEY = "da2-gsrx5bibzbb4njvhl7t37wqyl4"

YEAR = 2023
PAST_RESULTS_ID = "R2023464"

# prepare the payload
payload = {
    "operationName": "TournamentPastResults",
    "variables": {
        "tournamentPastResultsId": PAST_RESULTS_ID,
        "year": YEAR
    },
    "query": "query TournamentPastResults($tournamentPastResultsId: ID!, $year: Int) {\n  tournamentPastResults(id: $tournamentPastResultsId, year: $year) {\n    id\n    players {\n      id\n      position\n      player {\n        id\n        firstName\n        lastName\n        shortName\n        displayName\n        abbreviations\n        abbreviationsAccessibilityText\n        amateur\n        country\n        countryFlag\n        lineColor\n      }\n      rounds {\n        score\n        parRelativeScore\n      }\n      additionalData\n      total\n      parRelativeScore\n    }\n    rounds\n    additionalDataHeaders\n    availableSeasons {\n      year\n      displaySeason\n    }\n    winner {\n      id\n      firstName\n      lastName\n      totalStrokes\n      totalScore\n      countryFlag\n      countryName\n      purse\n      points\n    }\n  }\n}"
}

# post the request
page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})

# check for status code
page.raise_for_status()

# get the data
data = page.json()["data"]["tournamentPastResults"]["players"]

# format to a table that is in the webpage
table = map(lambda item: {
    "pos": item["position"],
    "player": item["player"]["displayName"],
    "r1": item["rounds"][0]["parRelativeScore"] if len(item["rounds"]) > 0 else NaN,
    "r2": item["rounds"][1]["parRelativeScore"] if len(item["rounds"]) > 1 else NaN,
    "r3": item["rounds"][2]["parRelativeScore"] if len(item["rounds"]) > 2 else NaN,
    "r4": item["rounds"][3]["parRelativeScore"] if len(item["rounds"]) > 3 else NaN,
    "to par": item["parRelativeScore"],
    "fedexcup pts": item["additionalData"][0],
    "official money": item["additionalData"][1],
}, data)

# convert to a dataframe
df = pd.DataFrame(table)

print(df)
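
To pull several seasons, the mapping step above could be wrapped in a function and called once per year. The following is only a minimal sketch under stated assumptions: build_payload and players_to_frame are hypothetical helper names, the sample_players list is made up to mirror the response shape used above, and the query text is passed in as a parameter rather than repeated.

```python
import pandas as pd


def build_payload(tournament_id: str, year: int, query: str) -> dict:
    """Assemble the GraphQL POST body for one tournament and season."""
    return {
        "operationName": "TournamentPastResults",
        "variables": {"tournamentPastResultsId": tournament_id, "year": year},
        "query": query,
    }


def players_to_frame(players: list, n_rounds: int = 4) -> pd.DataFrame:
    """Turn the `players` list of a response into one row per player."""
    rows = []
    for item in players:
        rounds = item.get("rounds") or []
        extra = item.get("additionalData") or []
        row = {"pos": item.get("position"), "player": item["player"]["displayName"]}
        for i in range(n_rounds):
            # players who missed the cut have fewer rounds; leave those cells empty
            row[f"r{i + 1}"] = rounds[i]["parRelativeScore"] if i < len(rounds) else None
        row["to par"] = item.get("parRelativeScore")
        row["fedexcup pts"] = extra[0] if len(extra) > 0 else None
        row["official money"] = extra[1] if len(extra) > 1 else None
        rows.append(row)
    return pd.DataFrame(rows)


# made-up sample in the same shape as data["tournamentPastResults"]["players"]
sample_players = [
    {"position": "1", "player": {"displayName": "Max Homa"},
     "rounds": [{"parRelativeScore": "-7"}, {"parRelativeScore": "-5"},
                {"parRelativeScore": "E"}, {"parRelativeScore": "-4"}],
     "parRelativeScore": "-16", "additionalData": ["500.0", "$1,440,000"]},
    {"position": "CUT", "player": {"displayName": "Doc Redman"},
     "rounds": [{"parRelativeScore": "+2"}, {"parRelativeScore": "+6"}],
     "parRelativeScore": "+8", "additionalData": ["0.0", "$0"]},
]
sample_df = players_to_frame(sample_players)
print(sample_df)
```

Each season's response could then be passed through players_to_frame, with a year column added to each frame before combining them with pd.concat.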

EDIT 3:

The code raises KeyError: 'rank' because one of the items has no rank key. I used the following code to find the offending item:

# get the data
data = page.json()["data"]["statDetails"]["rows"]

for item in data:
    if "rank" not in item:
        print(item)

# Outputs:
# {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"},

As you can see, its __typename is different from all the others. I found two solutions:

Solution A

Filter out items whose __typename is not equal to StatDetailsPlayer:

...

# get the data
data = page.json()["data"]["statDetails"]["rows"]

# print(data)

# filter out items whose __typename is not "StatDetailsPlayer", such as
# {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"}
data = filter(lambda item: item.get("__typename", NaN) == "StatDetailsPlayer", data)

# format to a table that is in the webpage
table = map(lambda item: {
    "RANK": item["rank"],
    "PLAYER": item["playerName"],
    "AVERAGE": item["stats"][0]["statValue"],
}, data)


# convert to a dataframe
s = pd.DataFrame(table)

print(s)

Solution B

Try to read each key from the item and fall back to NaN when it is missing:

...

def get(obj: object, keys: list, default=NaN):
    """
    Walk nested dicts/lists along `keys`; return `default` if any step fails.

    obj = {"a": {"b": {"c": [0, 1, 2, 3]}}}
    keys = ["a", "b", "c", 0]
    # returns 0
    out = get(obj, keys, default=NaN)
    # returns NaN
    out = get(obj, ["a", "c"])
    """
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            # missing dict key, list index out of range, or non-subscriptable value
            return default
    return obj

# format to a table that is in the webpage
table = map(lambda item: {
    "RANK": item.get("rank", NaN), # NaN is the default (dict's built-in get)
    "PLAYER": item.get("playerName", NaN),
    "AVERAGE": get(item, ["stats", 0, "statValue"], default=NaN), # my function (the built-in get does not support multiple keys)
}, data)

# convert to a dataframe
s = pd.DataFrame(table)

print(s)

Difference

Solution A drops the invalid row (the Tour Average entry); Solution B keeps it, with NaN in the RANK, PLAYER, and AVERAGE columns.
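
The difference can be checked offline on a couple of rows. This is a small sketch, not part of the scripts above: the rows are made up to match the response shape (the player row is illustrative, not real data), math.nan stands in for the NumPy constant, and to_row is a hypothetical helper name.

```python
import math
import pandas as pd

NaN = math.nan

# made-up rows: one player row plus the "Tour Average" row shown in the output above
rows = [
    {"__typename": "StatDetailsPlayer", "rank": 1, "playerName": "Player One",
     "stats": [{"statName": "Avg", "statValue": "4.50"}]},
    {"__typename": "StatDetailTourAvg", "displayName": "Tour Average", "value": "3.64"},
]


def get(obj, keys, default=NaN):
    """Follow `keys` through nested dicts/lists, returning `default` on any miss."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


def to_row(item):
    """Map one response row to the table columns, tolerating missing keys."""
    return {
        "RANK": item.get("rank", NaN),
        "PLAYER": item.get("playerName", NaN),
        "AVERAGE": get(item, ["stats", 0, "statValue"]),
    }


# Solution A: keep only player rows, then map
frame_a = pd.DataFrame([to_row(r) for r in rows if r.get("__typename") == "StatDetailsPlayer"])

# Solution B: map every row; the Tour Average row becomes all-NaN
frame_b = pd.DataFrame([to_row(r) for r in rows])

print(len(frame_a), len(frame_b))               # 1 2
print(frame_b.isna().all(axis=1).tolist())      # [False, True]
```

If the all-NaN row from Solution B is not wanted, frame_b.dropna(how="all") removes it afterwards.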

Answered By: Jurakin