What's the fastest way to turn json results from an API into a dataframe?

Question:

Below is an example from a sports betting app I’m working on.

games.json()['data'] contains the game id for each sports event for that day. The API then returns the odds for that specific game.

What’s the fastest option to take JSON and turn it into a pandas DataFrame? I'm currently looking into msgspec.

Some games can have over 5K total bets.

master_df = pd.DataFrame()  

for game in games.json()['data']:
            
    odds_params = {'key': api_key, 'game_id': game['id'], 'sportsbook': sportsbooks}
    odds = requests.get(api_url, params=odds_params)
    for o in odds.json()['data'][0]['odds']:
        temp = pd.DataFrame()
        temp['id'] = [game['id']]
        for k,v in game.items():
            if k != 'id' and k != 'is_live':
                temp[k] = v
                
        for k, v in o.items():
            if k == 'id':
                temp['odds_id'] = v
            else:
                temp[k] = v
                
        if len(master_df) == 0:
            master_df = temp
        else:
            master_df = pd.concat([master_df, temp])  

odds.json() response snippet:

{'data': [{'id': '35142-30886-2023-02-08',
   'sport': 'basketball',
   'league': 'NBA',
   'start_date': '2023-02-08T19:10:00-05:00',
   'home_team': 'Washington Wizards',
   'away_team': 'Charlotte Hornets',
   'is_live': False,
   'tournament': None,
   'status': 'unplayed',
   'odds': [{'id': '4BB426518ECF',
     'sports_book_name': 'Betfred',
     'name': 'Charlotte Hornets',
     'price': 135.0,
     'checked_date': '2023-02-08T11:46:12-05:00',
     'bet_points': None,
     'is_main': True,
     'is_live': False,
     'market_name': '1st Half Moneyline',
     'home_rotation_number': None,
     'away_rotation_number': None,
     'deep_link_url': None,
     'player_id': None},  
     ....

By the end of this process, I usually have about 30K records in the dataframe.

Asked By: bbennett36

||

Answers:

Here is what I would do.

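import pandas as pd
import requests

def _create_record_(game: dict, odds: dict) -> dict:
    """
    Merge one game and one odds entry into a single flat record.

    Warning: THIS MUTATES THE INPUT (odds['id'] is renamed to odds['odds_id'])
    """
    # rename the key so it does not overwrite the game's 'id' when merging
    odds['odds_id'] = odds.pop('id')
    # drop only the game-level 'is_live', as the loop in the question does
    game_fields = {k: v for k, v in game.items() if k != 'is_live'}
    # the pipe | operator is only available for dicts in Python 3.9+
    # use {**game_fields, **odds} if you get a TypeError
    return game_fields | odds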

def _get_odds(game: dict) -> list:
    params = {'key': api_key, 'game_id': game['id'], 'sportsbook': sportsbooks}
    return requests.get(api_url, params=params).json()['data'][0]['odds']

df = pd.DataFrame(
    [
        _create_record_(game, odds)
        for game in games.json()['data']
        for odds in _get_odds(game)
    ]
)

The fact that this is written as a list comprehension isn’t relevant; an equivalent for-loop would work just as well. The point is that you create a list of dicts first, then create your dataframe once. This avoids the quadratic time behavior of incrementally building a dataframe with pd.concat, which copies all the rows accumulated so far on every call.
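For reference, here is a minimal sketch of the equivalent for-loop version that runs without hitting the API. The sample payload is an assumption: it is shaped like the response snippet in the question, but the odds are nested directly under each game so no request is needed, and the second odds entry is invented for illustration. The name build_records is hypothetical too.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's response snippet.
# The second odds entry is made up purely for illustration.
sample_games = [
    {
        "id": "35142-30886-2023-02-08",
        "sport": "basketball",
        "is_live": False,
        "odds": [
            {"id": "4BB426518ECF", "sports_book_name": "Betfred",
             "price": 135.0, "is_live": False},
            {"id": "EXAMPLE00002", "sports_book_name": "Betfred",
             "price": -160.0, "is_live": False},
        ],
    }
]


def build_records(games):
    """Return one flat dict per (game, odds) pair without mutating the input."""
    records = []
    for game in games:
        # Keep the game-level fields, minus 'is_live' and the nested odds list.
        game_fields = {k: v for k, v in game.items()
                       if k not in ("is_live", "odds")}
        for odds in game["odds"]:
            record = dict(game_fields)  # shallow copy, one per row
            for k, v in odds.items():
                # Rename the odds 'id' so it does not clobber the game 'id'.
                record["odds_id" if k == "id" else k] = v
            records.append(record)
    return records


records = build_records(sample_games)
# A single DataFrame construction at the end, instead of one concat per row.
df = pd.DataFrame(records)
```

Appending to a Python list is cheap, so the total work grows roughly linearly with the number of rows, and the dataframe is allocated only once.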

Answered By: juanpa.arrivillaga