How to pull values from JSON into pandas dataframe using a string as the index value

Question:

I am looking to build a dataframe utilizing certain data from a JSON with multiple nested dictionaries.

An example of the format of the JSON is as follows:

"leagueYear": "2021",
    "stats": {
        "Cincinnati Bengals": {
            "offense": {
                "firstDownPerG": {
                    "value": "19.82",
                    "rank": "18"
                },
                "thirdDownPct": {
                    "rank": "16",
                    "value": "39.6%"
                },
                "g": {
                    "value": "17"
                }
            }
            "defense": {
                "tacklesLoss": {
                    "rank": "9",
                    "value": "80"
                },
                "passDefended": {
                    "rank": "24",
                    "value": "63"
                }
        }
        "Carolina Panthers": {
            "offense": {
                "fumbles": {
                    "value": "20",
                    "rank": "13"
                },
                "passIntPerc": {
                    "value": "3.5",
                    "rank": "31"
                }
             }
           }

I am able to format to dataframe with the following code:

with open('2021.json') as json_file:
    team_dict = json.load(json_file)
    print(team_dict.keys())
    print(team_dict['leagueYear'])
    
teams = team_dict['stats']

df = pd.DataFrame()
df['Teams'] = teams.keys()

To get a dataframe that looks like:

Teams
0 Cincinati Bengals
1 Buffalo Bills

What I need to do is write a function that will pull selected keys based on team names as the index value.

So for the games value I can use…

print(teams['Buffalo Bills']['offense']['g']['value'])

…to get the value of how many games the Bills played, however I would like to write a function that would look at the value of the team name and pull the corresponding values. I need to pull about 20 values from each team, so having a way to iterate over the df rows as a function would be very useful.

My end goal is to write a function(s) that make my df look like:

Teams Games Played First Down Per Game
0 Cincinati Bengals 17 19.82
1 Buffalo Bills 17 Whatever it is

I’m sure this is a relatively easy ask, but I am very new to using JSON and building dataframes.

My closest guess, I don’t think the function or the df.apply is right:

def games_played(game):
  team = df['Team']
  games_played = game[team]['offense']['g']['value']
  return games_played 

df['Games Played'] = df.apply(lambda games_played: ) <--- getting lost here
Asked By: Tim D

||

Answers:

You should probably look at pd.json_normalize before trying to reinvent the wheel:

data = {'leagueYear': '2021', 'stats': {'Cincinnati Bengals': {'offense': {'firstDownPerG': {'value': '19.82', 'rank': '18'}, 'thirdDownPct': {'rank': '16', 'value': '39.6%'}, 'g': {'value': '17'}}}}}
df = pd.json_normalize(data)
print(df)

Output:

  leagueYear stats.Cincinnati Bengals.offense.firstDownPerG.value stats.Cincinnati Bengals.offense.firstDownPerG.rank stats.Cincinnati Bengals.offense.thirdDownPct.rank stats.Cincinnati Bengals.offense.thirdDownPct.value stats.Cincinnati Bengals.offense.g.value
0       2021                                              19.82                                                   18                                                  16                                              39.6%                                        1

Given (I had to clean up what you pasted a little to be valid JSON):

data = {
  "leagueYear": "2021",
  "stats": {
    "Cincinnati Bengals": {
      "offense": {
        "firstDownPerG": {
          "value": "19.82",
          "rank": "18"
        },
        "thirdDownPct": {
          "rank": "16",
          "value": "39.6%"
        },
        "g": {
          "value": "17"
        }
      },
      "defense": {
        "tacklesLoss": {
          "rank": "9",
          "value": "80"
        },
        "passDefended": {
          "rank": "24",
          "value": "63"
        }
      }
    },
    "Carolina Panthers": {
      "offense": {
        "fumbles": {
          "value": "20",
          "rank": "13"
        },
        "passIntPerc": {
          "value": "3.5",
          "rank": "31"
        }
      }
    }
  }
}

Doing:

# Stats will work better if it's a list, and formatted a little differently:
data['stats'] = [{'team': k} | v for k,v in data['stats'].items()]

# This makes data look like:

{
  "leagueYear": "2021",
  "stats": [
    {
      "team": "Cincinnati Bengals",
      "offense": {
        "firstDownPerG": {
          "value": "19.82",
          "rank": "18"
        },
        "thirdDownPct": {
          "rank": "16",
          "value": "39.6%"
        },
        "g": {
          "value": "17"
        }
      },
      "defense": {
        "tacklesLoss": {
          "rank": "9",
          "value": "80"
        },
        "passDefended": {
          "rank": "24",
          "value": "63"
        }
      }
    },
    {
      "team": "Carolina Panthers",
      "offense": {
        "fumbles": {
          "value": "20",
          "rank": "13"
        },
        "passIntPerc": {
          "value": "3.5",
          "rank": "31"
        }
      }
    }
  ]
}

Now, if we use pd.json_normalize

df = pd.json_normalize(data, ['stats'])
print(df)

# Output:

                 team offense.firstDownPerG.value offense.firstDownPerG.rank offense.thirdDownPct.rank offense.thirdDownPct.value offense.g.value defense.tacklesLoss.rank defense.tacklesLoss.value defense.passDefended.rank defense.passDefended.value offense.fumbles.value offense.fumbles.rank offense.passIntPerc.value offense.passIntPerc.rank
0  Cincinnati Bengals                       19.82                         18                        16                      39.6%              17                        9                        80                        24                         63                   NaN                  NaN                       NaN                      NaN
1   Carolina Panthers                         NaN                        NaN                       NaN                        NaN             NaN                      NaN                       NaN                       NaN                        NaN                    20                   13                       3.5                       31

We can do some nifty things to clean this up, one method might be:

df = df.set_index('team')

# Make a multi-index
df.columns = df.columns.str.split('.', expand=True)
df = df.sort_index(axis=1)
print(df)

# Output:

                        defense                               offense
                   passDefended       tacklesLoss       firstDownPerG        fumbles           g passIntPerc       thirdDownPct
                           rank value        rank value          rank  value    rank value value        rank value         rank  value
team
Cincinnati Bengals           24    63           9    80            18  19.82     NaN   NaN    17         NaN   NaN           16  39.6%
Carolina Panthers           NaN   NaN         NaN   NaN           NaN    NaN      13    20   NaN          31   3.5          NaN    NaN

Now you have a nifty dataframe that you can do many things with:

# For example, Perhaps you want just a dataframe with defense ranks:

defense_rank = df.stack([-3, -1]).xs(('defense', 'rank'), axis=0, level=[-2, -1]).dropna(axis=1)
print(defense_rank)

# Output:

                   passDefended tacklesLoss
team
Cincinnati Bengals           24           9
Answered By: BeRT2me
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.