Drop certain rows based on quantity of rows with specific values

Question:

I am newer data science and am working on a project to analyze sports statistics. I have a dataset of hockey statistics for a group of players over multiple seasons. Players have anywhere between 1 row to 12 rows representing their season statistics over however many seasons they’ve played.

Example:

    Player  Season  Pos GP  G   A   P   +/- PIM P/GP    ... PPG PPP SHG SHP OTG GWG S   S%  TOI/GP  FOW%
0   Nathan MacKinnon    2022    1   65  32  56  88  22  42  1.35    ... 7   27  0   0   1   5   299 10.7    21.07   45.4
1   Nathan MacKinnon    2021    1   48  20  45  65  22  37  1.35    ... 8   25  0   0   0   2   206 9.7 20.37   48.5
2   Nathan MacKinnon    2020    1   69  35  58  93  13  12  1.35    ... 12  31  0   0   2   4   318 11.0    21.22   43.1
3   Nathan MacKinnon    2019    1   82  41  58  99  20  34  1.21    ... 12  37  0   0   1   6   365 11.2    22.08   43.7
4   Nathan MacKinnon    2018    1   74  39  58  97  11  55  1.31    ... 12  32  0   1   3   12  284 13.7    19.90   41.9
5   Nathan MacKinnon    2017    1   82  16  37  53  -14 16  0.65    ... 2   14  2   2   2   4   251 6.4 19.95   50.6
6   Nathan MacKinnon    2016    1   72  21  31  52  -4  20  0.72    ... 7   16  0   1   0   6   245 8.6 18.87   48.4
7   Nathan MacKinnon    2015    1   64  14  24  38  -7  34  0.59    ... 3   7   0   0   0   2   192 7.3 17.05   47.0
8   Nathan MacKinnon    2014    1   82  24  39  63  20  26  0.77    ... 8   17  0   0   0   5   241 10.0    17.35   42.9
9   J.T. Compher        2022    2   70  18  15  33  6   25  0.47    ... 4   6   1   1   0   0   102 17.7    16.32   51.4
10  J.T. Compher        2021    2   48  10  8   18  10  19  0.38    ... 1   2   0   0   0   2   47  21.3    14.22   45.9
11  J.T. Compher        2020    2   67  11  20  31  9   18  0.46    ... 1   5   0   3   1   3   106 10.4    16.75   47.7
12  J.T. Compher        2019    2   66  16  16  32  -8  31  0.48    ... 4   9   3   3   0   3   118 13.6    17.48   49.2
13  J.T. Compher        2018    2   69  13  10  23  -29 20  0.33    ... 4   7   2   2   2   3   131 9.9 16.00   45.1
14  J.T. Compher        2017    2   21  3   2   5   0   4   0.24    ... 1   1   0   0   0   1   30  10.0    14.93   47.6
15  Darren Helm         2022    1   68  7   8   15  -5  14  0.22    ... 0   0   1   2   0   1   93  7.5 10.55   44.2
16  Darren Helm         2021    1   47  3   5   8   -3  10  0.17    ... 0   0   0   0   0   0   83  3.6 14.68   66.7
17  Darren Helm         2020    1   68  9   7   16  -6  37  0.24    ... 0   0   1   2   0   0   102 8.8 13.73   53.6
18  Darren Helm         2019    1   61  7   10  17  -11 20  0.28    ... 0   0   1   4   0   0   107 6.5 14.57   44.4
19  Darren Helm         2018    1   75  13  18  31  3   39  0.41    ... 0   0   2   4   0   0   141 9.2 15.57   44.1

[sample of my dataset][1]
[1]: https://i.stack.imgur.com/7CsUd.png

If any player has played more than 6 seasons, I want to drop the row corresponding to Season 2021. This is because COVID drastically shortened the season and it is causing issues as I work with averages.

As you can see from the screenshot, Nathan MacKinnon has played 9 seasons. Across those 9 seasons, except for 2021, he plays in no fewer than 64 games. Due to the shortened season of 2021, he only got 48 games.
Removing Season 2021 results in an Average Games Played of 73.75.
Keeping Season 2021 in the data, the Average Games Played becomes 70.89.

While not drastic, it compounds into the other metrics as well.

I have been trying this for a little while now, but as I mentioned, I am new to this world and am struggling to figure out how to accomplish this.

I don’t want to just completely drop ALL rows for 2021 across all players, though, as some players only have 1-5 years’ worth of data and for those players, I need to use as much data as I can and remove 1 row from a player with only 2 seasons would also negatively skew averages.

I would really appreciate some assistance from anyone more experienced than me!

Asked By: Josh

||

Answers:

This can be accomplished by using groupby and apply. For example:

edited_players = (players
                  .groupby("Player")
                  .apply(lambda subset: subset if len(subset) <= 6 else subset.query("Season != 2021"))
)

Round brackets for formatting purposes.

The combination of groupby and apply basically feeds a grouped subset of your dataframe to a function. So, first all the rows of Nathan MacKinnon will be used, then rows for J.T. Compher, then Darren Helm rows, etc.

The function used is an anonymous/lambda function which operates under the following logic: "if the dataframe subset that I receive has 6 or fewer rows, I’ll return the subset unedited. Otherwise, I will filter out rows within that subset which have the value 2021 in the Season column".

Answered By: Maurice Pasternak
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.