How can repetitive rows of data be collected in a single row in pandas?

Question:

I have a dataset that contains the NBA Player’s average statistics per game. Some player’s statistics are repeated because of they’ve been in different teams in season.

For example:

      Player       Pos  Age Tm    G     GS   MP      FG
8   Jarrett Allen   C   22  TOT  28     10  26.2     4.4
9   Jarrett Allen   C   22  BRK  12     5   26.7     3.7
10  Jarrett Allen   C   22  CLE  16     5   25.9     4.9

I want to average Jarrett Allen’s stats and put them into a single row. How can I do this?

Asked By: Berkay

||

Answers:

You can groupby and use agg to get the mean. For the non numeric columns, let’s take the first value:

df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
                          for k,v in df.dtypes[1:].items()})

output:

              Pos  Age   Tm          G        GS         MP        FG
Player                                                               
Jarrett Allen   C   22  TOT  18.666667  6.666667  26.266667  4.333333

NB. content of the dictionary comprehension:

{'Pos': 'first',
 'Age': 'mean',
 'Tm': 'first',
 'G': 'mean',
 'GS': 'mean',
 'MP': 'mean',
 'FG': 'mean'}
Answered By: mozway
x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]

import pandas as pd
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')

df2 = df.groupby(['name', 'age']).mean()
print(df2)

Output:

  name  age  score
0    a   12      5
1    a   12      7
2    b   15     10
3    b   15     12
4    c   20      1
-----------
          score
name age       
a    12       6
b    15      11
c    20       1
Answered By: S2L

Option 1

If one considers the dataframe that OP shares in the question df the following will do the work

df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())

[Out]:
              Pos   Age   Tm          G        GS         MP        FG
Player                                                                
Jarrett Allen   C  22.0  TOT  18.666667  6.666667  26.266667  4.333333

This one uses:

Let’s test it with a new dataframe, df2, with more elements in the Player column.

import numpy as np

df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
                    'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
                    'Age': np.random.randint(0, 100, 9),
                    'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
                    'G': np.random.randint(0, 100, 9),
                    'GS': np.random.randint(0, 100, 9),
                    'MP': np.random.uniform(0, 100, 9),
                    'FG': np.random.uniform(0, 100, 9)})

[Out]:
          Player Pos  Age   Tm   G  GS         MP         FG
0   John Collins  PF   71  ATL  75  39  16.123225  77.949756
1   John Collins  PF   60  ATL  49  49  30.308092  24.788401
2   John Collins  PF   52  ATL  33  92  11.087317  58.488575
3     Trae Young  PG   72  ATL  20  91  62.862313  60.169282
4     Trae Young  PG   85  ATL  61  77  30.248551  85.169038
5   Clint Capela   C   73  ATL   5  67  45.817690  21.966777
6  Jarrett Allen   C   23  TOT  60  51  93.076624  34.160823
7  Jarrett Allen   C   12  BRK   2  77  74.318568  78.755869
8  Jarrett Allen   C   44  CLE  82  81   7.375631  40.930844

If one tests the operation on df2, one will get the following

df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())

[Out]:
              Pos        Age   Tm          G         GS         MP         FG
Player                                                                       
Clint Capela    C  95.000000  ATL  30.000000  98.000000  46.476398  17.987104
Jarrett Allen   C  60.000000  TOT  48.666667  19.333333  70.050540  33.572896
John Collins   PF  74.333333  ATL  50.333333  52.666667  78.181457  78.152235
Trae Young     PG  57.500000  ATL  44.500000  47.500000  46.602543  53.835455

Option 2

Depending on the desired output, assuming that one only wants to group by player (independently of Age or Tm), a simpler solution would be to just group by and pass .mean() as follows

df_new3 = df.groupby('Player').mean()

[Out]:

                Age          G        GS         MP        FG
Player                                                       
Jarrett Allen  22.0  18.666667  6.666667  26.266667  4.333333

Notes:

  • The output of this previous operation won’t display non-numerical columns (apart from the Player name).
Answered By: Gonçalo Peres