Pandas – Calculate the Average for Rows with the Same Item ID

Question:

I have a pandas DataFrame with a column of item numbers and a number next to each one. I would like to get the average of that number for each distinct item number.

Here is a part of the DataFrame:

Item ID        Time
X32TR2639      7.142857
X32TR2639      7.142857
X36SL7708      16.714286
X36TA0029      16.714286
X36TR3016      16.714286

Desired output:

Item ID        Average Time
X32TR2639      7.142857
X36SL7708      16.714286
X36TA0029      16.714286
X36TR3016      16.714286

I would like every Item ID to have an average time; if an Item ID appears more than once, the average should be taken over all of its rows.

This is only a small part of the DataFrame. As you can see, the first two rows have the same Item ID. In that case I want to use all of that item's numbers and average them. For example, the script would find every row with item number X32TR2639, take the number next to each, and compute their average.

Asked By: PyMan


Answers:

I would propose a straightforward groupby.mean and a reset_index.

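import pandas as pd

data = {"Item ID":['X32TR2639','X32TR2639','X36SL7708','X36TA0029','X36TR3016'],'time':[7.142857,7.142857,16.714286,16.714286,16.714286]}

df = pd.DataFrame(data)

# Average 'time' per 'Item ID', then turn the group key back into a regular column
df.groupby('Item ID').mean().reset_index()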

     Item ID       time
0  X32TR2639   7.142857
1  X36SL7708  16.714286
2  X36TA0029  16.714286
3  X36TR3016  16.714286
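
If the result column should be labelled "Average Time" as in the desired output, one option is named aggregation. This is only a minimal sketch: it assumes the column is called "Time" as in the question (the answer above uses "time"), and the two values for X32TR2639 are made up so that the averaging is visible.

```python
import pandas as pd

# Illustrative data: the two X32TR2639 values are invented for this sketch
df = pd.DataFrame({
    "Item ID": ["X32TR2639", "X32TR2639", "X36SL7708", "X36TA0029", "X36TR3016"],
    "Time": [6.0, 8.0, 16.714286, 16.714286, 16.714286],
})

# as_index=False keeps "Item ID" as a regular column, so no reset_index is needed.
# The dict unpacking is required because "Average Time" contains a space and
# cannot be written as a plain keyword argument.
result = df.groupby("Item ID", as_index=False).agg(
    **{"Average Time": ("Time", "mean")}
)
print(result)
```

Here the two rows for X32TR2639 collapse into one row with the mean of 6.0 and 8.0, while items that appear once keep their original value.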

Extra

I also tried this on a DataFrame with 50k rows; here is the timing.

df

              ID      time
0      X32TR2639  0.837810
1      X32TR2639  0.855781
2      X36SL7708  0.322786
3      X36TA0029  0.441353
4      X36TR3016  0.254487
         ...       ...
49995  X32TR2639  0.885251
49996  X32TR2639  0.315009
49997  X36SL7708  0.298589
49998  X36TA0029  0.229855
49999  X36TR3016  0.933437

[50000 rows x 2 columns]

%timeit df.groupby('ID').mean().reset_index()
4.76 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is the output DataFrame after running groupby.mean on the 50k-row DataFrame with duplicate IDs.

df.groupby('ID').mean().reset_index()

          ID      time
0  X32TR2639  0.493729
1  X36SL7708  0.500936
2  X36TA0029  0.501064
3  X36TR3016  0.492773
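
The answer does not show how the 50k-row frame was built, so the following is only a guess at a comparable setup: it repeats the five-row ID pattern and fills "time" with uniform random values. The seed, the variable names, and the use of timeit outside IPython are all assumptions, and the timing will differ by machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability
ids = ["X32TR2639", "X32TR2639", "X36SL7708", "X36TA0029", "X36TR3016"]
n_rows = 50_000

# Repeat the five-row ID pattern until there are 50k rows
big = pd.DataFrame({
    "ID": np.tile(ids, n_rows // len(ids)),
    "time": rng.random(n_rows),
})

means = big.groupby("ID")["time"].mean().reset_index()
print(means)

# Rough timing outside IPython; %timeit is the notebook equivalent
runs = 20
elapsed = timeit.timeit(
    lambda: big.groupby("ID")["time"].mean().reset_index(), number=runs
)
print(f"{elapsed / runs * 1e3:.2f} ms per call")
```

With uniform random input each group mean should land close to 0.5, which matches the shape of the output shown above.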