group by pandas dataframe and select latest in each group
Question:
How to group values of pandas dataframe and select the latest(by date) from each group?
For example, given a dataframe sorted by date:
id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
grouping by id or product, and selecting the earliest gives:
id product date
2 220 6647 2014-10-16
5 826 3380 2015-05-19
8 901 4555 2014-11-01
Answers:
use idxmax
in groupby
and slice df
with loc
df.loc[df.groupby('id').date.idxmax()]
id product date
2 220 6647 2014-10-16
5 826 3380 2015-05-19
8 901 4555 2014-11-01
You can also use tail
with groupby to get the last n values of the group:
df.sort_values('date').groupby('id').tail(1)
id product date
2 220 6647 2014-10-16
8 901 4555 2014-11-01
5 826 3380 2015-05-19
To use .tail()
as an aggregation method and keep your grouping intact:
df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))
id product date
id
220 2 220 6647 2014-10-16
826 5 826 3380 2015-05-19
901 8 901 4555 2014-11-01
I had a similar problem and ended up using drop_duplicates
rather than groupby
.
It seems to run significatively faster on large datasets when compared with other methods suggested above.
df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")
id product date
2 220 6647 2014-10-16
8 901 4555 2014-11-01
5 826 3380 2015-05-19
Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:
Like this:
df.groupby(['id','product']).last()
like this:
df.groupby(['id','product']).nth(-1)
or like this:
df.groupby(['id','product']).max()
If you don’t want id
and product
to appear as index use groupby(['id', 'product'], as_index=False)
.
Alternatively use:
df.groupby(['id','product']).tail(1)
#import datetime library
from datetime import datetime as dt
#transform the date column to ordinal, or create a temp column converting to ordinal.
df['date'] = df.date.apply(lambda date: date.toordinal())
#apply aggregation function depending your desire. Earliest or Latest date.
latest_date = df.groupby('id').agg(latest=('date', max))
earliest_date = df.groupby('id').agg(earliest=('date', min))
#convert it from ordinal back to date.
df['date'] = df.date.apply(lambda date: dt.fromordinal(date))
#This operation may take seconds on millions of records.
How to group values of pandas dataframe and select the latest(by date) from each group?
For example, given a dataframe sorted by date:
id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01
grouping by id or product, and selecting the earliest gives:
id product date
2 220 6647 2014-10-16
5 826 3380 2015-05-19
8 901 4555 2014-11-01
use idxmax
in groupby
and slice df
with loc
df.loc[df.groupby('id').date.idxmax()]
id product date
2 220 6647 2014-10-16
5 826 3380 2015-05-19
8 901 4555 2014-11-01
You can also use tail
with groupby to get the last n values of the group:
df.sort_values('date').groupby('id').tail(1)
id product date
2 220 6647 2014-10-16
8 901 4555 2014-11-01
5 826 3380 2015-05-19
To use .tail()
as an aggregation method and keep your grouping intact:
df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))
id product date
id
220 2 220 6647 2014-10-16
826 5 826 3380 2015-05-19
901 8 901 4555 2014-11-01
I had a similar problem and ended up using drop_duplicates
rather than groupby
.
It seems to run significatively faster on large datasets when compared with other methods suggested above.
df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")
id product date
2 220 6647 2014-10-16
8 901 4555 2014-11-01
5 826 3380 2015-05-19
Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:
Like this:
df.groupby(['id','product']).last()
like this:
df.groupby(['id','product']).nth(-1)
or like this:
df.groupby(['id','product']).max()
If you don’t want id
and product
to appear as index use groupby(['id', 'product'], as_index=False)
.
Alternatively use:
df.groupby(['id','product']).tail(1)
#import datetime library
from datetime import datetime as dt
#transform the date column to ordinal, or create a temp column converting to ordinal.
df['date'] = df.date.apply(lambda date: date.toordinal())
#apply aggregation function depending your desire. Earliest or Latest date.
latest_date = df.groupby('id').agg(latest=('date', max))
earliest_date = df.groupby('id').agg(earliest=('date', min))
#convert it from ordinal back to date.
df['date'] = df.date.apply(lambda date: dt.fromordinal(date))
#This operation may take seconds on millions of records.