How to group and count rows by month and year using Pandas?
Question:
I have a dataset with personal data such as name, height, weight and date of birth. I would like to build a graph of the number of people born in each month and year. I'm using Python pandas for this, and my strategy was to group by year and month and aggregate with count. But the closest I have got is a count of people by year or by month, not by both.
df['birthdate'].groupby(df.birthdate.dt.year).agg('count')
Other questions on Stack Overflow point to a Grouper called TimeGrouper, but I found nothing about it when searching the pandas documentation. Any ideas?
Answers:
To group on multiple criteria, pass a list of the columns or criteria:
df['birthdate'].groupby([df.birthdate.dt.year, df.birthdate.dt.month]).agg('count')
Example:
In [165]:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'birthdate': pd.date_range(start=dt.datetime(2015, 12, 20), end=dt.datetime(2016, 3, 1))})
df.groupby([df['birthdate'].dt.year, df['birthdate'].dt.month]).agg(['count'])
Out[165]:
                    birthdate
                        count
birthdate birthdate
2015      12               12
2016      1                31
          2                29
          3                 1
UPDATE: As of version 0.23.0 the above code no longer works, because multi-index level names must be unique. You now need to rename the levels for this to work:
In [107]:
df.groupby([df['birthdate'].dt.year.rename('year'), df['birthdate'].dt.month.rename('month')]).agg(['count'])
Out[107]:
           birthdate
               count
year month
2015 12           12
2016 1            31
     2            29
     3             1
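Since the goal in the question is a graph, a flat table is often easier to work with than a MultiIndex. The following is a minimal sketch, assuming the same sample `birthdate` column as above; the `counts` name and the `count` column label are only illustrative choices.

```python
import pandas as pd

# Sample data as in the example above: one birthdate per day.
df = pd.DataFrame({"birthdate": pd.date_range(start="2015-12-20", end="2016-03-01")})

# Group by year and month, naming each key so the index level names are unique.
counts = (
    df.groupby([df["birthdate"].dt.year.rename("year"),
                df["birthdate"].dt.month.rename("month")])
      .size()
      .reset_index(name="count")  # turn the MultiIndex into ordinary columns
)
print(counts)
```

The result has plain `year`, `month` and `count` columns, which most plotting tools can consume directly.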
Another solution is to set birthdate as the index and resample:
import pandas as pd
df = pd.DataFrame({'birthdate': pd.date_range(start='2015-12-20', end='2016-03-01')})
df.set_index('birthdate').resample('MS').size()
Output:
birthdate
2015-12-01    12
2016-01-01    31
2016-02-01    29
2016-03-01     1
Freq: MS, dtype: int64
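One practical difference between the two approaches: resample creates a bin for every month in the range, so months with no births show up with a count of 0, while grouping on year and month only returns the combinations present in the data. A small sketch, using made-up birthdates with a deliberate gap in February 2016:

```python
import pandas as pd

# Hypothetical birthdates with no one born in February 2016.
df = pd.DataFrame({"birthdate": pd.to_datetime(
    ["2015-12-24", "2016-01-05", "2016-01-19", "2016-03-02"])})

# resample builds a bin for every month in the range, including empty ones.
resampled = df.set_index("birthdate").resample("MS").size()

# groupby only produces groups that actually occur in the data.
grouped = df.groupby([df["birthdate"].dt.year.rename("year"),
                      df["birthdate"].dt.month.rename("month")]).size()

print(resampled)  # four rows, February 2016 has a count of 0
print(grouped)    # three rows, February 2016 is absent
```

If the graph should show empty months as zero, the resample form may be the more convenient one.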
You can also use the "monthly" period with to_period via the dt accessor:
In [11]: df = pd.DataFrame({'birthdate': pd.date_range(start='2015-12-20', end='2016-03-01')})
In [12]: df['birthdate'].groupby(df.birthdate.dt.to_period("M")).agg('count')
Out[12]:
birthdate
2015-12    12
2016-01    31
2016-02    29
2016-03     1
Freq: M, Name: birthdate, dtype: int64
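The result of the to_period approach is indexed by a PeriodIndex. If plain text labels such as "2015-12" are wanted (for axis ticks, say), the index can be converted to strings. A minimal sketch, assuming the same sample data; the `counts` name is just illustrative:

```python
import pandas as pd

df = pd.DataFrame({"birthdate": pd.date_range(start="2015-12-20", end="2016-03-01")})

# Count rows per calendar month using monthly periods as the group key.
counts = df.groupby(df["birthdate"].dt.to_period("M")).size()

# Convert the PeriodIndex to strings such as "2015-12" for use as labels.
counts.index = counts.index.astype(str)
print(counts)
```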
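It's worth noting that if the datetime is the index (rather than a column), you can call resample directly. Note that in pandas 2.2 and later the month-end alias "M" is deprecated for resample in favor of "ME":
df.resample("M").count()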
As of April 2019 (pandas 0.24.x), this works, where dates is the name of the datetime column:
df.groupby([df.dates.dt.year.rename('year'), df.dates.dt.month.rename('month')]).size()
Replace the date and count fields with your own column names. This code groups by month, sums the count column and sorts the result by date. You can also change the frequency to 1M, 2M and so on:
df[['date', 'count']].groupby(pd.Grouper(key='date', freq='1M')).sum().sort_values(by='date', ascending=True)['count']
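A self-contained sketch of the Grouper approach follows. The data is invented for illustration, and `date` and `count` stand in for your own column names. It uses the month-start alias "MS" rather than "1M", an assumption made here so that the snippet runs without deprecation warnings on recent pandas versions; the bins are the same calendar months, only labeled by their first day instead of their last.

```python
import pandas as pd

# Hypothetical data: 'date' and 'count' stand in for your own column names.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-03", "2016-01-17", "2016-02-08", "2016-03-21"]),
    "count": [5, 7, 2, 4],
})

# Bin rows by calendar month of 'date' and sum the 'count' column per bin.
monthly = df.groupby(pd.Grouper(key="date", freq="MS"))["count"].sum()
print(monthly)
```

The result is already ordered by date, so an explicit sort is not needed in this form.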