pd.Grouper with datetime key in conjunction with another grouping key seemingly creates the wrong number of groups

Question:

Using pd.Grouper with a datetime key in conjunction with another key creates a set of groups, but this does not seem to encompass all of the groups that need to be created, in my opinion.

>>> test = pd.DataFrame({"id":["a","b"]*3, "b":pd.date_range("2000-01-01","2000-01-03", freq="9H")})
>>> test
    id  b
0   a   2000-01-01 00:00:00
1   b   2000-01-01 09:00:00
2   a   2000-01-01 18:00:00
3   b   2000-01-02 03:00:00
4   a   2000-01-02 12:00:00
5   b   2000-01-02 21:00:00

When I tried to create groups based on the date and id values:

>>> g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
>>> g.groups
{(2000-01-01 00:00:00, 'a'): [0], (2000-01-02 00:00:00, 'b'): [1]}

g.groups shows only 2 groups when I expected 4 groups: both "a" and "b" for each day.

However, when I created another column based off of "b":

>>> test['date'] = test.b.dt.date
>>> g = test.groupby(['date', 'id'])
>>> g.groups
{(2000-01-01, 'a'): [0, 2], (2000-01-01, 'b'): [1], (2000-01-02, 'a'): [4], (2000-01-02, 'b'): [3, 5]}

The outcome was exactly what I expected.

I don’t know how to make sense of these different outcomes. Please enlighten me.

Asked By: VJ.BG

||

Answers:

You do have 4 groups with the Grouper, the output of g.groups is misleading (maybe worth reporting as a bug?):

g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])

g.ngroups
# 4

g.size()
# b           id
# 2000-01-01  a     2
#             b     1
# 2000-01-02  a     1
#             b     2
# dtype: int64
Answered By: mozway

I believe it is because of the difference between ‘pd.Grouper’ and the ‘dt.date’ method in pandas. ‘pd.Grouper’ groups by a range of values (e.g., daily, hourly, etc.) while ‘dt.date’ returns just the date part of a datetime object, effectively creating a categorical variable.

When you use ‘pd.Grouper’ with a frequency of "D", it will group by full days, so each day is represented by only one group. But in your case, each id has multiple records for a given day. So, ‘pd.Grouper’ is not able to capture all of the groups that you expect.

On the other hand, when you use the ‘dt.date’ method to extract the date part of the datetime, it creates a categorical variable that represents each date independently.

so when you group by this new date column along with the id column, each group will correspond to a unique combination of date and id, giving you the expected outcome.

In summary, pd.Grouper is useful when you want to group by a range of values (e.g., daily, hourly), while using a separate column for the exact values (e.g., a column for dates only) is useful when you want to group by specific values.

Answered By: Chukwu Favour
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.