Drop duplicates using pandas groupby

Question:

In the dataframe below, I would like to eliminate the duplicate cid values so the output from df.groupby('date').cid.size() matches the output from df.groupby('date').cid.nunique().

I have looked at this post but it does not seem to have a solid solution to the problem.

df = pd.read_csv('https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df')
df.groupby('date')['cid'].agg(['size', 'nunique'])

       size  nunique
date
2005      7        3
2006    237       10
2007   3610      227
2008   1318       52
2009   2664      142
2010    997       57
2011   6390      219
2012   2904       99
2013   7875      238
2014   3979      146

Things I tried:

  1. df.groupby([df['date']]).drop_duplicates(cols='cid') gives this error: AttributeError: Cannot access callable attribute 'drop_duplicates' of 'DataFrameGroupBy' objects, try using the 'apply' method
  2. df.groupby(('date').drop_duplicates('cid')) gives this error: AttributeError: 'str' object has no attribute 'drop_duplicates'
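The first error message actually points at the fix: `DataFrameGroupBy` objects don't expose `drop_duplicates` directly, but you can reach it through `apply`. A minimal sketch with a toy frame (hypothetical values standing in for the CSV):

```python
import pandas as pd

# Toy frame standing in for the linked CSV (hypothetical values)
df = pd.DataFrame({
    'date': [2005, 2005, 2005, 2006],
    'cid':  ['a', 'a', 'b', 'a'],
})

# The apply route the error message suggests: drop duplicate cids
# within each date group, then reassemble the pieces
deduped = df.groupby('date', group_keys=False).apply(
    lambda g: g.drop_duplicates('cid')
)
print(deduped.groupby('date')['cid'].size().tolist())  # [2, 1]
```

Note that this is more roundabout than the subset-based answers below; it is shown only to explain what the error message was hinting at.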
Asked By: Collective Action


Answers:

You don’t need groupby to drop duplicates based on a few columns; you can pass those columns to drop_duplicates as a subset instead:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
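By default drop_duplicates keeps the first occurrence of each (date, cid) pair; the `keep` parameter changes which row survives. A small sketch with toy data (the `val` column is hypothetical, added just to show which row is retained):

```python
import pandas as pd

# Toy frame; 'val' marks which physical row survives deduplication
df = pd.DataFrame({
    'date': [2005, 2005, 2006],
    'cid':  ['a', 'a', 'a'],
    'val':  [1, 2, 3],
})

# keep='first' (the default) retains the first row of each (date, cid) pair;
# keep='last' retains the last one instead
first = df.drop_duplicates(['date', 'cid'])
last = df.drop_duplicates(['date', 'cid'], keep='last')
print(first['val'].tolist())  # [1, 3]
print(last['val'].tolist())   # [2, 3]
```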
Answered By: ayhan

1. groupby.head(1)

The relevant groupby method for dropping duplicates in each group is groupby.head(1). Passing 1 explicitly is important: head() defaults to the first 5 rows of each group, whereas n=1 keeps only the first row of each date-cid pair.

df1 = df.groupby(['date', 'cid']).head(1)
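A quick self-contained check of this behavior on toy data (hypothetical values):

```python
import pandas as pd

# Toy frame with repeated (date, cid) pairs
df = pd.DataFrame({
    'date': [2005, 2005, 2005, 2006, 2006],
    'cid':  ['a', 'a', 'b', 'a', 'a'],
})

# head(1) keeps only the first row of each (date, cid) group;
# the default head() would keep up to 5 rows and drop nothing here
df1 = df.groupby(['date', 'cid']).head(1)
print(len(df1))  # 3 unique (date, cid) pairs
```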

2. duplicated() is more flexible

Another method is to use duplicated() to create a boolean mask and filter.

df3 = df[~df.duplicated(['date', 'cid'])]

An advantage of this method over drop_duplicates() is that it can be chained with other boolean masks to filter the dataframe more flexibly. For example, to select the unique cids in Nevada for each date, use:

df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
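A runnable sketch of this chaining on toy data (the 'state' values are hypothetical). Note that duplicated() flags duplicates over the whole frame first, so a row is kept only if it is both the overall first occurrence of its (date, cid) pair and in Nevada:

```python
import pandas as pd

# Toy frame with a 'state' column to illustrate chaining masks
df = pd.DataFrame({
    'date':  [2005, 2005, 2005, 2005],
    'cid':   ['a', 'a', 'b', 'b'],
    'state': ['NV', 'CA', 'NV', 'NV'],
})

# First occurrence of each (date, cid) pair, restricted to Nevada rows
df_nv = df[df['state'].eq('NV') & ~df.duplicated(['date', 'cid'])]
print(df_nv['cid'].tolist())  # ['a', 'b']
```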

3. groupby.sample(1)

Another method to select a unique row from each group is groupby.sample(). Unlike the previous methods, it selects a row from each group at random (whereas the others keep the first row of each group).

df4 = df.groupby(['date', 'cid']).sample(n=1)
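Because the draw is random, results differ between runs unless you fix the seed; groupby.sample accepts a random_state argument for that. A small sketch on toy data (hypothetical values):

```python
import pandas as pd

# Toy frame with one duplicated (date, cid) pair
df = pd.DataFrame({
    'date': [2005, 2005, 2005],
    'cid':  ['a', 'a', 'b'],
    'val':  [1, 2, 3],
})

# sample(n=1) picks one row per (date, cid) group at random;
# random_state makes the draw reproducible
df4 = df.groupby(['date', 'cid']).sample(n=1, random_state=0)
print(sorted(df4['cid'].tolist()))  # ['a', 'b']
```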

You can verify that df1, df2 (ayhan’s output) and df3 all produce exactly the same output, and that df4 produces an output where size and nunique of cid match for each date (as required in the OP). In short, the following returns True.

w, x, y, z = [d.groupby('date')['cid'].agg(['size', 'nunique']) for d in (df1, df2, df3, df4)]
w.equals(x) and w.equals(y) and w.equals(z)   # True

and w, x, y, z all look like the following:

       size  nunique
date
2005      3        3
2006     10       10
2007    227      227
2008     52       52
2009    142      142
2010     57       57
2011    219      219
2012     99       99
2013    238      238
2014    146      146
Answered By: cottontail