pandas: how do I select first row in each GROUP BY group?

Question:

Basically the same as "Select first row in each GROUP BY group?", only in pandas.

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'B': ['3', '1', '2', '4', '2', '4', '1', '3']})

Sorting looks promising:

df.sort_values('B')

     A  B
1  foo  1
6  bar  1
2  foo  2
4  bar  2
0  foo  3
7  bar  3
3  foo  4
5  bar  4

But then first() won’t give the desired result:

df.groupby('A').first()

     B
A     
bar  2
foo  3
Asked By: ihadanny


Answers:

Generally, if you want the rows within each group ordered by a column that is not one of the grouping keys, sort the df before performing the groupby:

In [5]:
df.sort_values('B').groupby('A').first()

Out[5]:
     B
A     
bar  1
foo  1
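To make the effect of the sort visible, here is a minimal sketch using the question's sample data. The variable names unsorted_first and sorted_first are just illustrative, and note that B holds strings, so the ordering is lexicographic rather than numeric.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
})

# Without sorting, first() takes the first row of each group in original order.
unsorted_first = df.groupby('A').first()

# Sorting by B beforehand puts the smallest B at the top of every group,
# because groupby keeps the row order within each group.
sorted_first = df.sort_values('B').groupby('A').first()

print(unsorted_first)
print(sorted_first)
```

With this data the unsorted version gives 2 for bar and 3 for foo, while the sorted version gives 1 for both.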
Answered By: EdChum

The pandas groupby function could be used for what you want, but it’s really meant for aggregation. This is a simple ‘take the first’ operation.

What you actually want is the pandas drop_duplicates function, which by default keeps the first row it encounters for each key. Pass what you would usually consider the groupby key as the subset argument:

df.drop_duplicates(subset='A')

Should do what you want.

Also, df.sort_values('A') does not sort the DataFrame df in place; it returns a sorted copy. If you want to sort df itself, you have to add the inplace=True parameter:

df.sort_values('A', inplace=True)
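A short sketch of how these two points interact, assuming the question's sample data (first_seen, sorted_df and smallest_b are illustrative names). It uses sort_values, the sort method in current pandas. On its own, drop_duplicates keeps the first row it meets for each key in the current row order, so to get the smallest B per group you need to sort first.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['3', '1', '2', '4', '2', '4', '1', '3'],
})

# Keeps the first occurrence of each key in the existing row order:
# rows 0 (foo, 3) and 4 (bar, 2).
first_seen = df.drop_duplicates(subset='A')

# sort_values returns a new DataFrame; df itself is left untouched.
sorted_df = df.sort_values('B')

# After sorting, the first occurrence per key is the row with the smallest B.
smallest_b = sorted_df.drop_duplicates(subset='A')

print(first_seen)
print(smallest_b)
```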
Answered By: firelynx

Here’s an alternative approach using groupby().rank():

df[ df.groupby('A')['B'].rank() == 1 ]

     A  B
1  foo  1
6  bar  1

This gives you the same answer as @EdChum’s for the OP’s sample dataframe, but could give a different answer if you have any ties during the sort, for example, with data like this:

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'], 
                   'B': ['2', '1', '1', '1'] })

In this case you have some options using the optional method argument, depending on how you wish to handle sorting ties:

df[ df.groupby('A')['B'].rank(method='average') == 1 ]   # the default
df[ df.groupby('A')['B'].rank(method='min')     == 1 ]
df[ df.groupby('A')['B'].rank(method='first')   == 1 ]   # doesn't work, not sure why
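A likely explanation for the method='first' failure (an assumption, not verified against every pandas version) is that B holds strings, and pandas does not support method='first' for non-numeric data. The sketch below converts B to numbers before ranking so all three tie-breaking methods can be compared on the tie example. The names b, g, avg, mn and fst are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['2', '1', '1', '1']})

# Assumption: B contains numbers stored as strings, so convert before ranking.
b = pd.to_numeric(df['B'])
g = b.groupby(df['A'])

# 'average': the tied bar rows both get rank 1.5, so neither is selected.
avg = df[g.rank(method='average') == 1]

# 'min': the tied bar rows both get rank 1, so both are selected.
mn = df[g.rank(method='min') == 1]

# 'first': ties are broken by order of appearance, so exactly one row per group.
fst = df[g.rank(method='first') == 1]

print(avg, mn, fst, sep='\n\n')
```

If you need exactly one row per group even when there are ties, method='first' on a numeric column is the variant that guarantees it.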
Answered By: JohnE

Use a sort method and then keep the first value using .drop_duplicates() with its keep argument set to 'first' (default). This approach has the benefit that it keeps your index.

Example:

(df.sort_values('B', ascending=True)
   .drop_duplicates('A', keep='first'))

Output:

    A   B
1   foo 1
6   bar 1

You don’t need a GroupBy in this case because there is no need to run a calculation on each group (see: split-apply-combine pattern).

Answered By: fpersyn

EdChum’s answer may not always work as intended. Instead of first() use nth(0).

The method first() is affected by this bug that has gone unsolved for some years now. Instead of the expected behaviour, first() returns the first element that is not missing in each column within each group, i.e. it ignores NaN values. For example, say you had a third column with some missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})

    A   B   C
0   foo 1   NaN
1   foo 2   X
2   bar 2   Y
3   bar 4   Y
4   bar 1   Y

Using first() here (after sorting, just like EdChum correctly assessed in their answer) will skip over the missing values (note how it is mixing up values from different rows):

df.sort_values('B').groupby('A').first()

    B   C
A       
bar 1   Y
foo 1   X

The correct way to get the full row, including missing values, is to use nth(0), which performs the expected operation:

df.sort_values('B').groupby('A').nth(0)

    B   C
A       
bar 1   Y
foo 1   NaN

For completeness, this bug also affects last(), its correct substitute being nth(-1).
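Here is a small sketch comparing the three calls side by side on the data above. One caveat: from pandas 2.0 onwards nth() behaves as a filter, so nth(0) returns the selected rows with their original index and keeps column A, rather than the A-indexed layout printed above. head(1) is included as an alternative that, as far as I know, returns whole rows under the original index in both old and new versions. The names ordered, via_first, via_nth and via_head are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['1', '2', '2', '4', '1'],
                   'C': [np.nan, 'X', 'Y', 'Y', 'Y']})

ordered = df.sort_values('B')

# first() takes the first non-missing value per column, so for foo it
# combines B from one row with C from another.
via_first = ordered.groupby('A').first()

# nth(0) takes the actual first row of each group, NaN included.
# The shape of the result depends on the pandas version (see note above).
via_nth = ordered.groupby('A').nth(0)

# head(1) also takes the actual first row and keeps the original index.
via_head = ordered.groupby('A').head(1)

print(via_first)
print(via_nth)
print(via_head)
```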

Posting this in an answer since it’s too long for a comment. Not sure this is within the scope of the question but I think it’s relevant to many people looking for this answer (like myself before writing this) and is extremely easy to miss.

Answered By: Dídac Fernández