GroupBy pandas DataFrame and select most common value

Question:

I have a data frame with three string columns. I know that only one value in the 3rd column is valid for every combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code:

import pandas as pd
from scipy import stats

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'], 
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})

source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

The last line of code doesn't work; it says KeyError: 'Short name', and if I try to group only by City, I get an AssertionError. How can I fix it?

Asked By: Viacheslav Nefedov


Answers:

For agg, the lambda function gets a Series, which does not have a 'Short name' attribute.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.
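For reference, a quick check of what stats.mode returns (a sketch based on older SciPy, where the result is a ModeResult of two arrays; newer SciPy changed the output shape via the keepdims parameter and no longer accepts non-numeric data):

from scipy import stats

res = stats.mode(['NY', 'New', 'NY'])
print(res)        # ModeResult(mode=array(['NY'], dtype='<U3'), count=array([2]))
print(res[0][0])  # 'NY' -- the first element of the first array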

With these two simple changes:

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
Answered By: eumiro

You can use value_counts() to get a count Series and take the index of its first row:

source.groupby(['Country','City']).agg(lambda x: x.value_counts().index[0])
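To see why this works, note that the lambda receives each group as a Series, and value_counts() sorts by count in descending order, so .index[0] is (one of) the most frequent values. A minimal sketch of one group:

import pandas as pd

s = pd.Series(['NY', 'New', 'NY'])  # one group's 'Short name' values
print(s.value_counts())             # NY: 2, New: 1 (sorted descending)
print(s.value_counts().index[0])    # 'NY'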

In case you want to perform other aggregate functions within the same .agg(), try this:

# Let's add a new col, "account"
source['account'] = [1, 2, 3, 3]

source.groupby(['Country','City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'))
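On the OP's data this should produce something like (named aggregation needs pandas >= 0.25):

                          mod  avg
Country City                      
Russia  Sankt-Petersburg  Spb  3.0
USA     New-York           NY  2.0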
Answered By: HYRY

A little late to the game here, but I was running into some performance issues with HYRY’s solution, so I had to come up with another one.

It works by finding the frequency of each key-value, and then, for each key, only keeping the value that appears with it most often.

There’s also an additional solution that supports multiple modes.

On a scale test that’s representative of the data I’m working with, this reduced runtime from 37.4s to 0.5s!

Here’s the code for the solution, some example usage, and the scale test:

import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              'C'    ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])

def mode(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the mode.

    The output is a DataFrame with a record per group that has at least one mode
    (null values are not counted). The `key_cols` are included as columns, `value_col`
    contains a mode (ties are broken arbitrarily and deterministically) for each
    group, and `count_col` indicates how many times each mode appeared in its group.
    '''
    return (df.groupby(key_cols + [value_col]).size()   # count each (key, value) pair
              .to_frame(count_col).reset_index()        # counts as a regular column
              .sort_values(count_col, ascending=False)  # most frequent first
              .drop_duplicates(subset=key_cols))        # keep the top row per key

def modes(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the modes.

    The output is a DataFrame with a record per group that has at least
    one mode (null values are not counted). The `key_cols` are included as
    columns, `value_col` contains lists indicating the modes for each group,
    and `count_col` indicates how many times each mode appeared in its group.
    '''
    return (df.groupby(key_cols + [value_col]).size()              # count each (key, value) pair
              .to_frame(count_col).reset_index()                   # counts as a regular column
              .groupby(key_cols + [count_col])[value_col].unique() # collect values per (key, count)
              .to_frame().reset_index()
              .sort_values(count_col, ascending=False)             # most frequent first
              .drop_duplicates(subset=key_cols))                   # keep the top rows per key

print(test_input)
print(mode(test_input, ['key'], 'value', 'count'))
print(modes(test_input, ['key'], 'value', 'count'))

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print(time.time() - start)

Running this code will print something like:

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

Hope this helps!

Answered By: abw333

Formally, the correct answer is @eumiro's solution.
The problem with @HYRY's solution is that when you have a sequence of numbers like [1, 2, 3, 4], the result is wrong, i.e., you don't get the mode.
Example:

>>> import pandas as pd
>>> df = pd.DataFrame(
        {
            'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'], 
            'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 
            'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
        }
    )

If you compute like @HYRY you obtain:

>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client            
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

This is clearly wrong (see the A value, which should be 1 and not 4) because it can't handle unique values.

Thus, the other solution is correct:

>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client            
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20
Answered By: nunodsousa

A slightly clumsier but faster approach for larger datasets involves getting the counts for a column of interest, sorting the counts highest to lowest, and then de-duplicating on a subset to retain only the largest cases. Here is a code example:

>>> import pandas as pd
>>> source = pd.DataFrame(
        {
            'Country': ['USA', 'USA', 'Russia', 'USA'], 
            'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
            'Short name': ['NY', 'New', 'Spb', 'NY']
        }
    )
>>> grouped_df = (source
        .groupby(['Country','City','Short name'])[['Short name']]
        .count()
        .rename(columns={'Short name':'count'})
        .reset_index()
        .sort_values('count', ascending=False)
        .drop_duplicates(subset=['Country', 'City'])
        .drop('count', axis=1))
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb
Answered By: Dimitri

The problem here is performance; if you have a lot of rows, it will be an issue.

If that is your case, please try this:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby(['Country','City']).first()
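Note that .first() on the counts keeps only each group's top count; a variant that keeps the mode values themselves might look like this (a sketch, not part of the original answer, assuming a pandas version where groupby accepts index level names):

(source.groupby(['Country', 'City'])['Short_name']
       .value_counts()                 # counts, sorted descending within each group
       .groupby(['Country', 'City'])   # regroup by the index levels
       .head(1)                        # keep each group's top row
       .reset_index(name='count'))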
Answered By: Diego Perez Sastre

If you want another approach for solving it that does not depend on value_counts or scipy.stats, you can use the Counter collection:

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

Which can be applied to the above example like this

import pandas as pd

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)
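Counter(values).items() pairs each value with its count, and max(..., key=lambda x: x[1]) picks the pair with the highest count (ties go to whichever value was counted first). On this example the result should match the value_counts approach:

                         Short_name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY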
Answered By: kmader

Pandas >= 0.16

pd.Series.mode is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If this is needed as a DataFrame, use

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output.
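A quick check (a sketch):

import pandas as pd

# Even for a single mode, the result is a Series, not a scalar:
print(pd.Series(['NY', 'NY', 'New']).mode())
# 0    NY
# dtype: object

It is also faster: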

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dealing with Multiple Modes

Series.mode also does a good job when there are multiple modes:

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job:
source2 = pd.concat(
    [source, pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
    ignore_index=True)

# Now `source2` has two modes for the 
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don't care which mode is returned as long as it's one of them, you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

Alternatives to (not) consider

You can also use statistics.mode from the Python standard library, but…

import statistics

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

…it does not work well when there are multiple modes; a StatisticsError is raised. This is mentioned in the docs:

If data is empty, or if there is not exactly one most common value,
StatisticsError is raised.

But you can see for yourself…

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
Answered By: cs95

The two top answers here suggest:

df.groupby(cols).agg(lambda x:x.value_counts().index[0])

or, preferably

df.groupby(cols).agg(pd.Series.mode)

However both of these fail in simple edge cases, as demonstrated here:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan]
})

The first:

df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

yields IndexError (because of the empty Series returned by group C). The second:

df.groupby(['client_id', 'date']).agg(pd.Series.mode)

returns ValueError: Function does not reduce, since the first group returns a list of two (since there are two modes). (As documented here, if the first group returned a single mode this would work!)

Two possible solutions for this case are:

import scipy.stats
df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 in the comments here:

def foo(x):
    m = pd.Series.mode(x)
    return m.values[0] if not m.empty else np.nan

df.groupby(['client_id', 'date']).agg(foo)

However, all of these are slow and not suited for large datasets. A solution I ended up using, which (a) can deal with these cases and (b) is much, much faster, is a lightly modified version of abw333's answer (which should be higher):

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group 
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.nan))  # restore NaNs

group_cols = ['client_id', 'date']    
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

Essentially, the method works on one column at a time and outputs a DataFrame, so instead of concat, which is intensive, you treat the first as a DataFrame and then iteratively add the output array (.values) as a column.
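On the edge-case df above, this should yield something like (a sketch; the two-way tie between NY and LA for group A is broken arbitrarily by the sort order):

                      location
client_id date                
A         2019-01-01        NY
B         2019-01-01        DC
C         2019-01-01       NaN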

Answered By: Josh Friedlander

If you don't want to include NaN values, using Counter is much, much faster than pd.Series.mode or pd.Series.value_counts()[0]:

from collections import Counter

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df.groupby(col).agg(get_most_common)

should work. This will fail when you have NaN values, as each NaN will be counted separately.
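A quick illustration of the NaN caveat (a sketch): distinct NaN objects compare unequal to each other, so Counter cannot merge them into one key. The dropna workaround below is an assumption, not part of the original answer.

from collections import Counter

# Two distinct NaN objects end up as two separate keys:
print(Counter([float('nan'), float('nan'), 'A']).most_common())
# [(nan, 1), (nan, 1), ('A', 1)]

# Possible workaround: drop NaNs before counting.
def get_most_common_dropna(srs):
    return Counter(srs.dropna()).most_common(1)[0][0]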

Answered By: irene

Use DataFrame.value_counts for a fast solution

The top 3 answers here:

  • source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
  • source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
  • source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0])

are incredibly slow for large datasets.

The solution using collections.Counter is much faster (20-40 times faster than the top 3 methods)

  • source.groupby(['Country', 'City'])['Short name'].agg(lambda srs: Counter(list(srs)).most_common(1)[0][0])

but still very slow.

Solutions by abw333 and Josh Friedlander are much faster (about 10 times faster than the method using Counter). These solutions can be further optimized by using value_counts instead (DataFrame.value_counts is available since pandas 1.1.0).

(source
 .value_counts(['Country', 'City', 'Short name'])
 .pipe(lambda x: x[~x.droplevel('Short name').index.duplicated()])
 .reset_index(name='Count'))

To make the function account for NaNs like in Josh Friedlander's function, simply turn off the dropna parameter:

(source
 .value_counts(['Country', 'City', 'Short name'], dropna=False)
 .pipe(lambda x: x[~x.droplevel('Short name').index.duplicated()])
 .reset_index(name='Count'))

Using abw333's setup, if we test the runtime difference for a DataFrame with 1 million rows, value_counts is about 10% faster than abw333's solution.

scale_test_data = [[random.randint(1, 100),
                    str(random.randint(100, 900)), 
                    str(random.randint(0,2))] for i in range(1000000)]
source = pd.DataFrame(data=scale_test_data, columns=['Country', 'City', 'Short name'])
keys = ['Country', 'City']
vals = ['Short name']

%timeit source.value_counts(keys+vals).pipe(lambda x: x[~x.droplevel(vals).index.duplicated()]).reset_index(name='Count')
# 376 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit mode(source, ['Country', 'City'], 'Short name', 'Count')
# 415 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

For easy use, I wrapped this solution in a function that you can readily copy-paste and use in your own environment. This function can find group modes of multiple columns as well.

def get_groupby_modes(source, keys, values, dropna=True, return_counts=False):
    """
    A function that groups a pandas dataframe by some of its columns (keys) and 
    returns the most common value of each group for some of its columns (values).
    The output is sorted by the counts of the first column in values (because it
    uses pd.DataFrame.value_counts internally).
    An equivalent one-liner if values is a singleton list is:
    (
        source
        .value_counts(keys+values)
        .pipe(lambda x: x[~x.droplevel(values).index.duplicated()])
        .reset_index(name=f"{values[0]}_count")
    )
    If there are multiple modes for some group, it returns the value with the 
    lowest Unicode value (because under the hood, it drops duplicate indexes in a 
    sorted dataframe), unlike, e.g. df.groupby(keys)[values].agg(pd.Series.mode).
    Must have Pandas 1.1.0 or later for the function to work and must have 
    Pandas 1.3.0 or later for the dropna parameter to work.
    -----------------------------------------------------------------------------
    Parameters:
    -----------
    source: pandas dataframe.
        A pandas dataframe with at least two columns.
    keys: list.
        A list of column names of the pandas dataframe passed as source. It is 
        used to determine the groups for the groupby.
    values: list.
        A list of column names of the pandas dataframe passed as source. 
        If it is a singleton list, the output contains the mode of each group 
        for this column. If it is a list longer than 1, then the modes of each 
        group for the additional columns are assigned as new columns.
    dropna: bool, default: True.
        Whether to count NaN values as the same or not. If True, NaN values are 
        treated by their default property, NaN != NaN. If False, NaN values in 
        each group are counted as the same values (NaN could potentially be a 
        most common value).
    return_counts: bool, default: False.
        Whether to include the counts of each group's mode. If True, the output 
        contains a column for the counts of each mode for every column in values. 
        If False, the output only contains the modes of each group for each 
        column in values.
    -----------------------------------------------------------------------------
    Returns:
    --------
    a pandas dataframe.
    -----------------------------------------------------------------------------
    Example:
    --------
    get_groupby_modes(source=df, 
                      keys=df.columns[:2].tolist(), 
                      values=df.columns[-2:].tolist(), 
                      dropna=True,
                      return_counts=False)
    """
    
    def _get_counts(df, keys, v, dropna):
        c = df.value_counts(keys+v, dropna=dropna)
        return c[~c.droplevel(v).index.duplicated()]
    
    counts = _get_counts(source, keys, values[:1], dropna)
    
    if len(values) == 1:
        if return_counts:
            final = counts.reset_index(name=f"{values[0]}_count")
        else:
            final = counts.reset_index()[keys+values[:1]]
    else:
        final = counts.reset_index(name=f"{values[0]}_count", level=values[0])
        if not return_counts:
            final = final.drop(columns=f"{values[0]}_count")
        for v in values:
            counts = _get_counts(source, keys, [v], dropna).reset_index(level=v)
            if return_counts:
                final[[v, f"{v}_count"]] = counts
            else:
                final[v] = counts[v]
        final = final.reset_index()
    return final
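For example, on the OP's source this should give (a usage sketch):

import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})

print(get_groupby_modes(source, keys=['Country', 'City'], values=['Short name']))
#   Country              City Short name
# 0     USA          New-York         NY
# 1  Russia  Sankt-Petersburg        Spb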
Answered By: cottontail

Instead of using ".agg", try ".apply", which is faster and gives results across the columns.

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short name' : ['NY','New','Spb','NY']})
source.groupby(['Country', 'City'])['Short name'].apply(pd.Series.mode).reset_index()
Answered By: Adeel Afzal

To always return all modes (single or multiple alike) with .agg, you can create a function that returns the mode(s) as a list.

df.agg(lambda x: x.mode().to_list())

def lmode(x):
    return x.mode().to_list()

df.agg(lmode)

If you prefer single modes to be returned as scalars, you can use the following function:

def lmode(x):
    a = x.mode()
    return a.to_list() if len(a) > 1 else a.squeeze()

Benefits:

  • Returns all modes
    • Single mode as scalar and multiple modes as list
  • Works with groupby and agg
  • Can be combined with other aggregates (e.g. df.agg([lmode, 'nunique']))
  • Returns lmode instead of lambda as aggregate name
  • Doesn't raise an error when a group's mode is np.nan; returns [] instead

Example with Multiple Aggregates

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [1, 3, 2, np.nan, 3, 1],
    's': ['a', 'a', 'b', 'c', 'c', np.nan],
})

def lmode(x):
    a = x.mode()
    return a.to_list() if len(a) > 1 else a.squeeze()

# Combined aggregates with multiple modes
print(df.agg([lmode, 'nunique']))
                  i       s
lmode    [1.0, 3.0]  [a, c]
nunique           3       3

Example from OP

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})

source.groupby(['Country','City']).agg(lmode)
                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
Answered By: Benjamin Ziepert