How to count unique records by two columns in pandas?

Question:

I have dataframe in pandas:

In [10]: df
Out[10]:
    col_a    col_b  col_c  col_d
0  France    Paris      3      4
1      UK    Londo      4      5
2      US  Chicago      5      6
3      UK  Bristol      3      3
4      US    Paris      8      9
5      US   London     44      4
6      US  Chicago     12      4

I need to count unique cities. I can count unique states:

In [11]: df['col_a'].nunique()
Out[11]: 3

and I can try to count unique cities:

In [12]: df['col_b'].nunique()
Out[12]: 5

but that is wrong, because Paris in the US and Paris in France are different cities. So now I'm doing it like this:

In [13]: df['col_a_b'] = df['col_a'] + ' - ' + df['col_b']

In [14]: df
Out[14]:
    col_a    col_b  col_c  col_d         col_a_b
0  France    Paris      3      4  France - Paris
1      UK    Londo      4      5      UK - Londo
2      US  Chicago      5      6    US - Chicago
3      UK  Bristol      3      3    UK - Bristol
4      US    Paris      8      9      US - Paris
5      US   London     44      4     US - London
6      US  Chicago     12      4    US - Chicago

In [15]: df['col_a_b'].nunique()
Out[15]: 6

Is there a better way, without creating an additional column?

Asked By: GhostKU


Answers:

You can select col_a and col_b, drop the duplicates, then check the shape/len of the resulting data frame:

df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 6

len(df[['col_a', 'col_b']].drop_duplicates())
# 6
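For reference, here is a self-contained version; the frame below is retyped from the question's output (including its "Londo" spelling):

import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
    'col_c': [3, 4, 5, 3, 8, 44, 12],
    'col_d': [4, 5, 6, 3, 9, 4, 4],
})

# Both forms count the 6 unique (col_a, col_b) combinations
# without adding a helper column.
print(df[['col_a', 'col_b']].drop_duplicates().shape[0])  # 6
print(len(df[['col_a', 'col_b']].drop_duplicates()))      # 6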

Because groupby ignores NaNs and may unnecessarily invoke a sorting process, choose your method accordingly if you have NaNs in the columns.

Consider a data frame like the following:

import numpy as np

# Use numpy directly; the pd.np alias is deprecated and has been
# removed from recent pandas versions.
df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})

print(df)

#   col_a  col_b
#0    1.0    2.0
#1    2.0    2.0
#2    2.0    3.0
#3    NaN    NaN
#4    1.0    2.0
#5    4.0    NaN

Timing:

df = pd.concat([df] * 1000)

%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 µs per loop

%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop

%timeit df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 1000 loops, best of 3: 1.01 ms per loop    

%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop

%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop

Results (on the concatenated data frame):

df.groupby(['col_a', 'col_b']).ngroups
# 3

len(df[['col_a', 'col_b']].drop_duplicates())
# 5

df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 5

len(set(zip(df['col_a'],df['col_b'])))
# 2003

len(df.groupby(['col_a', 'col_b']))
# 2003

So the differences are:

Option 1:

df.groupby(['col_a', 'col_b']).ngroups

is fast, and it excludes rows that contain NaNs.

Option 2 & 3:

len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape[0]

Reasonably fast; these treat NaN as a single value, so rows containing NaNs are deduplicated like any other rows (hence 5 instead of 3).

Option 4 & 5:

len(set(zip(df['col_a'],df['col_b']))) 
len(df.groupby(['col_a', 'col_b'])) 

Slow, and they follow the logic that numpy.nan == numpy.nan is False, so any two (nan, nan) rows are considered different from each other (hence the 2003).
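On pandas 1.1 and later, groupby also accepts dropna=False, which keeps NaN keys as their own groups; a minimal sketch of that option, assuming such a version:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan],
})

# With dropna=False (pandas >= 1.1), NaN keys form their own groups,
# so ngroups agrees with the drop_duplicates count.
print(df.groupby(['col_a', 'col_b'], dropna=False).ngroups)  # 5
print(df.groupby(['col_a', 'col_b']).ngroups)                # 3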

Answered By: Psidom

In [105]: len(df.groupby(['col_a', 'col_b']))
Out[105]: 6

Or by using ngroups:

df.groupby(['col_a', 'col_b']).ngroups
Out[101]: 6

Or using a set:

len(set(zip(df['col_a'],df['col_b'])))
Out[106]: 6
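
On recent pandas versions there is also DataFrame.value_counts, which counts each unique row and, by default, drops rows containing NaN. A sketch, assuming the question's df:

# The length of the per-row value counts is the number of unique pairs.
len(df[['col_a', 'col_b']].value_counts())  # 6
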
Answered By: BENY

Try this: I'm basically subtracting the number of duplicated rows from the total number of rows in df. This assumes we are grouping on all the relevant columns of the df:

df.shape[0] - df[['col_a','col_b']].duplicated().sum()

774 µs ± 603 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
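
As a sanity check on the question's frame (assuming the df from the question): duplicated() marks every repeat after a row's first occurrence, and the only repeat there is the second US / Chicago row, so 7 rows minus 1 duplicate leaves 6.

# duplicated() flags all repeats after the first occurrence.
print(df[['col_a', 'col_b']].duplicated().sum())                # 1
print(df.shape[0] - df[['col_a', 'col_b']].duplicated().sum())  # 6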

Answered By: Aks

import pandas as pd
data = {'field1':[1,4,1,68,9],'field2':[1,1,4,5,9]}
df = pd.DataFrame(data)
results = df.groupby('field1')['field2'].nunique()
results

Output:

field1
1     2
4     1
9     1
68    1
Name: field2, dtype: int64
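
This answers a slightly different question, namely how many unique field2 values each field1 group contains. Applied to the original frame, though, the same idea yields the pair count: summing the per-state numbers of unique cities gives the total number of unique (state, city) combinations, provided neither column contains NaN. A sketch, assuming the question's df:

# France has 1 unique city, UK has 2, US has 3 -> 6 pairs in total.
df.groupby('col_a')['col_b'].nunique().sum()  # 6
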
Answered By: Girish Bhasin