When is it appropriate to use df.value_counts() vs df.groupby('…').count()?
Question:
I've heard that in Pandas there are often multiple ways to do the same thing, but I was wondering:
If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()?
Answers:
There is a difference in ordering. For value_counts, the documentation says:
The resulting object will be in descending order so that the first element is the most frequently-occurring element.
count does not do this; its output is sorted by the index (created from the column passed to groupby('col')).
df.groupby('colA').count() aggregates all the other columns of df with the count function, so it counts the values in each column, excluding NaNs.
So if you need to count only one column, use:
df.groupby('colA')['colA'].count()
Sample:
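import numpy as np
import pandas as pd

# colA is listed first so the printed column order matches the output below
df = pd.DataFrame({'colA':['c','c','b','a',np.nan,'b','b'],
                   'colB':list('abcdefg'),
                   'colC':[1,3,5,7,np.nan,np.nan,4],
                   'colD':[np.nan,3,6,9,2,4,np.nan]})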
print(df)
  colA colB  colC  colD
0    c    a   1.0   NaN
1    c    b   3.0   3.0
2    b    c   5.0   6.0
3    a    d   7.0   9.0
4  NaN    e   NaN   2.0
5    b    f   NaN   4.0
6    b    g   4.0   NaN
print(df['colA'].value_counts())
b    3
c    2
a    1
Name: colA, dtype: int64
print(df.groupby('colA').count())
      colB  colC  colD
colA
a        1     1     1
b        3     2     2
c        2     2     1
print(df.groupby('colA')['colA'].count())
colA
a    1
b    3
c    2
Name: colA, dtype: int64
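Both calls above silently skip the row where colA is NaN. As a minimal sketch, assuming pandas 1.1 or later (where groupby accepts a dropna argument), the missing key can be kept in either count. The variable names are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['c', 'c', 'b', 'a', np.nan, 'b', 'b'],
                   'colB': list('abcdefg')})

# Default behaviour: the NaN key is dropped from both results
vc_default = df['colA'].value_counts()
gb_default = df.groupby('colA')['colA'].count()

# Keep NaN as its own key. size() counts rows per group, whereas
# count() on colA itself would report 0 for the NaN group.
vc_with_nan = df['colA'].value_counts(dropna=False)
gb_with_nan = df.groupby('colA', dropna=False).size()

print(vc_default.sum(), vc_with_nan.sum())
print(gb_default.sum(), gb_with_nan.sum())
```

The totals differ by exactly one row, the one whose colA is missing.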
groupby and value_counts are quite different functions. In older pandas versions value_counts could only be called on a Series; DataFrame.value_counts was added in pandas 1.1.
On a Series, the sole purpose of value_counts is to return a Series with the frequency of each value.
groupby returns a GroupBy object on which you can run further statistical computations. So df.groupby(col).count() returns the number of non-null values in each of the other columns, for each group defined by the column(s) passed to groupby.
When should value_counts be used, and when should groupby.count be used?
Let's take an example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]})
Groupby count:
df.groupby('color').count()
       id  size
color
b       2     2
g       2     2
r       3     3
groupby count is generally used to get the number of valid (non-null) values in every other column, with respect to the one or more columns specified in groupby. NaN (not a number) values are therefore excluded.
To get the frequencies with groupby, you need to aggregate the grouping column itself, like @jez did. (Perhaps value_counts was implemented to avoid this and make developers' lives easier.)
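If only the number of rows per group is wanted, one alternative to aggregating the grouping column itself is groupby(...).size(). A small sketch on the same example data (the names rows_per_color and freq are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4],
                   'color': ["r", "r", "b", "b", "g", "g", "r"],
                   'size': [1, 2, 1, 2, 1, 3, 4]})

# size() returns one number per group (the row count) as a Series,
# instead of one count per remaining column
rows_per_color = df.groupby('color').size()

# Sorting value_counts by index gives the same labels and counts
freq = df['color'].value_counts().sort_index()

print(rows_per_color.to_dict())
print(freq.to_dict())
```

Unlike count(), size() counts rows regardless of NaNs in the other columns, so the two can disagree when the data has missing values.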
Value Counts:
df['color'].value_counts()
r    3
g    2
b    2
Name: color, dtype: int64
value_counts is generally used to find the frequency of each value in one particular column.
In conclusion:
.groupby(col).count() should be used when you want the number of valid values in each column with respect to the specified col.
.value_counts() should be used to find the frequencies of the values in a Series.
In simple words: DataFrame.value_counts() returns a Series containing the counts of unique rows in the DataFrame, which means it counts how many times each distinct combination of column values occurs.
Imagine we have a DataFrame like:
df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise
Then we apply value_counts to it:
df.value_counts()
first_name  middle_name
Beth        Louise         1
John        Smith          1
dtype: int64
As you can see, by default it did not count the rows containing NA values.
count(), however, counts the non-NA cells for each column or row.
In our example:
df.count()
first_name     4
middle_name    2
dtype: int64
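To make the contrast concrete, here is a minimal sketch on the same data. It assumes pandas 1.3 or later, where DataFrame.value_counts accepts a dropna argument; the variable names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})

rows_without_na = df.value_counts()            # rows containing NA are skipped
rows_with_na = df.value_counts(dropna=False)   # every row is counted
cells_per_column = df.count()                  # non-NA cells in each column

print(rows_without_na.sum(), rows_with_na.sum())
print(cells_per_column.to_dict())
```

value_counts works row by row, while count works column by column, so the two answer different questions even on the same DataFrame.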
There are a lot of good answers here, but I just wanted to add a more concise one:
df.value_counts('col')  # and its syntactic twin df['col'].value_counts()
gives essentially the same result as
df.groupby('col')['col'].count().sort_values(ascending=False)
Both approaches have some additional keyword parameters, and the name of the result and the order of tied values can differ, but as I see it, the gist is that the former is pretty much just syntactic sugar for the latter, when you want to return a Series of the counts of each distinct item in df[col] in descending order.
The reasons to use groupby(...).count() are when you want to count across multiple columns, or as part of a more complex aggregation.
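As a rough check of that claim, the sketch below compares the two forms on the color data from the earlier answer and then shows the kind of multi-statistic aggregation that only groupby covers. The comparison goes through to_dict() on purpose, so that differences in the result's name or in the order of tied values do not matter; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'color': ["r", "r", "b", "b", "g", "g", "r"],
                   'size': [1, 2, 1, 2, 1, 3, 4]})

via_value_counts = df['color'].value_counts()
via_groupby = (df.groupby('color')['color']
                 .count()
                 .sort_values(ascending=False))

# Same labels and counts, ignoring tie order and the Series name
same_counts = via_value_counts.to_dict() == via_groupby.to_dict()
print(same_counts)

# groupby can combine the count with other statistics in one pass
summary = df.groupby('color')['size'].agg(['count', 'mean', 'max'])
print(summary)
```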