How to get the number of the most frequent value in a column?
Question:
I have a data frame and I would like to know how many times a given column has the most frequent value.
I try to do it in the following way:
items_counts = df['item'].value_counts()
max_item = items_counts.max()
As a result I get:
ValueError: cannot convert float NaN to integer
As far as I understand, with the first line I get series in which the values from a column are used as key and frequency of these values are used as values. So, I just need to find the largest value in the series and, because of some reason, it does not work. Does anybody know how this problem can be solved?
Answers:
It looks like you may have some nulls in the column. You can drop them with df = df.dropna(subset=['item'])
. Then df['item'].value_counts().max()
should give you the max counts, and df['item'].value_counts().idxmax()
should give you the most frequent value.
You may also consider using scipy’s mode
function which ignores NaN. A solution using it could look like:
from scipy.stats import mode
from numpy import nan
df = DataFrame({"a": [1,2,2,4,2], "b": [nan, nan, nan, 3, 3]})
print mode(df)
The output would look like
(array([[ 2., 3.]]), array([[ 3., 2.]]))
meaning that the most common values are 2
for the first columns and 3
for the second, with frequencies 3
and 2
respectively.
To continue to @jonathanrocher answer you could use mode
in pandas DataFrame. It’ll give a most frequent values (one or two) across the rows or columns:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [1,2,2,4,2], "b": [np.nan, np.nan, np.nan, 3, 3]})
In [2]: df.mode()
Out[2]:
a b
0 2 3.0
Just take the first row of your items_counts
series:
top = items_counts.head(1) # or items_counts.iloc[[0]]
value, count = top.index[0], top.iat[0]
This works because pd.Series.value_counts
has sort=True
by default and so is already ordered by counts, highest count first. Extracting a value from an index by location has O(1) complexity, while pd.Series.idxmax
has O(n) complexity where n is the number of categories.
Specifying sort=False
is still possible and then idxmax
is recommended:
items_counts = df['item'].value_counts(sort=False)
top = items_counts.loc[[items_counts.idxmax()]]
value, count = top.index[0], top.iat[0]
Notice in this case you don’t need to call max
and idxmax
separately, just extract the index via idxmax
and feed to the loc
label-based indexer.
Add this line of code to find the most frequent value
df["item"].value_counts().nlargest(n=1).values[0]
The NaN values are omitted for calculating frequencies.
Please check your code functionality here
But you can use the below code for same functionality.
**>> Code:**
# Importing required module
from collections import Counter
# Creating a dataframe
df = pd.DataFrame({ 'A':["jan","jan","jan","mar","mar","feb","jan","dec",
"mar","jan","dec"] })
# Creating a counter object
count = Counter(df['A'])
# Calling a method of Counter object(count)
count.most_common(3)
**>> Output:**
[('jan', 5), ('mar', 3), ('dec', 2)]
I have a data frame and I would like to know how many times a given column has the most frequent value.
I try to do it in the following way:
items_counts = df['item'].value_counts()
max_item = items_counts.max()
As a result I get:
ValueError: cannot convert float NaN to integer
As far as I understand, with the first line I get series in which the values from a column are used as key and frequency of these values are used as values. So, I just need to find the largest value in the series and, because of some reason, it does not work. Does anybody know how this problem can be solved?
It looks like you may have some nulls in the column. You can drop them with df = df.dropna(subset=['item'])
. Then df['item'].value_counts().max()
should give you the max counts, and df['item'].value_counts().idxmax()
should give you the most frequent value.
You may also consider using scipy’s mode
function which ignores NaN. A solution using it could look like:
from scipy.stats import mode
from numpy import nan
df = DataFrame({"a": [1,2,2,4,2], "b": [nan, nan, nan, 3, 3]})
print mode(df)
The output would look like
(array([[ 2., 3.]]), array([[ 3., 2.]]))
meaning that the most common values are 2
for the first columns and 3
for the second, with frequencies 3
and 2
respectively.
To continue to @jonathanrocher answer you could use mode
in pandas DataFrame. It’ll give a most frequent values (one or two) across the rows or columns:
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [1,2,2,4,2], "b": [np.nan, np.nan, np.nan, 3, 3]})
In [2]: df.mode()
Out[2]:
a b
0 2 3.0
Just take the first row of your items_counts
series:
top = items_counts.head(1) # or items_counts.iloc[[0]]
value, count = top.index[0], top.iat[0]
This works because pd.Series.value_counts
has sort=True
by default and so is already ordered by counts, highest count first. Extracting a value from an index by location has O(1) complexity, while pd.Series.idxmax
has O(n) complexity where n is the number of categories.
Specifying sort=False
is still possible and then idxmax
is recommended:
items_counts = df['item'].value_counts(sort=False)
top = items_counts.loc[[items_counts.idxmax()]]
value, count = top.index[0], top.iat[0]
Notice in this case you don’t need to call max
and idxmax
separately, just extract the index via idxmax
and feed to the loc
label-based indexer.
Add this line of code to find the most frequent value
df["item"].value_counts().nlargest(n=1).values[0]
The NaN values are omitted for calculating frequencies.
Please check your code functionality here
But you can use the below code for same functionality.
**>> Code:**
# Importing required module
from collections import Counter
# Creating a dataframe
df = pd.DataFrame({ 'A':["jan","jan","jan","mar","mar","feb","jan","dec",
"mar","jan","dec"] })
# Creating a counter object
count = Counter(df['A'])
# Calling a method of Counter object(count)
count.most_common(3)
**>> Output:**
[('jan', 5), ('mar', 3), ('dec', 2)]