Pandas dataFrame.nunique() : ("unhashable type : 'list'", 'occured at index columns')
Question:
I want to apply the .nunique() function to a full dataFrame.
On the following screenshot, we can see that it contains 130 features. Screenshot of shape and columns of the dataframe.
The goal is to get the number of different values per feature.
I use the following code (that worked on another dataFrame).
def nbDifferentValues(data):
total = data.nunique()
total = total.sort_values(ascending=False)
percent = (total/data.shape[0]*100)
return pd.concat([total, percent], axis=1, keys=['Total','Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
And the code fails at the first line and I get the following error which I don’t know how to solve (“unhashable type : ‘list'”, ‘occured at index columns’):
Trace of the error
Answers:
You probably have a column whose content are lists.
Since lists in Python are mutable they are unhashable.
import pandas as pd
df = pd.DataFrame([
(0, [1,2]),
(1, [2,3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don’t use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
(0, (1,2)),
(1, (2,3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
I have come across this problem with .nunique()
when converting results from a Rest API from dict (or list) to pandas dataframe. The problem is that one of the columns is stored as a list or dict (common situation in nested json results). Here is a sample code to remove the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} n")
print(f"Null values per column: n{df.isna().sum()} n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
try:
df[each_col].nunique()
ls_cols_nunique.append(each_col)
except:
ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: n{df[ls_cols_nunique].nunique()} n")
print(f"Columns error nunique: n{ls_cols_error_nunique} n")
This code should split your dataframe columns into 2 lists:
- Column that can calculate
.nunique()
- Column that errors when running
.nunique()
Then just calculate the .nunique()
on the columns without errors.
As far as converting the columns with errors, there are other resources that address that with .apply(pd.series)
.
To get nunique
or unique
in a pandas.Series
, my preferred approaches are
Quick Approach
NOTE: It wouldn’t hurt if the col values are lists and string type. Also, nested lists might needed to be flattened.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# Lambda will save if col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _i])).unique()
Bonus
df.COL.apply(json.dumps)
might handle all the cases.
OP’s solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I want to apply the .nunique() function to a full dataFrame.
On the following screenshot, we can see that it contains 130 features. Screenshot of shape and columns of the dataframe.
The goal is to get the number of different values per feature.
I use the following code (that worked on another dataFrame).
def nbDifferentValues(data):
total = data.nunique()
total = total.sort_values(ascending=False)
percent = (total/data.shape[0]*100)
return pd.concat([total, percent], axis=1, keys=['Total','Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
And the code fails at the first line and I get the following error which I don’t know how to solve (“unhashable type : ‘list'”, ‘occured at index columns’):
Trace of the error
You probably have a column whose content are lists.
Since lists in Python are mutable they are unhashable.
import pandas as pd
df = pd.DataFrame([
(0, [1,2]),
(1, [2,3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don’t use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
(0, (1,2)),
(1, (2,3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
I have come across this problem with .nunique()
when converting results from a Rest API from dict (or list) to pandas dataframe. The problem is that one of the columns is stored as a list or dict (common situation in nested json results). Here is a sample code to remove the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} n")
print(f"Null values per column: n{df.isna().sum()} n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
try:
df[each_col].nunique()
ls_cols_nunique.append(each_col)
except:
ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: n{df[ls_cols_nunique].nunique()} n")
print(f"Columns error nunique: n{ls_cols_error_nunique} n")
This code should split your dataframe columns into 2 lists:
- Column that can calculate
.nunique()
- Column that errors when running
.nunique()
Then just calculate the .nunique()
on the columns without errors.
As far as converting the columns with errors, there are other resources that address that with .apply(pd.series)
.
To get nunique
or unique
in a pandas.Series
, my preferred approaches are
Quick Approach
NOTE: It wouldn’t hurt if the col values are lists and string type. Also, nested lists might needed to be flattened.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# Lambda will save if col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _i])).unique()
Bonus
df.COL.apply(json.dumps)
might handle all the cases.
OP’s solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...