Pandas DataFrame.nunique(): ("unhashable type: 'list'", 'occurred at index columns')

Question:

I want to apply the .nunique() function to a full DataFrame.

On the following screenshot, we can see that it contains 130 features: [screenshot of the DataFrame's shape and columns].
The goal is to get the number of different values per feature.
I use the following code (which worked on another DataFrame):

import pandas as pd

def nbDifferentValues(data):
    # number of distinct values per column, sorted in descending order
    total = data.nunique()
    total = total.sort_values(ascending=False)
    # distinct values as a percentage of the number of rows
    percent = (total / data.shape[0] * 100)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])

diffValues = nbDifferentValues(dataFrame)

The code fails at the first line, and I get the following error, which I don't know how to solve: ("unhashable type: 'list'", 'occurred at index columns').
[Trace of the error]

Asked By: Thomas Coquereau


Answers:

You probably have a column whose contents are lists.

Since lists in Python are mutable, they are unhashable.

import pandas as pd

df = pd.DataFrame([
    (0, [1,2]),
    (1, [2,3])    
])

#  raises "unhashable type : 'list'" error
df.nunique()
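
One quick way to locate which column actually holds the lists (a minimal sketch using the toy df above; checking cell types with isinstance is just one option):

# columns whose cells are Python lists -> these are what break .nunique()
list_cols = [c for c in df.columns
             if df[c].apply(lambda v: isinstance(v, list)).any()]
print(list_cols)   # [1]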

SOLUTION: Don’t use mutable structures (like lists) in your dataframe:

df = pd.DataFrame([
    (0, (1,2)),
    (1, (2,3))    
])

df.nunique()

#  0    2
#  1    2
#  dtype: int64
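
If the DataFrame already holds a list column, one possible workaround (a minimal sketch reusing the toy df above) is to convert the lists to hashable tuples before calling .nunique():

import pandas as pd

df = pd.DataFrame([
    (0, [1, 2]),
    (1, [2, 3])
])

# convert the mutable lists to hashable tuples, then .nunique() works
df[1] = df[1].apply(tuple)
print(df.nunique())
#  0    2
#  1    2
#  dtype: int64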
Answered By: raul ferreira

I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas DataFrame. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is some sample code to identify and skip the columns causing the error.

# this is the dataframe that is causing your issues
df = data.copy()

print(f"Rows and columns: {df.shape}\n")
print(f"Null values per column:\n{df.isna().sum()}\n")

# check which columns raise an error when counting the number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except TypeError:
        ls_cols_error_nunique.append(each_col)

print(f"Unique values per column:\n{df[ls_cols_nunique].nunique()}\n")
print(f"Columns that error on nunique:\n{ls_cols_error_nunique}\n")

This code splits your DataFrame's columns into 2 lists:

  1. Columns on which .nunique() can be calculated
  2. Columns that raise an error when running .nunique()

Then just calculate the .nunique() on the columns without errors.

As for converting the columns with errors, there are other resources that address that with .apply(pd.Series).
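
For reference, a minimal sketch of that kind of conversion (the column name col_dict and the sample data here are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "id": [0, 1],
    "col_dict": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],  # hypothetical nested column
})

# expand the nested column into plain scalar columns, then .nunique() works
expanded = df["col_dict"].apply(pd.Series)
df = pd.concat([df.drop(columns="col_dict"), expanded], axis=1)
print(df.nunique())
#  id    2
#  a     2
#  b     2
#  dtype: int64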

Answered By: user_stack_overflow

To get nunique or unique values in a pandas.Series, my preferred approaches are:

Quick Approach

NOTE: This works even if the column values are a mix of lists and strings. Also, nested lists might need to be flattened first (see the sketch after the snippets below).

_unique_items = df.COL_LIST.explode().unique()

or

_unique_count = df.COL_LIST.explode().nunique()
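
If the lists are nested, a minimal sketch of the flattening (the column name COL_NESTED and its values are made up here) is one .explode() per nesting level:

import pandas as pd

df = pd.DataFrame({"COL_NESTED": [[[1, 2], [3]], [[2, 3]]]})  # hypothetical nested lists

# one .explode() per nesting level flattens the values down to scalars
flat = df.COL_NESTED.explode().explode()
print(flat.nunique())   # 3 -> the unique scalars are 1, 2, 3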

Alternate Approach

Alternatively, if I do not wish to explode the items:

# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()

# Use a lambda if col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()

Bonus

df.COL.apply(json.dumps) (after an import json) might handle all of these cases.
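
A minimal sketch of that idea (the column name COL and the sample values are made up here):

import json

import pandas as pd

df = pd.DataFrame({"COL": [[1, 2], {"a": 3}, "text"]})  # hypothetical mixed-type column

# serialising every value to a JSON string makes it hashable
print(df.COL.apply(json.dumps).nunique())   # 3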

OP’s solution

df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
Answered By: miRastic