ValueError: Length of values does not match length of index | Pandas DataFrame.unique()

Question:

I am trying to get a new dataset, or change the values of the current dataset's columns to their unique values.
Here is an example of what I am trying to get :

   A B
 -----
0| 1 1
1| 2 5
2| 1 5
3| 7 9
4| 7 9
5| 8 9

Wanted Result    Not Wanted Result
       A B              A B
     -----             -----
    0| 1 1           0| 1 1
    1| 2 5           1| 2 5
    2| 7 9           2| 
    3| 8             3| 7 9
                     4|
                     5| 8
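
For reference, the example frame above can be built like this:

import pandas as pd

# The example frame shown above
df = pd.DataFrame({'A': [1, 2, 1, 7, 7, 8],
                   'B': [1, 5, 5, 9, 9, 9]})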

I don’t really care about the index, but it seems to be the problem.
My code so far is pretty simple; I tried two approaches, one with a new DataFrame and one without.

import pandas as pd

# With a new DataFrame
def UniqueResults(dataframe):
    df = pd.DataFrame()
    for col in dataframe:
        S = pd.Series(dataframe[col].unique())
        df[col] = S.values
    return df

# Without a new DataFrame
def UniqueResults(dataframe):
    for col in dataframe:
        dataframe[col] = dataframe[col].unique()
    return dataframe

Both times, I get the error:

ValueError: Length of values does not match length of index
Asked By: Mayeul sgc


Answers:

The error comes up when you try to assign a list or numpy array of a different length to a data frame. It can be reproduced as follows:

A data frame of four rows:

df = pd.DataFrame({'A': [1,2,3,4]})

Now trying to assign a list/array of two elements to it:

df['B'] = [3,4]   # or df['B'] = np.array([3,4])

Both error out with:

ValueError: Length of values does not match length of index

This is because the data frame has four rows but the list and the array have only two elements.

Workaround (use with caution): convert the list/array to a pandas Series. Then, when you do the assignment, rows whose index is missing from the Series will be filled with NaN:

df['B'] = pd.Series([3,4])

df
#   A     B
#0  1   3.0
#1  2   4.0
#2  3   NaN          # NaN because indices 2 and 3 don't exist in the Series
#3  4   NaN
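
Note that the fill is driven by index alignment, not position. As a quick sketch (column C is hypothetical, just for illustration), a Series with an explicit index lands its values on exactly those rows:

df['C'] = pd.Series([30, 40], index=[2, 3])

df
#   A     B     C
#0  1   3.0   NaN
#1  2   4.0   NaN
#2  3   NaN  30.0
#3  4   NaN  40.0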

For your specific problem, if you don’t care about the index or the correspondence of values between columns, you can reset the index of each column after dropping the duplicates:

df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))

#   A     B
#0  1   1.0
#1  2   5.0
#2  7   9.0
#3  8   NaN
Answered By: Psidom

One way to get around this issue is to keep the unique values of each column in a list and use itertools.zip_longest to transpose the data and pass it into the DataFrame constructor:

from itertools import zip_longest
import pandas as pd

def UniqueResults(dataframe):
    tmp = [dataframe[col].unique() for col in dataframe]
    return pd.DataFrame(zip_longest(*tmp), columns=dataframe.columns)

out = UniqueResults(df)

Output:

   A    B
0  1  1.0
1  2  5.0
2  7  9.0
3  8  NaN
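
To see what the transposition does: zip_longest pads the shorter column with None, which pandas then renders as NaN. A quick illustration on the sample data:

from itertools import zip_longest

tmp = [df['A'].unique(), df['B'].unique()]  # [array([1, 2, 7, 8]), array([1, 5, 9])]
list(zip_longest(*tmp))
# [(1, 1), (2, 5), (7, 9), (8, None)]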

At least for small DataFrames, this seems to be faster (for example on OP’s sample):

%timeit out = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.27 ms ± 50.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit x = UniqueResults(df)
426 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answered By: user7864386

Another simple solution is to turn the approach the OP suggested into a working one: we just need to cast the unique values of each column to a pandas Series:

df1 = df.apply(lambda col: pd.Series(col.unique()))
df1

   A    B
0  1  1.0
1  2  5.0
2  7  9.0
3  8  NaN
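
An equivalent way to spell the same alignment explicitly (just a sketch using pd.concat):

import pandas as pd

# One Series of unique values per column; concat aligns them on the
# default RangeIndex and pads the shorter ones with NaN
df1 = pd.concat({col: pd.Series(df[col].unique()) for col in df}, axis=1)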

Answered By: cottontail