Pandas "Can only compare identically-labeled DataFrame objects" error

Question

I’m using Pandas to compare the outputs of two files loaded into two data frames (uat, prod):
…

uat = uat[['Customer Number','Product']]
prod = prod[['Customer Number','Product']]
print uat['Customer Number'] == prod['Customer Number']
print uat['Product'] == prod['Product']
print uat == prod

The first two match exactly:
74357    True
74356    True
Name: Customer Number, dtype: bool
74357    True
74356    True
Name: Product, dtype: bool

For the third print, I get an error:
Can only compare identically-labeled DataFrame objects. If the first two compared fine, what’s wrong with the 3rd?

Thanks

Asked By: user1804633

||

Source

Answer 1

Here’s a small example to demonstrate this (which only applied to DataFrames, not Series, until Pandas 0.19 where it applies to both):

In [1]: df1 = pd.DataFrame([[1, 2], [3, 4]])

In [2]: df2 = pd.DataFrame([[3, 4], [1, 2]], index=[1, 0])

In [3]: df1 == df2
Exception: Can only compare identically-labeled DataFrame objects

One solution is to sort the index first (Note: some functions require sorted indexes):

In [4]: df2.sort_index(inplace=True)

In [5]: df1 == df2
Out[5]: 
      0     1
0  True  True
1  True  True

Note: == is also sensitive to the order of columns, so you may have to use sort_index(axis=1):

In [11]: df1.sort_index().sort_index(axis=1) == df2.sort_index().sort_index(axis=1)
Out[11]: 
      0     1
0  True  True
1  True  True

Note: This can still raise (if the index/columns aren’t identically labelled after sorting).

Answered By: Andy Hayden

Answer 2

You can also try dropping the index column if it is not needed to compare:

print(df1.reset_index(drop=True) == df2.reset_index(drop=True))

I have used this same technique in a unit test like so:

from pandas.util.testing import assert_frame_equal

assert_frame_equal(actual.reset_index(drop=True), expected.reset_index(drop=True))

Answered By: CoreDump

Answer 3

At the time when this question was asked there wasn’t another function in Pandas to test equality, but it has been added a while ago: pandas.equals

You use it like this:

df1.equals(df2)

Some differenes to == are:

You don’t get the error described in the question
It returns a simple boolean.
NaN values in the same location are considered equal
2 DataFrames need to have the same dtype to be considered equal, see this stackoverflow question

EDIT:
As pointed out in @paperskilltrees answer index alignment is important. Apart from the solution provided there another option is to sort the index of the DataFrames before comparing the DataFrames. For df1 that would be df1.sort_index(inplace=True).

Answered By: flyingdutchman

Answer 4

When you compare two DataFrames, you must ensure that the number of records in the first DataFrame matches with the number of records in the second DataFrame. In our example, each of the two DataFrames had 4 records, with 4 products and 4 prices.

If, for example, one of the DataFrames had 5 products, while the other DataFrame had 4 products, and you tried to run the comparison, you would get the following error:

ValueError: Can only compare identically-labeled Series objects

this should work

import pandas as pd
import numpy as np

firstProductSet = {'Product1': ['Computer','Phone','Printer','Desk'],
                   'Price1': [1200,800,200,350]
                   }
df1 = pd.DataFrame(firstProductSet,columns= ['Product1', 'Price1'])


secondProductSet = {'Product2': ['Computer','Phone','Printer','Desk'],
                    'Price2': [900,800,300,350]
                    }
df2 = pd.DataFrame(secondProductSet,columns= ['Product2', 'Price2'])


df1['Price2'] = df2['Price2'] #add the Price2 column from df2 to df1

df1['pricesMatch?'] = np.where(df1['Price1'] == df2['Price2'], 'True', 'False')  #create new column in df1 to check if prices match
df1['priceDiff?'] = np.where(df1['Price1'] == df2['Price2'], 0, df1['Price1'] - df2['Price2']) #create new column in df1 for price diff 
print (df1)

example from https://datatofish.com/compare-values-dataframes/

Answered By: Shaina Raza

Answer 5

Flyingdutchman’s answer is great but wrong: it uses DataFrame.equals, which will return False in your case.
Instead, you want to use DataFrame.eq, which will return True.

It seems that DataFrame.equals ignores the dataframe’s index, while DataFrame.eq uses dataframes’ indexes for alignment and then compares the aligned values. This is an occasion to quote the central gotcha of Pandas:

Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.

As we can see in the following examples, the data alignment is neither broken, nor enforced, unless explicitly requested. So we have three different situations.

No explicit instruction given, as to the alignment: == aka DataFrame.__eq__,


   In [1]: import pandas as pd
   In [2]: df1 = pd.DataFrame(index=[0, 1, 2], data={'col1':list('abc')})
   In [3]: df2 = pd.DataFrame(index=[2, 0, 1], data={'col1':list('cab')})
   In [4]: df1 == df2
   ---------------------------------------------------------------------------
   ...
   ValueError: Can only compare identically-labeled DataFrame objects

Alignment is explicitly broken: DataFrame.equals, DataFrame.values, DataFrame.reset_index(),

    In [5]: df1.equals(df2)
    Out[5]: False

    In [9]: df1.values == df2.values
    Out[9]: 
    array([[False],
           [False],
           [False]])

    In [10]: (df1.values == df2.values).all().all()
    Out[10]: False

Alignment is explicitly enforced: DataFrame.eq, DataFrame.sort_index(),


    In [6]: df1.eq(df2)
    Out[6]: 
       col1
    0  True
    1  True
    2  True

    In [8]: df1.eq(df2).all().all()
    Out[8]: True

My answer is as of pandas version 1.0.3.

Answered By: paperskilltrees

Answer 6

Here I am showing a complete example of how to handle this error. I have added rows with zeros. You can have your dataframes from csv or any other source.

import pandas as pd
import numpy as np


# df1 with 9 rows
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[23,45,12,34,27,44,28,39,40]})

# df2 with 8 rows
df2 = pd.DataFrame({'Name':['John','Mike','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[25,45,14,34,26,44,29,42]})


# get lengths of df1 and df2
df1_len = len(df1)
df2_len = len(df2)


diff = df1_len - df2_len

rows_to_be_added1 = rows_to_be_added2 = 0
# rows_to_be_added1 = np.zeros(diff)

if diff < 0:
    rows_to_be_added1 = abs(diff)
else:
    rows_to_be_added2 = diff
    
# add empty rows to df1
if rows_to_be_added1 > 0:
    df1 = df1.append(pd.DataFrame(np.zeros((rows_to_be_added1,len(df1.columns))),columns=df1.columns))

# add empty rows to df2
if rows_to_be_added2 > 0:
    df2 = df2.append(pd.DataFrame(np.zeros((rows_to_be_added2,len(df2.columns))),columns=df2.columns))

# at this point we have two dataframes with the same number of rows, and maybe different indexes
# drop the indexes of both, so we can compare the dataframes and other operations like update etc.
df2.reset_index(drop=True, inplace=True)
df1.reset_index(drop=True, inplace=True)

# add a new column to df1
df1['New_age'] = None

# compare the Age column of df1 and df2, and update the New_age column of df1 with the Age column of df2 if they match, else None
df1['New_age'] = np.where(df1['Age'] == df2['Age'], df2['Age'], None)

# drop rows where Name is 0.0
df2 = df2.drop(df2[df2['Name'] == 0.0].index)

# now we don't get the error ValueError: Can only compare identically-labeled Series objects

Answered By: Noor Ahmed

Answer 7

I found where the error is coming from in my case:

The problem was that column names list was accidentally enclosed in another list.

Consider following example:

column_names=['warrior','eat','ok','monkeys']

df_good = pd.DataFrame(np.ones(shape=(6,4)),columns=column_names)
df_good['ok'] < df_good['monkeys']

>>> 0    False
    1    False
    2    False
    3    False
    4    False
    5    False

df_bad = pd.DataFrame(np.ones(shape=(6,4)),columns=[column_names])
df_bad ['ok'] < df_bad ['monkeys']

>>> ValueError: Can only compare identically-labeled DataFrame objects

And the thing is you cannot visually distinguish the bad DataFrame from good.

Answered By: Anonymous

Answer 8

In my case i just write directly param columns in creating dataframe, because data from one sql-query was with names, and without in other

Answered By: Solo.dmitry

Pandas "Can only compare identically-labeled DataFrame objects" error

Question:

Answers: