Equality in Pandas DataFrames – Column Order Matters?
Question:
As part of a unit test, I need to test two DataFrames for equality. The order of the columns in the DataFrames is not important to me. However, it seems to matter to Pandas:
import pandas
df1 = pandas.DataFrame(index = [1,2,3,4])
df2 = pandas.DataFrame(index = [1,2,3,4])
df1['A'] = [1,2,3,4]
df1['B'] = [2,3,4,5]
df2['B'] = [2,3,4,5]
df2['A'] = [1,2,3,4]
df1 == df2
Results in:
Exception: Can only compare identically-labeled DataFrame objects
I believe the expression df1 == df2
should evaluate to a DataFrame containing all True
values. Obviously it’s debatable what the correct functionality of ==
should be in this context. My question is: Is there a Pandas method that does what I want? That is, is there a way to do equality comparison that ignores column order?
Answers:
You could sort the columns using sort_index
:
df1.sort_index(axis=1) == df2.sort_index(axis=1)
This will evaluate to a dataframe of all True
values.
As @osa comments this fails for NaN’s and isn’t particularly robust either, in practise using something similar to @quant’s answer is probably recommended (Note: we want a bool rather than raise if there’s an issue):
def my_equal(df1, df2):
from pandas.util.testing import assert_frame_equal
try:
assert_frame_equal(df1.sort_index(axis=1), df2.sort_index(axis=1), check_names=True)
return True
except (AssertionError, ValueError, TypeError): perhaps something else?
return False
def equal( df1, df2 ):
""" Check if two DataFrames are equal, ignoring nans """
return df1.fillna(1).sort_index(axis=1).eq(df2.fillna(1).sort_index(axis=1)).all().all()
The most common intent is handled like this:
def assertFrameEqual(df1, df2, **kwds ):
""" Assert that two dataframes are equal, ignoring ordering of columns"""
from pandas.util.testing import assert_frame_equal
return assert_frame_equal(df1.sort_index(axis=1), df2.sort_index(axis=1), check_names=True, **kwds )
Of course see pandas.util.testing.assert_frame_equal
for other parameters you can pass
Sorting column only works if the row and column labels match across the frames. Say, you have 2 dataframes with identical values in cells but with different labels,then the sort solution will not work. I ran into this scenario when implementing k-modes clustering using pandas.
I got around it with a simple equals function to check cell equality(code below)
def frames_equal(df1,df2) :
if not isinstance(df1,DataFrame) or not isinstance(df2,DataFrame) :
raise Exception(
"dataframes should be an instance of pandas.DataFrame")
if df1.shape != df2.shape:
return False
num_rows,num_cols = df1.shape
for i in range(num_rows):
match = sum(df1.iloc[i] == df2.iloc[i])
if match != num_cols :
return False
return True
have you tried using df1.equals(df2)? i think it’s more reliable that df1==df2, though i’m not sure if it will resolve your issues with column order.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html
Usually you’re going to want speedy tests and the sorting method can be brutally inefficient for larger indices (like if you were using rows instead of columns for this problem). The sort method is also susceptible to false negatives on non-unique indices.
Fortunately, pandas.util.testing.assert_frame_equal
has since been updated with a check_like
option. Set this to true and the ordering will not be considered in the test.
With non-unique indices, you’ll get the cryptic ValueError: cannot reindex from a duplicate axis
. This is raised by the under-the-hood reindex_like
operation that rearranges one of the DataFrames to match the other’s order. Reindexing is much faster than sorting as evidenced below.
import pandas as pd
from pandas.util.testing import assert_frame_equal
df = pd.DataFrame(np.arange(1e6))
df1 = df.sample(frac=1, random_state=42)
df2 = df.sample(frac=1, random_state=43)
%timeit -n 1 -r 5 assert_frame_equal(df1.sort_index(), df2.sort_index())
## 5.73 s ± 329 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%timeit -n 1 -r 5 assert_frame_equal(df1, df2, check_like=True)
## 1.04 s ± 237 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
For those who enjoy a good performance comparison plot:
Reindexing vs sorting on int and str indices (str even more drastic)
When working with dataframes containing python objects such as tuples and lists df.eq(df2)
and df == df2
will not suffice. Even if a the same cells in each dataframes contain the same object, such as (0, 0)
, the equality comparison will result to False
. To get around this, convert all columns to strings before comparison:
df.apply(lambda x: x.astype(str)).eq(df2.apply(lambda x: x.astype(str)))
Probably you may need function to compare DataFrames ignoring both row and column order? Only requirement is to have some unique column to use it as index.
f1 = pd.DataFrame([
{"id": 1, "foo": "1", "bar": None},
{"id": 2, "foo": "2", "bar": 2},
{"id": 3, "foo": "3", "bar": 3},
{"id": 4, "foo": "4", "bar": 4}
])
f2 = pd.DataFrame([
{"id": 3, "foo": "3", "bar": 3},
{"id": 1, "bar": None, "foo": "1",},
{"id": 2, "foo": "2", "bar": 2},
{"id": 4, "foo": "4", "bar": 4}
])
def comparable(df, index_col='id'):
return df.fillna(value=0).set_index(index_col).to_dict('index')
comparable(f1) == comparable (f2) # returns True
assert_frame_equal
from pandas.testing
is a function which checks frames equality.
as mentioned in assert_frame_equal
documentation, if set check_like
parameter to True
, it will ignore order of index and columns.
One simple way to do it is
(Assuming df1 and df2)
Check df1 and df2 columns are the same using a set (as order doesn’t matter in a set)
columns_check = set(df1.columns) == set(df2.columns)
If they are different then the dataframe must be different.
If they are the same then you can use the columns of df2 to reorder the columns of df1
contents_check = df1[df2.columns].equals(df2)
The easiest would be:
if set(df1.columns) == set(df2.columns):
equality_check = df1[df2.columns].equals(df2)
else:
equality_check = False
or you could even:
equality_check = False
if set(df1.columns) == set(df2.columns):
equality_check = df1[df2.columns].equals(df2)
If you don’t like that else
chilling
but you can split out the columns check, into it’s own variable (columns_check
) if you think it improves readability
If you try df1[df2.columns].equals(df2)
without first checking the columns you could get an error (if df2
has columns that df1
doesn’t) or something untrue (if df1
has columns that df2
doesn’t)
As part of a unit test, I need to test two DataFrames for equality. The order of the columns in the DataFrames is not important to me. However, it seems to matter to Pandas:
import pandas
df1 = pandas.DataFrame(index = [1,2,3,4])
df2 = pandas.DataFrame(index = [1,2,3,4])
df1['A'] = [1,2,3,4]
df1['B'] = [2,3,4,5]
df2['B'] = [2,3,4,5]
df2['A'] = [1,2,3,4]
df1 == df2
Results in:
Exception: Can only compare identically-labeled DataFrame objects
I believe the expression df1 == df2
should evaluate to a DataFrame containing all True
values. Obviously it’s debatable what the correct functionality of ==
should be in this context. My question is: Is there a Pandas method that does what I want? That is, is there a way to do equality comparison that ignores column order?
You could sort the columns using sort_index
:
df1.sort_index(axis=1) == df2.sort_index(axis=1)
This will evaluate to a dataframe of all True
values.
As @osa comments this fails for NaN’s and isn’t particularly robust either, in practise using something similar to @quant’s answer is probably recommended (Note: we want a bool rather than raise if there’s an issue):
def my_equal(df1, df2):
from pandas.util.testing import assert_frame_equal
try:
assert_frame_equal(df1.sort_index(axis=1), df2.sort_index(axis=1), check_names=True)
return True
except (AssertionError, ValueError, TypeError): perhaps something else?
return False
def equal( df1, df2 ):
""" Check if two DataFrames are equal, ignoring nans """
return df1.fillna(1).sort_index(axis=1).eq(df2.fillna(1).sort_index(axis=1)).all().all()
The most common intent is handled like this:
def assertFrameEqual(df1, df2, **kwds ):
""" Assert that two dataframes are equal, ignoring ordering of columns"""
from pandas.util.testing import assert_frame_equal
return assert_frame_equal(df1.sort_index(axis=1), df2.sort_index(axis=1), check_names=True, **kwds )
Of course see pandas.util.testing.assert_frame_equal
for other parameters you can pass
Sorting column only works if the row and column labels match across the frames. Say, you have 2 dataframes with identical values in cells but with different labels,then the sort solution will not work. I ran into this scenario when implementing k-modes clustering using pandas.
I got around it with a simple equals function to check cell equality(code below)
def frames_equal(df1,df2) :
if not isinstance(df1,DataFrame) or not isinstance(df2,DataFrame) :
raise Exception(
"dataframes should be an instance of pandas.DataFrame")
if df1.shape != df2.shape:
return False
num_rows,num_cols = df1.shape
for i in range(num_rows):
match = sum(df1.iloc[i] == df2.iloc[i])
if match != num_cols :
return False
return True
have you tried using df1.equals(df2)? i think it’s more reliable that df1==df2, though i’m not sure if it will resolve your issues with column order.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html
Usually you’re going to want speedy tests and the sorting method can be brutally inefficient for larger indices (like if you were using rows instead of columns for this problem). The sort method is also susceptible to false negatives on non-unique indices.
Fortunately, pandas.util.testing.assert_frame_equal
has since been updated with a check_like
option. Set this to true and the ordering will not be considered in the test.
With non-unique indices, you’ll get the cryptic ValueError: cannot reindex from a duplicate axis
. This is raised by the under-the-hood reindex_like
operation that rearranges one of the DataFrames to match the other’s order. Reindexing is much faster than sorting as evidenced below.
import pandas as pd
from pandas.util.testing import assert_frame_equal
df = pd.DataFrame(np.arange(1e6))
df1 = df.sample(frac=1, random_state=42)
df2 = df.sample(frac=1, random_state=43)
%timeit -n 1 -r 5 assert_frame_equal(df1.sort_index(), df2.sort_index())
## 5.73 s ± 329 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%timeit -n 1 -r 5 assert_frame_equal(df1, df2, check_like=True)
## 1.04 s ± 237 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
For those who enjoy a good performance comparison plot:
Reindexing vs sorting on int and str indices (str even more drastic)
When working with dataframes containing python objects such as tuples and lists df.eq(df2)
and df == df2
will not suffice. Even if a the same cells in each dataframes contain the same object, such as (0, 0)
, the equality comparison will result to False
. To get around this, convert all columns to strings before comparison:
df.apply(lambda x: x.astype(str)).eq(df2.apply(lambda x: x.astype(str)))
Probably you may need function to compare DataFrames ignoring both row and column order? Only requirement is to have some unique column to use it as index.
f1 = pd.DataFrame([
{"id": 1, "foo": "1", "bar": None},
{"id": 2, "foo": "2", "bar": 2},
{"id": 3, "foo": "3", "bar": 3},
{"id": 4, "foo": "4", "bar": 4}
])
f2 = pd.DataFrame([
{"id": 3, "foo": "3", "bar": 3},
{"id": 1, "bar": None, "foo": "1",},
{"id": 2, "foo": "2", "bar": 2},
{"id": 4, "foo": "4", "bar": 4}
])
def comparable(df, index_col='id'):
return df.fillna(value=0).set_index(index_col).to_dict('index')
comparable(f1) == comparable (f2) # returns True
assert_frame_equal
from pandas.testing
is a function which checks frames equality.
as mentioned in assert_frame_equal
documentation, if set check_like
parameter to True
, it will ignore order of index and columns.
One simple way to do it is
(Assuming df1 and df2)
Check df1 and df2 columns are the same using a set (as order doesn’t matter in a set)
columns_check = set(df1.columns) == set(df2.columns)
If they are different then the dataframe must be different.
If they are the same then you can use the columns of df2 to reorder the columns of df1
contents_check = df1[df2.columns].equals(df2)
The easiest would be:
if set(df1.columns) == set(df2.columns):
equality_check = df1[df2.columns].equals(df2)
else:
equality_check = False
or you could even:
equality_check = False
if set(df1.columns) == set(df2.columns):
equality_check = df1[df2.columns].equals(df2)
If you don’t like that else
chilling
but you can split out the columns check, into it’s own variable (columns_check
) if you think it improves readability
If you try df1[df2.columns].equals(df2)
without first checking the columns you could get an error (if df2
has columns that df1
doesn’t) or something untrue (if df1
has columns that df2
doesn’t)