Python Pandas Only Compare Identically Labeled DataFrame Objects
Question:
I tried all the solutions here:
Pandas "Can only compare identically-labeled DataFrame objects" error
None of them worked for me. Here's what I've got: I have two data frames. One is a set of financial data that already exists in the system; the other is a new set, some of whose rows may or may not already exist in the system. I need to find the difference and add the rows that don't exist yet.
Here is the code:
import pandas as pd
import numpy as np
from azure.storage.blob import AppendBlobService, PublicAccess, ContentSettings
from io import StringIO
dataUrl = "http://ichart.finance.yahoo.com/table.csv?s=MSFT"
blobUrlBase = "https://pyjobs.blob.core.windows.net/"
data = pd.read_csv(dataUrl)
abs = AppendBlobService(account_name='pyjobs', account_key='***')
abs.create_container("stocks", public_access = PublicAccess.Container)
abs.append_blob_from_text('stocks', 'msft', data[:25].to_csv(index=False))
existing = pd.read_csv(StringIO(abs.get_blob_to_text('stocks', 'msft').content))
ne = (data != existing).any(1)
The failing code is the final line. I was following an article on determining differences between data frames.
I checked the dtypes on all columns, and they appear to be the same. I also compared the output side by side, sorted the axes, sorted the indices, dropped the indices, etc. I still get that bloody error.
Here is the first row of existing and of data:
>>> existing[:1]
         Date       Open   High    Low  Close    Volume  Adj Close
0  2016-05-27  51.919998  52.32  51.77  52.32  17653700      52.32
>>> data[:1]
         Date       Open   High    Low  Close    Volume  Adj Close
0  2016-05-27  51.919998  52.32  51.77  52.32  17653700      52.32
Here is the exact error I receive:
>>> ne = (data != existing).any(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1169, in f
    return self._compare_frame(other, func, str_rep)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3571, in _compare_frame
    raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects
Answers:
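To get around this, compare the underlying NumPy arrays. They are compared by position, so the index and column labels are ignored. Both frames must have the same shape for this to work.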
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['One', 'Two'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['one', 'two'])
df1.values == df2.values
array([[ True,  True],
       [ True,  True]], dtype=bool)
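Before falling back to a positional comparison, it can help to see which part of the labelling actually differs. The following is a minimal sketch; `label_mismatch` is a hypothetical helper name, not part of pandas, and the frames are the toy ones from the answer above.

```python
import pandas as pd


def label_mismatch(left, right):
    """Report how two frames differ in shape and labels (hypothetical helper)."""
    return {
        "same_shape": left.shape == right.shape,
        "same_index": left.index.equals(right.index),
        "same_columns": left.columns.equals(right.columns),
    }


df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"], index=["One", "Two"])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"], index=["one", "two"])

report = label_mismatch(df1, df2)

# A positional comparison is only meaningful when the shapes match.
if report["same_shape"]:
    equal_by_position = (df1.to_numpy() == df2.to_numpy()).all()
```

If `same_shape` is false, as appears to be the case in the question (25 stored rows against the full download), comparing the arrays will not give an element-wise result, and the frames need to be reduced to a common set of rows first.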
I replicated the problem with some fake data to reach my actual goal, which was removing duplicates. Note that this does not answer the question as asked; it solves the underlying task that led me to ask it.
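import numpy as np
import pandas as pd
b = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                  'B': ['B4', 'B5', 'B6', 'B7'],
                  'C': ['C4', 'C5', 'C6', 'C7'],
                  'D': ['D4', 'D5', 'D6', 'D7']},
                 index=[4, 5, 6, 7])
c = pd.DataFrame({'A': ['A7', 'A8', 'A9', 'A10', 'A11'],
                  'B': ['B7', 'B8', 'B9', 'B10', 'B11'],
                  'C': ['C7', 'C8', 'C9', 'C10', 'C11'],
                  'D': ['D7', 'D8', 'D9', 'D10', 'D11']},
                 index=[7, 8, 9, 10, 11])
# stack both frames; the row labelled 7 now appears twice
result = pd.concat([b, c])
# positions of the first occurrence of each unique value in column A
idx = np.unique(result["A"], return_index=True)[1]
# keep only those rows and restore the original row order
# (DataFrame.sort() was removed from pandas; sort_index() replaces it)
result.iloc[idx].sort_index()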
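The same de-duplication can also be sketched with `drop_duplicates`, which is a swapped-in technique rather than what the answer above used. The frames below are cut-down, made-up versions of `b` and `c`, and the variable names `deduped` and `only_once` are only illustrative.

```python
import pandas as pd

# Cut-down toy versions of the frames b and c from the answer above.
b = pd.DataFrame({"A": ["A4", "A5", "A6", "A7"],
                  "B": ["B4", "B5", "B6", "B7"]},
                 index=[4, 5, 6, 7])
c = pd.DataFrame({"A": ["A7", "A8", "A9"],
                  "B": ["B7", "B8", "B9"]},
                 index=[7, 8, 9])

combined = pd.concat([b, c])

# Keep the first occurrence of every value in column A.
deduped = combined.drop_duplicates(subset="A", keep="first")

# Keep only rows that occur exactly once across both frames,
# i.e. drop every row that has a full duplicate.
only_once = combined.drop_duplicates(keep=False)
```

With `keep=False` the shared row is dropped entirely, which is one way to get at "the difference" between the two frames; with `keep="first"` the result is the union without repeats.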
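If you want to compare two DataFrames, check out the flexible comparison methods in pandas: .eq(), .ne(), .gt() and others (equal, not equal, greater than). Unlike the == and != operators, these methods align the two frames on their labels instead of raising the error.
Example:
# 'A' is a placeholder column name; gt() on whole frames returns a DataFrame, not a single column
df['new_col'] = df['A'].gt(df_1['A'])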
http://pandas.pydata.org/pandas-docs/stable/basics.html#flexible-comparisons
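As a rough illustration of how the flexible methods behave when the labels match but are in a different order, here is a small sketch. The frames and the names `aligned_equal` and `rows_differ` are made up for the example.

```python
import pandas as pd

# Same row and column labels, but the rows are in a different order.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
df2 = pd.DataFrame({"A": [2, 1], "B": [4, 0]}, index=["y", "x"])

# .eq() aligns on labels first, so row "x" is compared with row "x".
aligned_equal = df1.eq(df2)

# .ne() plus any(axis=1) flags the rows where at least one cell differs.
rows_differ = df1.ne(df2).any(axis=1)
```

Here only row "x" differs (column B is 3 in one frame and 0 in the other), so `rows_differ` is true for "x" and false for "y". Labels that exist in only one of the frames are compared against missing values, so they never count as equal.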
I faced the same issue and resolved it by sorting the index along both axes before comparing the two dataframes.
df1 = df1.sort_index(axis=1)
df2 = df2.sort_index(axis=1)
df1 = df1.sort_index()
df2 = df2.sort_index()
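A small sketch of why sorting helps, using made-up frames that hold the same data with rows and columns in a different order. The `raised` flag is only there to record whether the plain comparison fails.

```python
import pandas as pd

# Same data and same labels, but rows and columns are ordered differently.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({"B": [4, 3], "A": [2, 1]}, index=[1, 0])

# The operators require identical labels in identical order.
try:
    df1 != df2
    raised = False
except ValueError:
    raised = True

# Sort columns, then rows, so both frames end up with the same label order.
df1_sorted = df1.sort_index(axis=1).sort_index()
df2_sorted = df2.sort_index(axis=1).sort_index()

# Now the element-wise comparison from the question works.
differing_rows = (df1_sorted != df2_sorted).any(axis=1)
```

Sorting only fixes the ordering. If one frame has labels the other lacks, which seems to be the situation in the question, the comparison will still raise after sorting.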