Compare values of two different DataFrames
Question:
I have two DataFrames with the same columns: one holds historic data and the other holds ‘new’ data. The new data may sometimes contain rows that are already in the historic data. If the value of ‘comment_id’ in the new data is already present in the historic data, I want to do nothing; otherwise, I want to add that row to the historic data.
I tried doing this:
historic_comments = [x for x in filtered_comments if filtered_comments['comment_id'] not in historic_comments['comment_id']]
But got error:
TypeError: unhashable type: 'Series'
Answers:
Use a boolean mask and isin:
m = ~filtered_comments['comment_id'].isin(historic_comments['comment_id'])
out = pd.concat([historic_comments, filtered_comments[m]], axis=0, ignore_index=True)
Output:
>>> out  # new historic_comments dataframe
  comment_id
0    bonjour
1      hello
2      world
3        new
>>> filtered_comments
  comment_id
0      hello
1        new
2      world
>>> historic_comments
  comment_id
0    bonjour
1      hello
2      world
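For anyone who wants to reproduce the output above, here is a minimal self-contained sketch. The two sample frames are rebuilt from the printed output and are an assumption about what the real data looks like (a single comment_id column with string values).

```python
import pandas as pd

# Sample frames reconstructed from the printed output (illustrative data only).
historic_comments = pd.DataFrame({"comment_id": ["bonjour", "hello", "world"]})
filtered_comments = pd.DataFrame({"comment_id": ["hello", "new", "world"]})

# True for rows of the new data whose comment_id is NOT yet in the historic data.
m = ~filtered_comments["comment_id"].isin(historic_comments["comment_id"])

# Append only the unseen rows; ignore_index renumbers the result from 0.
out = pd.concat([historic_comments, filtered_comments[m]], axis=0, ignore_index=True)

print(out)
```

Note that the mask only compares the new data against the historic data. If the same comment_id could appear twice within the new data itself, both rows would be appended.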
I think this is what you can do, assuming historic_df is the old DataFrame and new_df is the new one:
historic_df = pd.concat(
    [historic_df, new_df.loc[~new_df["comment_id"].isin(historic_df["comment_id"])]],
    ignore_index=True,
)
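As for why the original attempt fails: the `in` operator on a Series checks the index labels rather than the values, and it has to hash the left-hand operand to do so. A whole Series cannot be hashed, hence the TypeError. The sketch below illustrates this on a small made-up Series; the variable names are only for illustration, and the exact error message may vary between pandas versions.

```python
import pandas as pd

s = pd.Series(["bonjour", "hello", "world"], name="comment_id")

# `in` on a Series tests the index labels (0, 1, 2 here), not the values.
label_found = 0 in s            # True: 0 is an index label
value_found = "hello" in s      # False: "hello" is a value, not a label

# isin tests the values element-wise and returns a boolean Series.
value_mask = s.isin(["hello"])

# Using a whole Series on the left of `in` requires hashing it, which fails.
try:
    s in s
    raised = False
except TypeError:
    raised = True

print(label_found, value_found, value_mask.tolist(), raised)
```

This is why both answers use isin to build an element-wise mask instead of `in` inside a list comprehension.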