How to drop_duplicates in Python


I have to compare two CSV files, drop the duplicate rows, and generate another file.

import pandas as pd

# Here I'm comparing the two CSV files: the oldest_file and the newest_file
files_are_equal = newest_file.equals(other=oldest_file)
# If they have differences, I concat them to drop the rows that are equal
merged_files = pd.concat([oldest_file, newest_file])
merged_files = merged_files.drop_duplicates()

Each CSV file has about 5,000 rows, and when I print merged_files I get a 10,000-row result. In other words, it's not dropping anything.

How can I get only the rows that have differences?

Asked By: Matheus



I think you need to indicate which columns to compare in drop_duplicates(); try something like:

df.drop_duplicates(subset=['column1', 'column2'])
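A minimal, runnable sketch of this idea, using small DataFrames with the hypothetical columns column1 and column2 standing in for the two CSV files:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
oldest_file = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"]})
newest_file = pd.DataFrame({"column1": [1, 3], "column2": ["a", "c"]})

merged_files = pd.concat([oldest_file, newest_file])
# Keep only the first occurrence of each (column1, column2) pair
deduped = merged_files.drop_duplicates(subset=["column1", "column2"])
print(deduped)  # 3 rows: (1, a), (2, b), (3, c)
```

If the default drop_duplicates() drops nothing, the rows likely differ in some column you aren't printing, or in subtle ways such as dtype or trailing whitespace; restricting the comparison to the columns you care about sidesteps that.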

One other way is to find duplicates in your merged file and then delete them from merged_files:

# Mark every row after its first occurrence as a duplicate
duplicate_rows = merged_files.duplicated(subset=['column1', 'column2'])
# Keep only the rows that are not duplicates
merged_files = merged_files[~duplicate_rows]
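Note that both approaches above still keep one copy of each duplicated row. Since the goal is to keep only the rows that differ between the two files, passing keep=False (a documented option of drop_duplicates) discards every copy of a duplicated row. A small sketch, again using hypothetical stand-in DataFrames:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; row (1, "a") appears in both
oldest_file = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"]})
newest_file = pd.DataFrame({"column1": [1, 3], "column2": ["a", "c"]})

merged_files = pd.concat([oldest_file, newest_file])
# keep=False drops *all* copies of duplicated rows, leaving only
# the rows that appear in exactly one of the two files
only_diffs = merged_files.drop_duplicates(keep=False)
print(only_diffs)  # 2 rows: (2, b) and (3, c)
```

The same flag works with duplicated(keep=False) if you prefer the boolean-mask style shown above.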
Answered By: godot