How to drop_duplicates in python

Question

I have to compare two CSV files, drop the duplicate rows, and generate another file.

# Here I'm comparing the CSV files: oldest_file and newest_file
different_data_type = newest_file.equals(other=oldest_file)

# If they have differences, I concat them to drop the rows that are equal
merged_files = pd.concat([oldest_file, newest_file])

merged_files = merged_files.drop_duplicates()
print(merged_files)

Each CSV file has about 5,000 rows, and when I print merged_files I get 10,000 rows. In other words, nothing is being dropped.

How can I get only the rows that differ?

Asked By: Matheus

Answer 1

I think you need to specify the columns in drop_duplicates(). By default it compares every column, so two rows that differ in any single column are both kept. Try something like:

df.drop_duplicates(subset=['column1', 'column2'])
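To see why subset matters, here is a small sketch with made-up data. The exported_at column is purely an assumption, standing in for any column that differs between otherwise identical rows:

```python
import pandas as pd

# Hypothetical data: the two rows differ only in the "exported_at" column.
df = pd.DataFrame({
    "column1": [1, 1],
    "column2": ["a", "a"],
    "exported_at": ["2020-01-01", "2020-01-02"],
})

# Default: every column is compared, so both rows count as distinct.
full_row = df.drop_duplicates()

# With subset, only column1 and column2 are compared, so the second row is dropped.
by_subset = df.drop_duplicates(subset=["column1", "column2"])

print(len(full_row), len(by_subset))
```

If your real files have a column like this, it could explain why nothing was dropped.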

Another way is to find the duplicates in your merged file and then remove them from merged_files:

duplicate_rows = merged_files.duplicated(subset=['column1', 'column2'])
merged_files = merged_files[~duplicate_rows]
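If what you want is only the rows that differ between the two files, it may be worth trying keep=False, which drops every copy of a duplicated row instead of keeping the first one. A minimal sketch, with small made-up frames standing in for your two CSV files (the id and value columns are assumptions):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
oldest_file = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
newest_file = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "x", "d"]})

merged_files = pd.concat([oldest_file, newest_file], ignore_index=True)

# keep=False removes all occurrences of any row that appears more than once,
# leaving only rows that exist in exactly one of the two files.
only_differences = merged_files.drop_duplicates(keep=False)

print(only_differences)
```

Keep in mind that this also removes rows that are duplicated within a single file, and rows only match when their values are exactly identical (including whitespace and data type).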

Answered By: godot
