I have to compare two CSV files, drop the duplicate rows, and generate another file.
# Here I'm comparing the CSV files: the oldest_file and the newest_file
different_data_type = newest_file.equals(other=oldest_file)
# If they have differences, I concat them to drop the rows that are equal
merged_files = pd.concat([oldest_file, newest_file])
merged_files = merged_files.drop_duplicates()
print(merged_files)
Each CSV file has about 5,000 rows, and when I print merged_files I get a 10,000-row result. In other words, nothing is being dropped.
How can I get only the rows that have differences?
I think you are missing the columns in drop_duplicates(). Try passing the columns you want to compare through its subset parameter.
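Something along these lines might work. This is only a sketch: the two small DataFrames stand in for your CSV files, and the column names column1 and column2 are placeholders that you would swap for the real columns you want compared.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (made-up columns and values).
oldest_file = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})
newest_file = pd.DataFrame({"column1": [1, 2, 4], "column2": ["a", "b", "d"]})

# Stack both files into one frame with a fresh index.
merged_files = pd.concat([oldest_file, newest_file], ignore_index=True)

# Compare rows only on the listed columns; the first copy of each row is kept.
deduplicated = merged_files.drop_duplicates(subset=["column1", "column2"])
print(deduplicated)
```

Note that this keeps one copy of every row that appears in both files, so the result is the union of the two files rather than only the rows that differ.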
Another way is to find the duplicates in your merged file and then remove them from merged_files:
duplicate_rows = merged_files.duplicated(subset=['column1', 'column2'])
merged_files = merged_files[~duplicate_rows]
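If the goal is to keep only the rows that differ between the two files, one option is to pass keep=False to duplicated(). By default it flags only the later copies of a repeated row, so the first copy survives the filter. With keep=False, every copy is flagged. A minimal sketch, again with made-up data and placeholder column names:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (made-up columns and values).
oldest_file = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})
newest_file = pd.DataFrame({"column1": [1, 2, 4], "column2": ["a", "b", "d"]})

merged_files = pd.concat([oldest_file, newest_file], ignore_index=True)

# keep=False marks every copy of a repeated row, not just the later ones.
duplicate_rows = merged_files.duplicated(subset=["column1", "column2"], keep=False)

# Rows that appear in only one of the two files.
only_differences = merged_files[~duplicate_rows]
print(only_differences)
```

This assumes neither file contains repeated rows on its own. A row duplicated inside a single file would also be flagged and removed.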