How to remove duplicated rows in a CSV file based on a column

Question:

I want to remove every row whose second-column value has already appeared earlier in a CSV file, keeping only the first occurrence:

Skufnoo,222228888444,-6026769894509215039,Вупсень пупсень ❤️‍🩹💗,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,4,True,False,0
mAtkmb,5213786988,4161254730445748607,Даниэль Блинов,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,False,False,False,0
Ethan58,222228888444,7737583697013043644,Ethan,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,4,True,False,0
sheluvjoseph,1421438213,8544915453690665435,អន សំអុល,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0

and write them to a new CSV file like this:

Skufnoo,222228888444,-6026769894509215039,Вупсень пупсень ❤️‍🩹💗,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,4,True,False,0
mAtkmb,5213786988,4161254730445748607,Даниэль Блинов,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,False,False,False,0
sheluvjoseph,1421438213,8544915453690665435,អន សំអុល,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0

I have tried the following code, but it doesn’t work:

import csv

with (
    open('members.csv', 'r', encoding="utf8") as in_file,
    open('members2.csv', 'w', encoding="utf8") as out_file,
):
    writer=csv.writer(out_file)
    tracks = set()
    for row in in_file:
        key = row[1]
        if key not in tracks:
            writer.writerow(row)
            tracks.add(key)

Any help is much appreciated.

Asked By: James Black

||

Answers:

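You forgot to read the input CSV file with csv.reader. Without it, iterating over in_file yields each line as a plain string, so row[1] is the second character of the line rather than the second column: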

in_data = csv.reader(in_file, delimiter=',')

The rest of your code looks fine.
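To make the difference concrete, here is a minimal sketch that compares indexing a raw line with indexing a parsed row. The two-row `sample` string is a hypothetical stand-in for members.csv, and `io.StringIO` is used only so the snippet runs without a file on disk.

```python
import csv
import io

# Hypothetical two-row sample standing in for members.csv
sample = "Skufnoo,222228888444,x\nEthan58,222228888444,y\n"

# Iterating a file-like object yields strings, so [1] is the second character
raw_keys = [line[1] for line in io.StringIO(sample)]

# csv.reader yields lists of fields, so [1] is the second column
parsed_keys = [row[1] for row in csv.reader(io.StringIO(sample))]

print(raw_keys)     # ['k', 't']
print(parsed_keys)  # ['222228888444', '222228888444']
```

With the raw lines, the two keys differ ('k' and 't'), so the duplicate is never detected; with csv.reader both rows share the same key.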

Complete code:

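import csv

# newline="" lets the csv module handle line endings itself, as its documentation recommends
with open('members.csv', 'r', encoding="utf8", newline="") as in_file, open('members2.csv', 'w', encoding="utf8", newline="") as out_file:
    in_data = csv.reader(in_file, delimiter=',')
    writer = csv.writer(out_file)

    tracks = set()  # second-column values that have already been written

    for row in in_data:
        key = row[1]
        if key not in tracks:
            # first time this key is seen: keep the row
            writer.writerow(row)
            tracks.add(key)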
Answered By: kritserv

If you don’t mind having the entire input CSV in memory then you could simply use a dictionary as follows:

import csv

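# encoding="utf8" matches the question's files; the platform default may not handle the non-ASCII names
with open("members.csv", newline="", encoding="utf8") as in_file, open("members2.csv", "w", newline="", encoding="utf8") as out_file:
    # later rows overwrite earlier ones that share the same second-column value
    d = {row[1]: row for row in csv.reader(in_file)}
    csv.writer(out_file).writerows(d.values())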

Note:

Although this fulfils the brief (of removing duplicates), the result will be different to the set() technique. Can you see why?
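As a hint towards that difference, the sketch below runs both techniques on the same small input. The `sample` data and the three function names are hypothetical, chosen for illustration only. The dictionary comprehension overwrites the stored row each time a key repeats, so it ends up with the last row per key, while the set() loop keeps the first. Using `dict.setdefault` is one possible way to get first-occurrence behaviour from a dictionary.

```python
import csv
import io

# Hypothetical sample: rows 1 and 3 share the same second column
sample = "Skufnoo,222228888444\nmAtkmb,5213786988\nEthan58,222228888444\n"

def dedupe_with_set(text):
    """Keep the first row seen for each second-column value."""
    seen = set()
    kept = []
    for row in csv.reader(io.StringIO(text)):
        if row[1] not in seen:
            seen.add(row[1])
            kept.append(row)
    return kept

def dedupe_with_dict(text):
    """Dict comprehension: a repeated key overwrites the stored row."""
    return list({row[1]: row for row in csv.reader(io.StringIO(text))}.values())

def dedupe_with_setdefault(text):
    """setdefault only stores a row if the key is not present yet."""
    d = {}
    for row in csv.reader(io.StringIO(text)):
        d.setdefault(row[1], row)
    return list(d.values())

print(dedupe_with_set(sample))         # Skufnoo and mAtkmb rows
print(dedupe_with_dict(sample))        # Ethan58 and mAtkmb rows
print(dedupe_with_setdefault(sample))  # Skufnoo and mAtkmb rows
```

Note that the dictionary keeps each key at the position where it first appeared, so in the comprehension version the Ethan58 row takes the slot originally held by Skufnoo.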

Answered By: SIGHUP