Compare two CSV files in Python and return matching results in a new CSV file without duplicates

Question:

I have two CSV files: one called web_file with 25,000 lines, and one called inv_file with 320,000 lines.

I need to read the value in column 1 of each row of web_file, find all rows of inv_file whose column 1 matches, and write those rows from inv_file into a new CSV file.

Example files with only 5-10 lines don’t show the issue well, so I’ve listed a larger batch of random example numbers below.

Example web_file:

Inv_SKU,Web_SKU,Brand,Barcode
225481-34,225481-34,brand1,987654321
0486592,0486592,brand2,654871233
AB56412,AB56412,brand2,651273214
LL-123456,LL-123456,brand3,748912349
JLPD-65,JLPD-65,brand6,341541648
20143966,20143966,brand3,82193714
39585824,39585824,brand5,36837329
78066099,78066099,brand4,98398987
44381051,44381051,brand1,9090428
86529443,86529443,brand4,6861670
DF 5645 12,DF 5645 12,brand1,489456138
9845671325,9845671325,brand4,498451315
59634923,59634923,brand4,35828574
85290760,85290760,brand2,64562216
41217184,41217184,brand4,12816236
AE48915,AE48915,brand1,342536125
93981723,93981723,brand2,58155601

Example inv_file:

Inv_SKU,Web_SKU,Brand,Barcode
0486592,0486592,brand2,654871233
LL-123456,LL-123456,brand3,748912349
9845671325,9845671325,brand4,498451315
OI3248967,OI3248967,brand2,891513211
AB56412,AB56412,brand2,651273214
DF 5645 12,DF 5645 12,brand1,489456138
225481-34,225481-34,brand1,987654321
123456789,123456789,brand5,654986413
9841531,9841531,brand3,543254512
AE48915,AE48915,brand1,342536125
JLPD-65,JLPD-65,brand6,341541648
MMMM,MMMM,brand7,384941542
23481-4323,23481-4323,brand3,489123157
98451321,98451321,brand4,498121354
23454152,23454152,brand2,894165123
10275690,10275690,brand2,25612670
20143966,20143966,brand3,82193714
59634923,59634923,brand4,35828574
65800253,65800253,brand5,72318134
67722613,67722613,brand6,93290033
92617199,92617199,brand7,95078073
15379652,15379652,brand1,56281224
85290760,85290760,brand2,64562216
78066099,78066099,brand4,98398987
41217184,41217184,brand4,12816236
87152990,87152990,brand4,95058925
73813369,73813369,brand1,2395994
50201544,50201544,brand1,9167830
93981723,93981723,brand2,58155601
39585824,39585824,brand5,36837329
29082963,29082963,brand3,23393947
23856043,23856043,brand8,57295562
74249006,74249006,brand8,83219065
94376071,94376071,brand8,94887004
14553763,14553763,brand8,14223230
44381051,44381051,brand1,9090428
7598085,7598085,brand1,48967969
56383025,56383025,brand2,68864452
44338055,44338055,brand4,47043853
86529443,86529443,brand4,6861670

I tried using this code, but ended up with many duplicated lines, which I want to avoid: the files I’m actually using are so large that I end up with millions of lines.

with open('inv_file.csv', 'r') as f1, open('web_file.csv', 'r') as f2:
    inv_file = f1.readlines()
    web_file = f2.readlines()


with open('result.csv', 'r+') as f3:
    result_file = f3.readlines()

    while len(result_file) < len(web_file):
        for row in inv_file:
            for row1 in web_file:
                if row[0] in row1[0]:
                    f3.write(row1)
        break
Asked By: Brad Kake


Answers:

The while loop seems confused and unnecessary. Why are you not just doing the simple obvious thing?

import csv

with open('inv_file.csv', 'r') as f1, \
     open('web_file.csv', 'r') as f2, \
     open('result.csv', 'a') as f3:
    inv = [x[0] for x in csv.reader(f1)]
    writer = csv.writer(f3)
    for row in csv.reader(f2):
        if row[0] in inv:
            writer.writerow(row)

Demo: https://ideone.com/g6j2lB

It’s not clear why you used 'r+' mode for the output file, or whether you expected us to also suppress output lines for rows which are already in the file. If that’s a requirement of yours, perhaps ask a new question with more details, including this (or another) solution to the problem you actually asked about here.

Answered By: tripleee

You should really be parsing CSV files using the csv library. One approach would be to store a list of the web SKUs (hopefully I’ve got this the right way round) and then check the inv SKUs against it. This can be done efficiently with a generator passed to the csv writer’s writerows() method.

import csv
with open('inv_file.csv', 'r') as f1, open('web_file.csv', 'r') as f2, open('result.csv', 'w') as f3:
    web_skus = [row[0] for row in csv.reader(f2)]
    # web_skus = set([row[0] for row in csv.reader(f2)])  # uncomment to remove duplicate web skus
    inv_file = csv.reader(f1)
    rows = (row for row in inv_file if row[0] in web_skus)

    writer = csv.writer(f3)
    writer.writerows(rows)
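
An editor’s note on the commented-out line above: a set is worth using here regardless of duplicates, because each row[0] in web_skus test then becomes an average O(1) hash lookup instead of a linear scan over up to 25,000 entries. A set comprehension expresses it directly:

web_skus = {row[0] for row in csv.reader(f2)}  # O(1) membership tests; also drops duplicate SKUs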
Answered By: bn_ln

I have two ideas that could solve your problem.

Number 1:
Adding a check if row1 is in result_file before writing

if row[0] in row1[0]:
    if row1 not in result_file:
        f3.write(row1)
        

Be aware that this will take more time the more values you have already parsed.

Number 2:
Adding row1 to a set after writing and checking if row1 is in this set before writing

written = set()
...
if row[0] in row1[0]:
    if row1 not in written:
        f3.write(row1)
        written.add(row1)

This version might be quicker (I’m not sure), but it needs more storage, because every row is held both in the set and in the result file.

If it’s possible to compare just the SKU numbers, you could use them in both cases, which should be quicker and, in case 2, should also take less storage. A sketch of that variant follows.
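
For concreteness, here is a minimal sketch of idea 2 keyed on the SKU. This is an editor’s illustration rather than part of the original answer; it assumes column 1 holds the SKU and uses the csv module instead of raw lines:

import csv

with open('inv_file.csv', 'r', newline='') as f1, \
     open('web_file.csv', 'r', newline='') as f2, \
     open('result.csv', 'w', newline='') as f3:
    web_skus = {row[0] for row in csv.reader(f2)}  # SKUs to match against
    writer = csv.writer(f3)
    written = set()  # SKUs already written, to suppress duplicates
    for row in csv.reader(f1):
        sku = row[0]
        if sku in web_skus and sku not in written:
            writer.writerow(row)
            written.add(sku)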

Answered By: Sn3nS

I’m going to call web_file your filter CSV and inv_file your input CSV.

I mocked up a filter CSV with 25_000 rows and an input CSV with 320_000 rows; a hypothetical sketch of such a generator appears below. I then tried the approach of adding all the filter IDs to a list, then looping over the input rows and writing each row whose ID was in that filter list to the output.
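
The mock files themselves aren’t shown in the answer; a hypothetical generator along these lines would produce data of a comparable shape (the ID format and match ratio here are invented for illustration):

import csv
import random

filter_ids = [f"SKU{n:07d}" for n in range(25_000)]

with open("filter.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Inv_SKU"])
    writer.writerows([fid] for fid in filter_ids)

with open("input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Inv_SKU", "Web_SKU", "Brand", "Barcode"])
    for n in range(320_000):
        # reuse a filter ID for roughly 1 row in 16 so the join is non-empty
        sku = random.choice(filter_ids) if n % 16 == 0 else f"OTHER{n:07d}"
        writer.writerow([sku, sku, f"brand{n % 8 + 1}", random.randrange(10**8)])

The list-based filtering itself: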

import csv

with open("filter.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    next(reader)  # discard header

    filter_ids: list[str] = []
    for row in reader:
        filter_ids.append(row[0])


with (
    open("input.csv", newline="") as f_in,
    open("output.csv", "w", newline="") as f_out,
):
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    writer.writerow(next(reader))

    for row in reader:
        if row[0] in filter_ids:
            writer.writerow(row)

That took about 70 seconds to run.

The program has to make at most 25_000 × 320_000 = 8_000_000_000 (8 billion) comparisons. We can get that down to only 320_000 lookups by using a dict to hold the filter IDs.

...
    ...
    filter_ids: dict[str, None] = {}
    for row in reader:
        filter_ids[row[0]] = None

We don’t have to change the actual filtering of the input: the same if row[0] in filter_ids: syntax works for the dict.

That took 0.13 seconds to run, over 500× faster. Looking up keys in a dict is dramatically faster than checking whether an item is in a list, in general, and especially for big lists. On my machine, the dict approach used about 3 MB more memory than the list approach.
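
The gap is easy to verify in isolation. Here is a self-contained micro-benchmark of just the membership test, with invented IDs:

import timeit

keys = [f"SKU{n:07d}" for n in range(25_000)]
as_list = list(keys)
as_dict = dict.fromkeys(keys)  # same keys, None values

probe = keys[-1]  # worst case for the list: the last element
print(timeit.timeit(lambda: probe in as_list, number=1_000))  # linear scans
print(timeit.timeit(lambda: probe in as_dict, number=1_000))  # hash lookups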

You mentioned duplicate rows in the output. I don’t see duplicate rows in the sample input, but if you need to make sure that an ID is not duplicated in the output, you can use a dict again:

...
    ...
    output_ids: dict[str, None] = {}
    for row in reader:
        id_ = row[0]
        if id_ not in output_ids and id_ in filter_ids:
            writer.writerow(row)
            output_ids[id_] = None
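
A closing editor’s note: the dict[str, None] pattern above is standing in for a set. Python’s built-in set supports the same in tests and states the intent directly, e.g.:

...
    filter_ids: set[str] = {row[0] for row in reader}
...
    output_ids: set[str] = set()
    for row in reader:
        id_ = row[0]
        if id_ not in output_ids and id_ in filter_ids:
            writer.writerow(row)
            output_ids.add(id_)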
Answered By: Zach Young