Comparing all contents of two files

Question:

I am trying to compare two files. One file has a list of stores. The other file has the same list of stores, except that a few are missing because of a filter I ran against it from another script. I would like to compare these two files: if a store in file 1 is not found anywhere in file 2, I want to print it out or append it to a list; I'm not too picky on that part. Below are partial examples of both files:

file 1:

Store: 00377
Main number: 8033056238

Store: 00525
Main number: 4075624470

Store: 00840
Main number: 4782736996

Store: 00920
Main number: 4783337031

Store: 00998
Main number: 9135631751

Store: 02226
Main number: 3107501983

Store: 02328
Main number: 8642148700

Store: 02391
Main number: 7272645342

Store: 02392
Main number: 9417026237

Store: 02393
Main number: 4057942724

File 2:

00377
00525
00840
00920
00998
02203
02226
02328
02391
02392
02393
02394
02395
02396
02397
02406
02414
02425
02431
02433
02442

Here is what I built to try and make this work, but it just keeps spewing all stores in the file.

def comparesitestest():
    with open("file_1.txt", "r") as pairsin:
        pairs = pairsin.readlines()
        pairsin.close
    with open("file_2.txt", "r") as storesin:
        stores = storesin.readlines()
        storesin.close        
    for pair in pairs:
        for store in stores:
            if store not in pair:
                print(store)
Asked By: mitchell france


Answers:

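You are getting that output because your check does not test what you want. The condition "store not in pair" asks whether one line of file 2 is a substring of one line of file 1, and it runs for every combination of lines, so nearly every comparison prints a store. Try changing your for loop to something like this: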

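stores = [s.strip() for s in stores]    # drop the newline readlines() leaves on each entry
for pairline in pairs:
    pairline = pairline.strip()
    if pairline:                        # skip the blank separator lines
        name, number = pairline.split(': ')
        if name == "Store":
            if number not in stores:
                print(number)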

Explanation is as follows:
You start with a File 1 of pairs and a File 2 of stores (store numbers, really). Your File 2 is in decent shape: after you read it in, you’ve got a list of store numbers (each still carrying the trailing newline that readlines() keeps). You don’t need to put that list through a second loop; doing so is wasteful and unnecessary.
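The newline detail is easy to check. Here is a minimal sketch, with io.StringIO standing in for file_2.txt (the sample data is made up for illustration), showing why the entries need stripping before they are compared against a bare store number:

```python
import io

# Hypothetical in-memory stand-in for file_2.txt
storesin = io.StringIO("00377\n00525\n00840\n")

raw = storesin.readlines()               # each entry keeps its "\n"
stores = [line.strip() for line in raw]  # bare store numbers

print(raw)                # ['00377\n', '00525\n', '00840\n']
print(stores)             # ['00377', '00525', '00840']
print("00377" in raw)     # False
print("00377" in stores)  # True
```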

Your File 1 is a little more complicated. Although you refer to its contents as pairs, each store actually spans two lines: a store number and what I assume is a phone number. So, for each line in File 1, I would check whether the line starts with "Store:", knowing I can ignore all the other lines. If it does, the rest of the line is the store number I actually want to look for in the list from File 2.

So, the program above does a little more checking to see whether it is reading a line it needs to act on, and then acts on it if necessary by checking whether the store number is in the store number list.

Also, as a side note, it’s great to use the with statement; it’s good coding practice. But when you do, you do not need to close the file explicitly. The close happens automatically once you leave the context.
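A small sketch of that behaviour, using a throwaway temporary file (the path is made up for the example). Note also that pairsin.close without parentheses only refers to the method and never calls it, so those lines in the question do nothing either way:

```python
import os
import tempfile

# Hypothetical scratch file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "file_2.txt")
with open(path, "w") as f:
    f.write("00377\n00525\n")

with open(path) as storesin:
    stores = storesin.readlines()
    inside = storesin.closed   # still open inside the block

after = storesin.closed        # closed once the block exits
print(inside, after)           # False True
```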

As another side note, there are usually multiple good ways and bad ways to solve a problem. Another possible reasonable solution/version is:

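stores = [s.strip() for s in stores]    # compare against bare store numbers
for pairline in pairs:
    if pairline.startswith("Store:"):
        store = pairline.split()[1]     # split() also drops the trailing newline
        if store not in stores:
            print(store)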

It’s different. Not necessarily better or worse, just different.
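Since the question also mentions appending to a list instead of printing, here is one way that could look. This is only a sketch: missing_stores is a hypothetical helper name, and the sample lines stand in for what readlines() would return from the two files.

```python
def missing_stores(pair_lines, store_lines):
    """Return store numbers that appear in pair_lines but not in store_lines."""
    known = [line.strip() for line in store_lines]
    missing = []
    for line in pair_lines:
        line = line.strip()
        if line.startswith("Store:"):
            # Everything after the first colon is the store number
            number = line.split(":", 1)[1].strip()
            if number not in known:
                missing.append(number)
    return missing


# Hypothetical sample data standing in for the two readlines() results
pairs = [
    "Store: 00377\n", "Main number: 8033056238\n", "\n",
    "Store: 00525\n", "Main number: 4075624470\n", "\n",
    "Store: 00840\n", "Main number: 4782736996\n",
]
stores = ["00377\n", "00840\n"]

print(missing_stores(pairs, stores))  # ['00525']
```

Returning a list keeps the stores in file order, which a caller can then print, write out, or filter further.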

Answered By: GaryMBloom

When you read your first file, add the store number to a set.

store_nums_1 = set()
with open("file_1.txt") as f:
    for line in f:
        line = line.strip() # Remove surrounding whitespace, including the newline
        if line.startswith("Store"):
            store_nums_1.add(line[7:]) # Add only store number to set

Next, read the other file and add those numbers to another set.

store_nums_2 = set()
with open("file_2.txt") as f:
    for line in f:
        line = line.strip() # Remove surrounding whitespace, including the newline
        store_nums_2.add(line) # The entire line is the store number, so no need to slice.

Finally, find the set difference between the two sets.

file1_extras = store_nums_1 - store_nums_2

This gives a set containing only the store numbers that are in file 1 but not in file 2. (I cut your file_2 down to its first three lines, because the file you’ve shown contains more store numbers than file_1, including all of the ones in file_1, so file1_extras was empty with your input.)

{'00920', '00998', '02226', '02328', '02391', '02392', '02393'}

This is more efficient than using lists, because checking whether something exists in a list is an O(N) operation. When you do it once for each of the M items in your first list, you end up with an O(N*M) operation. Membership checks in a set, on the other hand, are O(1) on average, so the set difference itself is O(M) instead of O(N*M).
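To make the comparison concrete, here is a small sketch with made-up sample values showing that the list-based and set-based checks give the same answer and differ only in how each lookup is performed. Sets are unordered, so sorted() is used to get a stable order for display:

```python
store_nums_1 = {"00377", "00525", "00840", "00920", "00998"}
store_nums_2 = ["00377", "00525", "00840"]  # a list: each 'in' scans it

# List-based: every lookup is an O(N) scan of store_nums_2
via_list = [s for s in store_nums_1 if s not in store_nums_2]

# Set-based: hash lookups, O(1) on average
via_set = store_nums_1 - set(store_nums_2)

print(sorted(via_list))  # ['00920', '00998']
print(sorted(via_set))   # ['00920', '00998']
```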

Answered By: Pranav Hosangadi