More efficient way to get the same lines from two files (in Python)

Question:

I wrote a Python script that removes duplicate sentences from a file (by converting its lines to a dict and back to a list). I have two parallel files in two different languages (call them 'source' and 'target'), so every sentence I remove from one file must also be removed from the other.

The way I do that is:

  1. Write all non-duplicate lines from the source file to a new file
  2. Find the original line number of each 'new' line with respect to the old file
  3. Use that line number to find the 'good' sentences in the target file and write them to a new file too.

Regarding step 3 specifically, which is my question, I don't know how to do this efficiently. I've tried the linecache module, reading the whole file into memory (as in my code right now), and iterating sentence by sentence. All of them are so slow as to be unusable.

import argparse

tgt_list = []
parser = argparse.ArgumentParser()
parser.add_argument("--src", type=str, required=True)
parser.add_argument("--tgt", type=str, required=True)
args = parser.parse_args()

# read the source file
with open(args.src, "r") as fs:
    source = fs.readlines()

# read the target file
with open(args.tgt, "r") as ft:
    target = ft.readlines()

# remove duplicates from source (dict preserves insertion order)
source = list(dict.fromkeys(source))

# retrieves the 0-based line number the sentence appears on in the source file
def line_num(sent):
    with open(args.src, "r") as myFile:
        for num, line in enumerate(myFile):
            if sent == line:
                return num


# writes the deduplicated source lines to a new file
with open("newsrc", "w") as eng:
    for line in source:
        eng.write(line)


with open("newsrc", "r") as src:
    for line in src:
        # finds the target sentence at the same position as the source sentence
        tgt_line = target[line_num(line)]
        # then appends it to a list
        tgt_list.append(tgt_line)

# writes each sentence in the list to a new target file
with open("newtgt", "w") as out:
    for item in tgt_list:
        out.write(item)

For anybody who works in Neural Machine Translation or Corpus Linguistics: what I'm trying to do is corpus deduplication.

I have run out of other things to try, and I would appreciate any advice.

Regards,

Justin

Asked By: Justin Cunningham


Answers:

Once you have a list of all the lines that should remain in the target file, the easiest approach is probably to iterate over the target in the following manner (pseudo-code):

input: 

* good_lines /* line numbers that should remain in the target */
* old target file
* new target file

Algorithm: 

* good_lines <- sorted(good_lines) # not mandatory, but better performance
* open both files. 
* line_counter = 0

* for line in old target_file: 
  * if line_counter is in good_lines:
    * write line to new target file. 
  * line_counter = line_counter + 1

That way, you run through the target file exactly once. You can optimize it further by iterating over good_lines and the target file in parallel, advancing each iterator as its counterpart catches up (a merge-style pass), so you never test membership at all.
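A minimal Python sketch of the single-pass filter described above. It uses a set for O(1) membership tests instead of a sorted list; the function name and file paths are placeholders, not from the original answer:

```python
def filter_target(good_lines, old_target_path, new_target_path):
    """Keep only the target lines whose 0-based index is in good_lines."""
    good = set(good_lines)  # set gives O(1) membership tests
    with open(old_target_path, "r") as old, open(new_target_path, "w") as new:
        for line_counter, line in enumerate(old):
            if line_counter in good:
                new.write(line)
```

With good_lines built once from the deduplicated source (e.g. the first index of each unique sentence), the whole step 3 becomes linear in the size of the target file instead of quadratic.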

Answered By: Roy2012