Python SequenceMatcher Overhead – 100% CPU utilization and very slow processing

Question

I am using difflib to compare files in two directories (versions from consecutive years).
First, I use filecmp to find the files that have changed, and then I iterate over them with difflib.SequenceMatcher to compare each pair and generate an HTML diff, as explained here.

However, the program takes too long to run and Python uses 100% CPU. Profiling shows that the seqm.get_opcodes() call accounts for all of the time.
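
For reference, a measurement like this can be reproduced with the standard-library cProfile module. The sketch below is illustrative only: the sample strings are made up, and profile_get_opcodes is a hypothetical helper, not part of the program in question.

```python
import cProfile
import difflib
import io
import pstats
import string


def profile_get_opcodes(old_text, new_text):
    """Profile SequenceMatcher.get_opcodes() and return (opcodes, report)."""
    seqm = difflib.SequenceMatcher(
        lambda ch: ch in string.whitespace, old_text, new_text
    )
    profiler = cProfile.Profile()
    profiler.enable()
    opcodes = seqm.get_opcodes()  # the call under suspicion
    profiler.disable()

    # Render the five most expensive entries, sorted by cumulative time.
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(5)
    return opcodes, report.getvalue()


# Hypothetical sample inputs standing in for two versions of a file.
old_text = "alpha beta gamma\n" * 200
new_text = "alpha BETA gamma\n" * 200
opcodes, report = profile_get_opcodes(old_text, new_text)
print(report)
```

The report lists get_opcodes together with the functions it calls, which makes it easier to see where inside the matcher the time goes.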

Any insight would be appreciated.
Thanks!

Code:

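import difflib
import os
import string

# changed_set contains the names of the files to be compared;
# old_dir, new_dir and produceDiffs are defined elsewhere
for i in changed_set:
    with open(os.path.join(old_dir, i)) as f:
        oldLines = f.read()
    with open(os.path.join(new_dir, i)) as f:
        newLines = f.read()
    # Whitespace characters are treated as junk when matching
    seqm = difflib.SequenceMatcher(lambda x: x in string.whitespace, oldLines, newLines)
    opcodes = seqm.get_opcodes()  # XXX: Lots of time spent in this!
    produceDiffs(seqm, opcodes)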

Asked By: shauvik


Answer 1

My answer is a different approach to the problem altogether: Try using a version-control system like git to investigate how the directory changed over the years.

Make a repository out of the first directory, then replace its contents with the next year’s directory and commit that as a change (or move the .git directory into the next year’s directory, to save on copying and deleting). Repeat for each following year.

Then run gitk, and you’ll be able to see what changed between any two revisions of the tree. For a binary file it only reports that the file changed; for text files it shows a diff.
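
The commit loop described above could be scripted roughly as follows. This is a minimal sketch, assuming git is installed and on PATH. The function names and the commit identity are hypothetical, and using --git-dir with --work-tree to point one repository at each year's directory in turn (instead of physically moving .git) is an illustrative variation on the suggestion above.

```python
import subprocess
from pathlib import Path


def git(repo_dir, work_tree, *args):
    """Run one git command with repo_dir's history and work_tree's files."""
    cmd = [
        "git",
        f"--git-dir={Path(repo_dir).resolve() / '.git'}",
        f"--work-tree={Path(work_tree).resolve()}",
        # Hypothetical identity so commits work on an unconfigured machine.
        "-c", "user.name=snapshot",
        "-c", "user.email=snapshot@example.invalid",
        *args,
    ]
    result = subprocess.run(cmd, cwd=work_tree, check=True,
                            capture_output=True, text=True)
    return result.stdout


def snapshot_years(repo_dir, year_dirs):
    """Commit each directory in year_dirs, in order, as one revision."""
    Path(repo_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "init", "--quiet", str(repo_dir)], check=True)
    for year_dir in year_dirs:
        # Stage additions, modifications and deletions relative to the
        # previous snapshot, then record them as a single commit.
        git(repo_dir, year_dir, "add", "--all")
        git(repo_dir, year_dir, "commit", "--quiet", "--allow-empty",
            "-m", f"Snapshot of {Path(year_dir).name}")
```

Calling snapshot_years("history", ["2019", "2020"]) (directory names made up) would leave two commits in history/.git without copying any files; the yearly directories themselves are left untouched.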

Answered By: Peter Cordes

Answer 2

You can also try the diff-match-patch library; in my experience it can be 10 times faster.

EDIT: An example, taken from my other answer here:

from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
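    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 disables the timeout, so the diff runs to completion
    diff = dmp.diff_main(text1, text2, False)  # False: skip the line-level pre-pass

    # similarity: share of the longer text covered by unchanged (op == 0) segments
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    # Two empty texts are treated as identical to avoid dividing by zero
    sim = common_text / text_length if text_length else 1.0

    return sim, diff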

Answered By: damio

Answer 3

I recommend trying diff-match-patch, which Google has open-sourced on GitHub. It supports many languages, and its Python 3 version should suit you; it is much faster than difflib.

Answered By: yunzhi
