Python SequenceMatcher Overhead – 100% CPU utilization and very slow processing
Question:
I am using difflib to compare files in two directories (versions from consecutive years).
First, I use filecmp to find the files that have changed, and then I iterate over them with difflib.SequenceMatcher to compare each pair and generate an HTML diff, as explained here.
However, the program takes too long to run, and Python uses 100% CPU. Profiling shows that the seqm.get_opcodes() call accounts for almost all of the time.
Any insight would be appreciated.
Thanks !
Code:
import difflib
import os
import string

# changed_set contains the names of the files to be compared
for i in changed_set:
    with open(os.path.join(old_dir, i)) as f:
        oldLines = f.read()
    with open(os.path.join(new_dir, i)) as f:
        newLines = f.read()
    seqm = difflib.SequenceMatcher(lambda x: x in string.whitespace, oldLines, newLines)
    opcodes = seqm.get_opcodes()  # XXX: Lots of time spent in this!
    produceDiffs(seqm, opcodes)
    del seqm
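The question does not show how changed_set is built. A minimal sketch of the filecmp step, assuming only the top level of each directory is compared; the helper name find_changed_files is hypothetical and not part of the original code:

```python
import filecmp
import os


def find_changed_files(old_dir, new_dir):
    """Return names of files present in both directories whose contents differ.

    Hypothetical helper: only the top level of each directory is examined.
    """
    common = [
        name
        for name in os.listdir(old_dir)
        if os.path.isfile(os.path.join(old_dir, name))
        and os.path.isfile(os.path.join(new_dir, name))
    ]
    # shallow=False compares file contents instead of trusting os.stat() signatures.
    _match, mismatch, _errors = filecmp.cmpfiles(old_dir, new_dir, common, shallow=False)
    return set(mismatch)
```

Files that exist in only one of the two directories are ignored here; whether they should be reported as well depends on what the diff report is meant to show.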
Answers:
My answer is a different approach to the problem altogether: try using a version-control system like git to investigate how the directory changed over the years.
Make a repository out of the first directory, then replace the contents with the next year’s directory and commit that as a change (or move the .git directory to the next year’s directory, to save on copying and deleting). Repeat for each year.
Then run gitk, and you’ll be able to see what changed between any two revisions of the tree: for binary files, just the fact that the file changed; for text files, a diff.
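The steps above could be scripted roughly as follows. This is only a sketch: it assumes git is installed and Python 3.8 or later, the function name snapshot_years and the placeholder commit identity are made up for illustration, and it copies each year's files into a separate repository directory instead of moving .git around.

```python
import shutil
import subprocess
from pathlib import Path


def snapshot_years(repo_dir, year_dirs):
    """Commit each directory in year_dirs, in order, as one revision of repo_dir."""
    repo = Path(repo_dir)
    repo.mkdir(parents=True, exist_ok=True)
    # Placeholder identity so the commits work even without a global git config.
    git = ["git", "-C", str(repo),
           "-c", "user.name=snapshot", "-c", "user.email=snapshot@example.invalid"]
    subprocess.run(git + ["init", "--quiet"], check=True)
    for year_dir in year_dirs:
        # Empty the working tree, keeping only the .git directory.
        for entry in repo.iterdir():
            if entry.name == ".git":
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
        # Copy this year's files in and record them as one commit.
        shutil.copytree(year_dir, repo, dirs_exist_ok=True)
        subprocess.run(git + ["add", "--all"], check=True)
        message = f"Snapshot of {Path(year_dir).name}"
        subprocess.run(git + ["commit", "--quiet", "--allow-empty", "-m", message], check=True)
```

After it finishes, running gitk inside the repository directory shows one commit per year.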
You can also try the diff-match-patch library; in my experience it can be 10 times faster.
EDIT: Here is an example from my other answer:
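from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 disables the timeout, so the diff runs to completion
    diff = dmp.diff_main(text1, text2, False)  # False: skip the line-level pre-pass
    # Similarity: characters in equal (op == 0) segments over the longer text's length
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    # Guard against division by zero; two empty texts are treated as identical
    sim = common_text / text_length if text_length else 1.0
    return sim, diff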
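To make the similarity measure concrete without installing the library, here is a small sketch that applies the same formula to a hand-written diff list. The list uses the (operation, text) tuple format that diff_main returns, with -1 for deletions, 0 for equal text, and 1 for insertions; the tuples below are written by hand for illustration, and the library's actual output for this input may be segmented differently. The helper name similarity_from_diff is hypothetical.

```python
# Operation codes used in diff-match-patch diff tuples.
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def similarity_from_diff(diff, text1, text2):
    """Share of the longer text covered by unchanged (equal) segments."""
    common = sum(len(txt) for op, txt in diff if op == DIFF_EQUAL)
    longest = max(len(text1), len(text2))
    # Assumption: two empty texts count as identical.
    return common / longest if longest else 1.0


# Hand-written diff for "the cat" -> "the hat" (illustrative, not library output).
diff = [
    (DIFF_EQUAL, "the "),
    (DIFF_DELETE, "c"),
    (DIFF_INSERT, "h"),
    (DIFF_EQUAL, "at"),
]
sim = similarity_from_diff(diff, "the cat", "the hat")
print(round(sim, 3))  # 6 equal characters out of 7 -> 0.857
```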
I recommend trying diff-match-patch, which Google has open-sourced on GitHub. It is available for many languages; the Python 3 version should suit you, and it is much faster than difflib.