Diff a list of multiline strings with difflib without knowing which were added, deleted or modified

Question:

I have two lists of multiline strings and I am trying to get the diff lines for these strings. First I tried splitting every string into lines, treating all of those lines as one big "file" and diffing that, but this produced a lot of bugs. I cannot simply diff by index, because I do not know which multiline string was added, which was deleted and which was modified.

Let's say I have the following example:

import difflib
oldList = ["one\ntwo\nthree", "four\nfive\nsix", "seven\neight\nnine"]
newList = ["four\nfifty\nsix", "seven\neight\nnine", "ten\neleven\ntwelve"]
# flatten every multiline string into one long list of lines
oldAllTogether = []
for string in oldList:
    oldAllTogether.extend(string.splitlines())
newAllTogether = []
for string in newList:
    newAllTogether.extend(string.splitlines())
diff = difflib.unified_diff(oldAllTogether, newAllTogether)

So I somehow have to find out which strings belong together.

Asked By: Moe


Answers:

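I had to implement my own code to get the desired output. It is basically the same as Differ.compare(), except that it compares multiline blocks instead of single lines: a block from the new list is treated as a modification of an old block when their similarity ratio is at least 0.75, and as an addition otherwise. The code is: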

import difflib

diffString = ""
oldList = ["one\ntwo\nthree", "four\nfive\nsix", "seven\neight\nnine"]
newList = ["four\nfifty\nsix", "seven\neight\nnine", "ten\neleven\ntwelve"]
a = oldList
b = newList
# first pass: compare whole multiline strings as sequence elements
cruncher = difflib.SequenceMatcher(None, a, b)
# get_opcodes() returns a finished list, so reusing cruncher below is safe
for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
    if tag == 'replace':
        cutoff = 0.75
        oldstrings = a[alo:ahi]
        newstrings = b[blo:bhi]
        for j in range(len(newstrings)):
            newstring = newstrings[j]
            cruncher.set_seq2(newstring)
            # reset for every new string so a stale match is never reused
            best_ratio, best_old, best_new = 0.74, None, None
            for i in range(len(oldstrings)):
                oldstring = oldstrings[i]
                cruncher.set_seq1(oldstring)
                # cheap upper bounds first, the exact ratio last
                if (cruncher.real_quick_ratio() > best_ratio and
                        cruncher.quick_ratio() > best_ratio and
                        cruncher.ratio() > best_ratio):
                    best_ratio, best_old, best_new = cruncher.ratio(), i, j
            if best_ratio < cutoff:
                # added string
                stringLines = newstring.splitlines()
                for line in stringLines:
                    diffString += "+" + line + "\n"
            else:
                # replaced string: keep only the lines after the first '@@' header
                start = False
                for diff in difflib.unified_diff(oldstrings[best_old].splitlines(),
                                                 newstrings[best_new].splitlines()):
                    if start:
                        diffString += diff + "\n"
                    if diff[0:2] == '@@':
                        start = True
                # the matched old string must not be reported as deleted
                del oldstrings[best_old]
        # deleted strings: old strings that matched no new string
        stringLines = []
        for string in oldstrings:
            stringLines.extend(string.splitlines())
        for line in stringLines:
            diffString += "-" + line + "\n"
    elif tag == 'delete':
        stringLines = []
        for string in a[alo:ahi]:
            stringLines.extend(string.splitlines())
        for line in stringLines:
            diffString += "-" + line + "\n"
    elif tag == 'insert':
        stringLines = []
        for string in b[blo:bhi]:
            stringLines.extend(string.splitlines())
        for line in stringLines:
            diffString += "+" + line + "\n"
    elif tag == 'equal':
        continue
    else:
        raise ValueError('unknown tag %r' % (tag,))

which results in the following:

print(diffString)
 four
-five
+fifty
 six
-one
-two
-three
+ten
+eleven
+twelve
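
To see why the loop branches on 'replace', 'equal' and 'insert', it may help to print the block-level opcodes for the example lists, together with the similarity ratio of the one near-match pair. This is only an illustrative sketch: the names old_blocks, new_blocks, matcher, opcodes and ratio are made up for it, and the 0.75 threshold it refers to is the cutoff used in the code above.

```python
import difflib

old_blocks = ["one\ntwo\nthree", "four\nfive\nsix", "seven\neight\nnine"]
new_blocks = ["four\nfifty\nsix", "seven\neight\nnine", "ten\neleven\ntwelve"]

# Block level: each multiline string is one element of the sequence,
# so only strings that are completely identical count as equal.
matcher = difflib.SequenceMatcher(None, old_blocks, new_blocks)
opcodes = matcher.get_opcodes()
for tag, alo, ahi, blo, bhi in opcodes:
    print(tag, old_blocks[alo:ahi], new_blocks[blo:bhi])

# Character level: similarity of the old "four..." block and its edited
# counterpart, which decides between "modified" and "added".
ratio = difflib.SequenceMatcher(None, old_blocks[1], new_blocks[0]).ratio()
print(round(ratio, 3))
```

For these lists the first pass yields one 'replace', one 'equal' and one 'insert' opcode. Inside the 'replace' range, the ratio of the two "four..." blocks is above 0.75, so that pair is reported as a modification instead of one deletion plus one addition, while "one/two/three" matches nothing and is reported as deleted.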
Answered By: Moe