What is the most efficient way to identify text similarity between items in large lists of strings in Python?

Question:

The following piece of code produces the results I’m after. There is a list of strings called ‘lemmas’ that contains the accepted forms of a specific class of words. The other list, called ‘forms’, contains a lot of spelling variations of words found in a large number of texts from different periods and different dialects of a specific language. For each of the words in ‘forms’, I want to get the string in ‘lemmas’ that is the closest match.

The script, as I said, seems to work well with some test lists I’ve constructed. The problem, though, is that when I use the real lists, which are rather large, it takes forever to produce the results. In fact, I had to stop the execution of the program because it had already been running for more than two hours and the computer had become so slow that I couldn’t do anything else.

What could I do to make this more efficient? How would I have to modify the code using other Python tools or libraries to make this faster? Thanks in advance.

    import textdistance
    from textdistance import hamming
    from textdistance import cosine
    from textdistance import jaro_winkler
    import heapq
    
    # 'lemmas' is a list containing a huge amount of words, basically dictionary entries
    # 'forms' is a huge list of spelling variations of words found in hundreds of texts
    
    distances = {}
    
    processed_pairs = set() # keep track of processed pairs
    for lemma in lemmas:
        if lemma is None:
            continue
        lemma_lower = lemma.lower()
        for form in forms:
            if form is None:
                continue        
            form_lower = form.lower()
            pair = (lemma_lower, form_lower) # create a tuple with the lowercase pair
            if pair not in processed_pairs: # check if the pair has been processed before
                processed_pairs.add(pair)
                if textdistance.hamming.normalized_similarity(lemma_lower, form_lower) > 0.34 and textdistance.jaro_winkler(lemma_lower, form_lower) > 0.7 and textdistance.cosine(lemma_lower, form_lower) > 0.5:             
                    dist = hamming.normalized_similarity(lemma_lower, form_lower)
                    distances.setdefault(form_lower, []).append((dist, lemma_lower))
    
    # Find the closest pairs
    closest_pairs = {}
    for form, dist_lemmas in distances.items():
        closest_pairs[form] = heapq.nsmallest(2, dist_lemmas)
    
    with open(ROOT / 'potential_lemmas.txt', 'w') as f:
        for form, pairs in closest_pairs.items():
            for dist, lemma in pairs:
                f.write(f"{form} ➝  {lemma}: {dist}\n")
             

EDIT:

In the end, the solution that worked best was an integration of @Kyle F Hartzenberg’s proposal with @Jamie_B’s suggestion of using joblib to parallelize (see comments after the code, though):

from itertools import zip_longest
from bisect import insort
from joblib import Parallel, delayed
import line_profiler

profile = line_profiler.LineProfiler()

lemmas = ['gran', 'vermell', 'groc', 'atens', 'Do', 'dOne', 'PUrpose', 'can', 'be', 'use', 'for', 'cannon', 'amuse', 'useful', 'user', 'become', 'downtown', 'develop', 'fulminate', 'deduce', 'de', 'bezant']

forms = ['preriarenos', 'Marinara', 'Grand', 'Gran', 'Grans', 'Grands', 'Grandeses', 'Grandullons', 'grand', 'grandissisimus', 'gran', 'grans', 'grands', 'grandeses', 'grandullons', 'grandullon', 'grandullones', 'uermell', 'uermells', 'vermell', 'vermells', 'vermella', 'vermelles', 'varmellíssimes', 'uarmellíssimes', 'uermellíssimes', 'uarnellíssimes', 'varmellíssima', 'uermella', 'uarmella', 'uarnella', 'varnella', 'uarnellas', 'varnellas', 'varmella', 'uermelles', 'grog', 'grogues', 'doNE', 'donE', 'doIng', 'purposeful', 'canonical', 'becareful', 'being', 'berate', 'best', 'bezant', 'full', 'fulmination', 'predict', 'downgrade', 'down', 'developing', 'deduct', 'deducing']

distances = {}

@delayed
def calc_distances(form, lemmas_low):
    form_distances = []
    for lemma in lemmas_low:
        char_matches = [c1 != c2 for c1, c2 in zip_longest(lemma, form)]
        dist = 1 - (sum(char_matches)/len(char_matches))
        if dist > 0.25:
            insort(form_distances, (dist, lemma))
    return (form, form_distances)

@profile
def profile_distance_calcs():
    lemmas_low = [lemma.lower() for lemma in lemmas]
    forms_low = [form.lower() for form in forms]
    results = Parallel(n_jobs=-1, prefer="threads")(calc_distances(form, lemmas_low) for form in forms_low)
    for form, form_distances in results:
        distances[form] = form_distances

    with open("potential_lemmas_hamming-like.txt", "w") as f:
        for form, form_distances in distances.items():
            for dist, lemma in reversed(form_distances[-2:]):
                f.write(f"{form} ➝  {lemma}: {dist}\n")

if __name__ == "__main__":
    profile_distance_calcs()
    profile.print_stats()

This was a HUGE improvement over everything I had tried before. Besides the test with the short lists in the example, I ran it with the actual lists, which contain around 190,000 strings, and the processing time was 118 minutes. While I’m pretty sure this could be improved (one might look for some kind of vectorization; someone suggested using arrays from NumPy or AI-oriented libraries, and a rough sketch of that idea is included right below), for the time being this is quite manageable.
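
For what it’s worth, this is roughly what I imagine such a NumPy vectorization could look like. It is only a sketch reusing the lemmas and forms lists from the snippet above (the helper name to_code_matrix is mine, and I haven’t benchmarked it); it should give the same "1 - mismatches / max length" score, but the full comparison array would be far too big for the real lists, so it would have to be processed in chunks of forms:

import numpy as np

def to_code_matrix(words, width):
    # Encode each string as a fixed-width row of character codes (0 = padding).
    mat = np.zeros((len(words), width), dtype=np.uint32)
    for i, word in enumerate(words):
        mat[i, :len(word)] = [ord(c) for c in word]
    return mat

lemmas_low = [lemma.lower() for lemma in lemmas]
forms_low = [form.lower() for form in forms]
width = max(map(len, lemmas_low + forms_low))

L = to_code_matrix(lemmas_low, width)   # shape (n_lemmas, width)
F = to_code_matrix(forms_low, width)    # shape (n_forms, width)

lemma_lens = np.array([len(w) for w in lemmas_low])
form_lens = np.array([len(w) for w in forms_low])
pair_len = np.maximum(form_lens[:, None], lemma_lens[None, :])

# Count matching positions per (form, lemma) pair, discarding the
# padding-vs-padding positions beyond the longer word of each pair.
matches = (F[:, None, :] == L[None, :, :]).sum(axis=2) - (width - pair_len)
sim = matches / pair_len

# Indices of the two best lemmas for each form (unsorted within each pair).
top2 = np.argpartition(-sim, 2, axis=1)[:, :2]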

There is still a problem, though, that has nothing to do with efficiency. I mentioned it in my comment to @jqurious below, but I will explain it here in more detail. Running the script above with the test lists, one gets results like the following:

berate ➝  bezant: 0.5
berate ➝  become: 0.5

From a linguistic point of view, any English speaker would know that these pairs of words are not related (OK, unless you know about the history of the language and know that be- used to be a productive prefix). What I’m trying to do with this script is to determine what would be the appropriate lemma (the dictionary form or representative word) for all the variants of a particular word found in the texts of a corpus.

This is a diachronic corpus containing many texts from many different authors and many different dialects of a language, written over a period of more than five centuries. A ‘u’ could often be used instead of a ‘v’, or a ‘y’ instead of an ‘i’. An ‘h’ can also often be missing from a word that is spelt with ‘h’ even in the same text by the same author. The variation is huge, and yet even a modern speaker of the language can usually detect quite easily whether the words are related. Of course, the speaker of the language knows the word structure and the morphology and so can immediately see that, for instance, ‘uermellíssima’ is related to ‘vermell’ despite the fact that a lot of characters are different.
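
Just to illustrate what I mean, a crude normalisation pass like the following (the replacement rules are made up from the alternations I just described, not part of the script above) already brings many of these variants together before any distance is computed:

def normalize(word):
    # Made-up rules based on the alternations described above (u~v, y~i,
    # unstable h); real rules for the corpus would obviously need more care.
    word = word.lower()
    word = word.replace("u", "v")
    word = word.replace("y", "i")
    word = word.replace("h", "")
    return word

print(normalize("uermellíssima"))  # vermellíssima
print(normalize("vermell"))        # vermell
# the distances would then be computed on normalize(lemma) vs normalize(form)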

Using Kyle’s suggestion with the actual lists, I got results like the following:

beato ➝  beat: 0.8
beatriç ➝  tectriu: 0.5714285714285714
beatriç ➝  teatral: 0.5714285714285714
beatte ➝  beats: 0.6666666666666667
beatus ➝  certus: 0.6666666666666667
beatíssim ➝  nequíssim: 0.6666666666666667
beatíssim ➝  gravíssim: 0.6666666666666667

Even if you don’t know the language (medieval Catalan, in case anybody is interested), you can see how this is very wrong (using other algorithms like Levenshtein or cosine distance, it is just hopeless). The lemmas ‘beat’ or ‘beats’ should ideally be the ones selected as the "closest" in all these cases. Yet the algorithm does what it does.

Perhaps I haven’t looked hard enough, but with all the work in NLP, I’m surprised there aren’t other algorithms that could do better in this kind of scenario. I know this deviates a little from the main point of the original question, but if anybody can give me some useful advice, I would greatly appreciate it.

Asked By: jfontana


Answers:

The following solution is based on your original code (Hamming distance) and offers an (almost) order-of-magnitude speed-up (~89.41%, averaged across five runs of each, as measured by line-profiler). Using this solution as a base for parallel processing may get you closer to the total processing times you are after.

To use line-profiler, pip install line-profiler and then run kernprof -l -v test.py after adding @profile and calling the function to be profiled from __main__.

from itertools import zip_longest
from bisect import insort

lemmas = ["Do", "dOne", "PUrpose", "can", "be", "use", "for", "cannon", "amuse", "useful", "user", "become", "downtown", "develop", "fulminate", "deduce", "de", "bezant"]
forms = ["doNE", "donE", "doIng", "purposeful", "canonical", "becareful", "being", "berate", "best", "bezant", "full", "fulmination", "predict", "downgrade", "down", "developing", "deduct", "deducing"]
distances = {}

@profile
def profile_distance_calcs():
    lemmas_low = [lemma.lower() for lemma in lemmas]
    forms_low = [form.lower() for form in forms]
    for form in forms_low:
        form_distances = []
        for lemma in lemmas_low:
            char_matches = [c1 != c2 for c1, c2 in zip_longest(lemma, form)]
            dist = 1 - (sum(char_matches)/len(char_matches))
            if dist > 0.25:
                insort(form_distances, (dist, lemma))
        distances[form] = form_distances

    with open("potential_lemmas_hamming.txt", "w") as f:
        for form, form_distances in distances.items():
            for dist, lemma in reversed(form_distances[-2:]):
                f.write(f"{form} ➝  {lemma}: {dist}\n")

if __name__ == "__main__":
    profile_distance_calcs()

From the time profile breakdown below (total time: 0.00122992 s), you can get an idea of where the slow-downs are coming from.

The main culprit is (obviously) the distance computation, which is why I switched textdistance.hamming.normalized_similarity for a much more efficient (barebones) manual calculation of the same thing, based on the textdistance hamming and hamming.normalized_similarity source code. I also believe that using bisect.insort to maintain a sorted list while inserting is faster than appending all elements and then running heapq.nlargest.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    10                                           @profile
    11                                           def profile_distance_calcs():
    12         1          7.9      7.9      0.6      lemmas_low = [lemma.lower() for lemma in lemmas]
    13         1          7.0      7.0      0.6      forms_low = [form.lower() for form in forms]
    14        18          1.8      0.1      0.1      for form in forms_low:
    15        18          2.0      0.1      0.2          form_distances = []
    16       324         33.4      0.1      2.7          for lemma in lemmas_low:
    17       324        844.5      2.6     68.7              char_matches = [c1 != c2 for c1, c2 in zip_longest(lemma, form)]
    18       324        155.6      0.5     12.7              dist = 1 - (sum(char_matches)/len(char_matches))
    19       285         44.4      0.2      3.6              if dist > 0.25:
    20        39         12.3      0.3      1.0                  insort(form_distances, (dist, lemma))
    21        18          4.7      0.3      0.4          distances[form] = form_distances
    22
    23         1         52.5     52.5      4.3      with open("potential_lemmas_hamming.txt", "w") as f:
    24        17          4.2      0.2      0.3          for form, form_distances in distances.items():
    25        26         11.5      0.4      0.9              for dist, lemma in reversed(form_distances[-2:]):
    26        26         48.3      1.9      3.9                  f.write(f"{form} ➝  {lemma}: {dist}\n")

Original Code Speed Profile

Here is your original code for comparison. I modified some aspects of it; the main difference is the use of heapq.nlargest, as I believe you were after the two most similar lemmas for each form and not the two least similar, which heapq.nsmallest provided.

from textdistance import hamming, cosine, jaro_winkler
import heapq

lemmas = ["do", "done", "purpose", "can", "be", "use", "for", "cannon", "amuse", "useful", "user", "become", "downtown", "develop", "fulminate", "deduce", "de", "bezant"]
forms = ["done", "done", "doing", "purposeful", "canonical", "becareful", "being", "berate", "best", "bezant", "full", "fulmination", "predict", "downgrade", "down", "developing", "deduct", "deducing"]
distances = {}
processed_pairs = set() # keep track of processed pairs

@profile
def profile_distance_calcs():
    for lemma in lemmas:
        if lemma is None:
            continue
        lemma_lower = lemma.lower()
        for form in forms:
            if form is None:
                continue        
            form_lower = form.lower()
            pair = (lemma_lower, form_lower)
            if pair not in processed_pairs:
                processed_pairs.add(pair)
                dist = hamming.normalized_similarity(lemma_lower, form_lower)
                if dist > 0.25: 
                    distances.setdefault(form_lower, []).append((dist, lemma_lower))

    # Find the closest pairs
    closest_pairs = {}
    for form, dist_lemmas in distances.items():
        closest_pairs[form] = heapq.nlargest(2, dist_lemmas)

    with open("potential_lemmas_orig.txt", "w") as f:
        for form, pairs in closest_pairs.items():
            for dist, lemma in pairs:
                f.write(f"{form} ➝  {lemma}: {dist}\n")

if __name__ == "__main__":
    profile_distance_calcs()

Time profile breakdown for the original code (total time: 0.0114992 s):

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def profile_distance_calcs():
    13        18          2.4      0.1      0.0      for lemma in lemmas:
    14        18          1.9      0.1      0.0          if lemma is None:
    15                                                       continue
    16        18          6.4      0.4      0.1          lemma_lower = lemma.lower()
    17       324         38.8      0.1      0.3          for form in forms:
    18       324         32.6      0.1      0.3              if form is None:
    19                                                           continue
    20       324        108.2      0.3      0.9              form_lower = form.lower()
    21       324         46.9      0.1      0.4              pair = (lemma_lower, form_lower)
    22       306         60.2      0.2      0.5              if pair not in processed_pairs:
    23       306         92.0      0.3      0.8                  processed_pairs.add(pair)
    24       306      10828.9     35.4     94.2                  dist = hamming.normalized_similarity(lemma_lower, form_lower)
    25       270         47.5      0.2      0.4                  if dist > 0.25:
    26        36         24.1      0.7      0.2                      distances.setdefault(form_lower, []).append((dist, lemma_lower))
    27
    28                                               # Find the closest pairs
    29         1          0.2      0.2      0.0      closest_pairs = {}
    30        16          4.3      0.3      0.0      for form, dist_lemmas in distances.items():
    31        16         72.7      4.5      0.6          closest_pairs[form] = heapq.nlargest(2, dist_lemmas)
    32
    33         1         72.3     72.3      0.6      with open("potential_lemmas_orig.txt", "w") as f:
    34        16          4.2      0.3      0.0          for form, pairs in closest_pairs.items():
    35        26          6.5      0.3      0.1              for dist, lemma in pairs:
    36        26         49.0      1.9      0.4                  f.write(f"{form} ➝  {lemma}: {dist}\n")

Measuring Natural Language Similarity

Measuring the similarity between two pieces of natural language text is a non-trivial task. Attempting to gauge spelling/morphological/semantic similarity purely from rudimentary character-based metrics (e.g. Hamming distance, Levenshtein distance, etc.) won’t suffice, as these metrics fail to capture complex linguistic patterns (which is why neural-network methods are commonly used to pick up these patterns in large bodies of text). With that being said, one can begin to add one’s own "rules" to calculate more "accurate" similarity scores. For example, the code below modifies the normalised Hamming similarity computation to track how many consecutive characters match and then scales the "similarity score" accordingly. There is obviously scope for fine-tuning and/or increasing the complexity/number of rules used, but more complexity means slower processing times. This custom function avoids the issue of results like beatte ➝ beats: 0.667 and beatus ➝ certus: 0.667, instead scoring them as beatte ➝ beats: 0.79167 and beatus ➝ certus: 0.33333.

from itertools import zip_longest
from bisect import insort

# lemmas, forms and the distances dict are defined as in the first snippet above

def custom_hamming_norm_sim(strA, strB, scale=0.5):
    max_str_len = max(len(strA), len(strB))
    max_score_per_char = 1 / max_str_len
    penalty = 1
    score = 0
    for c1, c2 in zip_longest(strA, strB):
        if c1 != c2:
            penalty = penalty * scale
            score += max_score_per_char * penalty
        else:
            p = penalty / scale
            if p < max_score_per_char:
                penalty = p
            score += max_score_per_char * penalty
    return score


@profile
def profile_distance_calcs():
    lemmas_low = [lemma.lower() for lemma in lemmas]
    forms_low = [form.lower() for form in forms]
    for form in forms_low:
        form_distances = []
        for lemma in lemmas_low:
            dist = custom_hamming_norm_sim(lemma, form)
            if dist > 0.25:
                insort(form_distances, (dist, lemma))
        distances[form] = form_distances

    with open("potential_lemmas_hamming.txt", "w") as f:
        for form, form_distances in distances.items():
            for dist, lemma in reversed(form_distances[-2:]):
                f.write(f"{form} ➝  {lemma}: {dist}\n")

if __name__ == "__main__":
    profile_distance_calcs()
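
For reference, calling the function directly on the two pairs mentioned above reproduces those scores (with the default scale=0.5):

print(custom_hamming_norm_sim("beatte", "beats"))   # ≈ 0.79167
print(custom_hamming_norm_sim("beatus", "certus"))  # ≈ 0.33333
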
Answered By: Kyle F Hartzenberg

Update:

An example which pads the strings to the same length so that rapidfuzz.distance.Hamming can be used:

import rapidfuzz
import numpy as np

lemmas = [
    "gran", "vermell", "groc", "atens", "Do", "dOne", "PUrpose", "can", "be", "use",
    "for", "cannon", "amuse", "useful", "user", "become", "downtown", "develop", "fulminate",
    "deduce", "de", "bezant",
]

forms = [
    "preriarenos", "Marinara", "Grand", "Gran", "Grans", "Grands", "Grandeses", "Grandullons",
    "grand", "grandissisimus", "gran", "grans", "grands", "grandeses", "grandullons", "grandullon",
    "grandullones", "uermell", "uermells", "vermell", "vermells", "vermella", "vermelles", "varmellíssimes",
    "uarmellíssimes", "uermellíssimes", "uarnellíssimes", "varmellíssima", "uermella", "uarmella", "uarnella",
    "varnella", "uarnellas", "varnellas", "varmella", "uermelles", "grog", "grogues", "doNE", "donE",
    "doIng", "purposeful", "canonical", "becareful", "being", "berate", "best", "bezant", "full", "fulmination",
    "predict", "downgrade", "down", "developing", "deduct", "deducing",
]

# Length of longest string
max_len = len(max(max(lemmas, key=len), max(forms, key=len), key=len))

# Pad with whitespace
lemma_lowers = [lemma.lower().ljust(max_len) for lemma in lemmas]
form_lowers = [form.lower().ljust(max_len) for form in forms]

scorer = rapidfuzz.distance.Hamming.normalized_similarity

distances = rapidfuzz.process.cdist(
    form_lowers, lemma_lowers, scorer=scorer, workers=-1
)

# apparently `heapq` is "slow"
# use np.argpartition instead: https://stackoverflow.com/a/23734295
top_n = 2

idxs = np.argpartition(-distances, top_n)[:, :top_n]
scores = -np.partition(-distances, top_n)[:, :top_n]

for n, (idx1, idx2, score1, score2) in enumerate(np.hstack([idxs, scores])):
    print(form_lowers[n], "➝ ", lemma_lowers[int(idx1)], ":", score1)
    print(form_lowers[n], "➝ ", lemma_lowers[int(idx2)], ":", score2)

You could use rapidfuzz directly.

https://maxbachmann.github.io/RapidFuzz/Usage/process.html#cdist

import rapidfuzz.process

scores = rapidfuzz.process.cdist(forms, lemmas, workers=-1)
nearest = scores.argmax(axis=1)

# nearest now contains the indexes of `lemmas` with highest closest score

Based on a small benchmark, your code took 1m58.226s

Using .cdist() took 0m11.481s

You can change the default scorer e.g. scorer=rapidfuzz.distance.JaroWinkler.distance
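
For example, with Jaro-Winkler (a sketch; note that JaroWinkler.distance is a distance, so you would either take argmin of it, or use the normalized_similarity variant and keep argmax as above):

import rapidfuzz.distance
import rapidfuzz.process

# Same cdist call as above, but scored with Jaro-Winkler similarity
# (higher = more similar), so argmax still picks the closest lemma.
scores = rapidfuzz.process.cdist(
    forms, lemmas,
    scorer=rapidfuzz.distance.JaroWinkler.normalized_similarity,
    workers=-1,
)
nearest = scores.argmax(axis=1)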

Answered By: jqurious