Nested Loop Optimisation in Python for a list of 50K items

Question:

I have a csv file with roughly 50K rows of search engine queries. Some of the search queries are the same, just in a different word order, for example "query A this is" and "this is query A".

I’ve tested using fuzzywuzzy’s token_sort_ratio function to find queries with matching word order, which works well. However, I’m struggling with the runtime of the nested loop and am looking for optimisation tips.

Currently the nested for loops take around 60 hours to run on my machine. Does anyone know how I might speed this up?

Code below:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
from tqdm import tqdm

filePath = '/content/queries.csv'

df = pd.read_csv(filePath)

table1 = df['keyword'].to_list()
table2 = df['keyword'].to_list()

data = []

for kw_t1 in tqdm(table1):
    for kw_t2 in table2:
        score = fuzz.token_sort_ratio(kw_t1, kw_t2)
        if score == 100 and kw_t1 != kw_t2:
            data += [[kw_t1, kw_t2, score]]

data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])

Any advice would be appreciated.

Thanks!

Asked By: Jamie Jamieson


Answers:

Since what you are looking for are strings consisting of identical words (just not necessarily in the same order), there is no need to use fuzzy matching at all. You can instead use collections.Counter to create a frequency dict for each string, and group the strings in a dict of lists keyed by those frequency dicts. You can then output the groups whose length is greater than 1.

Since dicts are not hashable, you can make them keys of a dict by converting them to frozensets of tuples of key-value pairs first.

This improves the time complexity from your code's O(n^2) to O(n), while also avoiding the overhead of performing fuzzy matching.

from collections import Counter

matches = {}
for query in df['keyword']:
    # Key each query by the multiset of its words; a frozenset of the
    # Counter's items is hashable, so it can be used as a dict key
    matches.setdefault(frozenset(Counter(query.split()).items()), []).append(query)

data = [match for match in matches.values() if len(match) > 1]

Demo: https://replit.com/@blhsing/WiseAfraidBrackets
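
If you still need the pairwise (query, queryComparison, score) layout from the original code, each group can be expanded into its ordered pairs. A minimal sketch, assuming the data list built above (the score is fixed at 100, since every query in a group contains exactly the same words):

from itertools import permutations
import pandas as pd

# Expand each group of equivalent queries into ordered pairs, mirroring the
# rows produced by the original nested loops
rows = [[a, b, 100] for group in data for a, b in permutations(group, 2)]

data_df = pd.DataFrame(rows, columns=['query', 'queryComparison', 'score'])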

Answered By: blhsing

Using apply in pandas usually works faster:

keywords = df['keyword'].to_list()

def compare(kw_t1):
    found_duplicates = []
    for kw_t2 in keywords:
        score = fuzz.token_sort_ratio(kw_t1, kw_t2)
        if score == 100 and kw_t1 != kw_t2:
            found_duplicates.append(kw_t2)
    return found_duplicates

df["duplicates"] = df['keyword'].apply(compare)
Answered By: Daniel Kanzel

I don’t think you need fuzzywuzzy here: you are just checking for equality (score == 100) of the sorted queries, but with token_sort_ratio you are sorting the queries over and over. So I suggest you:

  • create a "base" list and a "sorted-elements" one
  • iterate over the pairs of elements, comparing their pre-sorted versions.

This will still be O(n^2), but you will be sorting 50_000 strings instead of 2_500_000_000!

filePath = '/content/queries.csv'
df = pd.read_csv(filePath)

table_base = df['keyword'].to_list()
# Sort the words of each query once up front; token_sort_ratio compares the
# sorted tokens, so two queries match when their sorted word lists are equal
table_sorted = [sorted(kw.split()) for kw in table_base]
data = []
ln = len(table_base)

for i in range(ln - 1):
    for j in range(i + 1, ln):
        if table_sorted[i] == table_sorted[j]:
            data += [[table_base[i], table_base[j], 100]]

data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Answered By: gimix