Is there a way to speed up matching addresses and level of confidence per match between two data frames for large datasets?

Question:

I have a script (below) that checks a column of addresses in my dataframe against a column of addresses in another dataframe, to see whether they match and how well they match.

I am using rapidfuzz, which I heard is faster than fuzzywuzzy. However, it is still taking a very long time to do the matching and calculations. Here are the CSV files:
main_dataset.csv contains about 3 million records, and reference_dataset.csv contains about 10 records.

Below are the start and end times for each record:

start time: Thu Oct  6 10:51:18 2022
end time: Thu Oct  6 10:51:23 2022
start time: Thu Oct  6 10:51:23 2022
end time: Thu Oct  6 10:51:28 2022
start time: Thu Oct  6 10:51:28 2022
end time: Thu Oct  6 10:51:32 2022
start time: Thu Oct  6 10:51:32 2022
end time: Thu Oct  6 10:51:36 2022
start time: Thu Oct  6 10:51:36 2022
end time: Thu Oct  6 10:51:41 2022
start time: Thu Oct  6 10:51:41 2022
end time: Thu Oct  6 10:51:45 2022
start time: Thu Oct  6 10:51:45 2022
end time: Thu Oct  6 10:51:50 2022
start time: Thu Oct  6 10:51:50 2022
end time: Thu Oct  6 10:51:54 2022
start time: Thu Oct  6 10:51:54 2022
end time: Thu Oct  6 10:51:59 2022

My script is here:

import pandas as pd
from rapidfuzz import process, fuzz
import time
from dask import dataframe as dd

ref_df = pd.read_csv('reference_dataset.csv')
df = dd.read_csv('main_dataset.csv', low_memory=False)

contacts_addresses = list(df.address)
ref_addresses = list(ref_df.ref_address.unique())

def scoringMatches(x, s):
    o = process.extract(x, s, score_cutoff = 60)
    if o != None:
        return o[1]

def match_addresses(add, contacts_addresses, min_score=0):
    response = process.extract(add, contacts_addresses, scorer=fuzz.token_sort_ratio)
    return response


def get_highest_score(scores):
    total_scores = []
    for val in scores:
        total_scores.append(val[1])
    max_value = max(total_scores)
    max_index = total_scores.index(max_value)
    return scores[max_index]


scores_list = []
names = []
for x in ref_addresses:
    # start = time.time()
    # print("start time:", time.ctime(start))
    scores = match_addresses(x, contacts_addresses, 75)
    match = get_highest_score(scores)
    name = (str(x), str(match[0]))
    names.append(name)
    score = int(match[1])
    scores_list.append(score)
    # end = time.time()
    # print("end time:", time.ctime(end))
name_dict = dict(names)

match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
scores_df = pd.DataFrame(scores_list)

merged_results_01 = pd.concat([match_df, scores_df], axis=1)

merged_results_02 = pd.merge(ref_df, merged_results_01, how='right', on='ref_address')
merged_results_02.to_csv('results.csv')

Asked By: Kelly Tang

||

Answers:

For now, it is recommended to use process.cdist, which compares two sequences of strings and returns a similarity matrix, instead of process.extract/process.extractOne, since many of the newer performance improvements have so far only been added to this function.

Namely, those improvements are:

  1. support for multithreading using the workers argument
  2. support to compare multiple short sequences (<= 64 characters) in parallel using SIMD on x64.

Both of these improvements will be added to process.extract and process.extractOne at some point, but at this point (rapidfuzz==v2.11.1) they only exist in process.cdist.

A couple relevant issues for future improvements on this front are:

Using process.cdist could, for example, be implemented in the following way:

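from itertools import islice

from rapidfuzz import process, fuzz

# ref_addresses, contacts_addresses, names and scores_list come from the question's script.
# Process the reference addresses in chunks so that the score matrix
# (chunk_size x len(contacts_addresses)) stays bounded in memory.
chunk_size = 100
ref_addr_iter = iter(ref_addresses)
while ref_addr_chunk := list(islice(ref_addr_iter, chunk_size)):
    # One row per reference address, one column per contact address.
    scores = process.cdist(ref_addr_chunk, contacts_addresses, scorer=fuzz.token_sort_ratio, score_cutoff=75, workers=-1)
    # Column index of the best-scoring contact address for each row.
    max_scores_idx = scores.argmax(axis=1)
    for ref_addr_idx, score_idx in enumerate(max_scores_idx):
        names.append((ref_addr_chunk[ref_addr_idx], contacts_addresses[score_idx]))
        scores_list.append(scores[ref_addr_idx, score_idx])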
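
The chunking loop above can also be looked at in isolation. The following is a small sketch using only the standard library; the helper name chunked and the sample data are hypothetical and only illustrate how the islice loop splits a list into pieces.

```python
from itertools import islice


def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items (hypothetical helper)."""
    it = iter(items)
    # islice pulls up to chunk_size items from the shared iterator each time;
    # the loop stops once the iterator is exhausted and the list is empty.
    while chunk := list(islice(it, chunk_size)):
        yield chunk


# Made-up data standing in for the reference addresses.
ref_addresses = [f"address {i}" for i in range(7)]
chunks = list(chunked(ref_addresses, 3))
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)
```

Each chunk would then be passed to process.cdist as the queries. A chunk of k queries against n choices produces a k x n score matrix, so chunk_size trades memory use against the number of cdist calls.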
Answered By: maxbachmann