How can I make this Indexing algorithm more efficient?

Question:

I’ve got a DataFrame (derived from a CSV file with various columns) with 172033 rows. I’ve created a custom indexing function that blocks pairs of records that don’t have similar ‘name’ attributes. The problem lies in the efficiency of the algorithm: it takes about a minute just to reach the 10th iteration of the outer loop, so indexing the whole dataset would take far too much time. How can I make my algorithm more efficient?

class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        indici1=[]
        indici2=[]
        for i in range(0, 173033):
            if(i%2 == 0):
                print(i) #keeps track of the iteration
            for j in range(i, 173033):
                if(similar(df_a.loc[i, 'name'], df_a.loc[j, 'name'])>0.5):
                    indici1.append(i)
                    indici2.append(j)
        
        indici = [indici1, indici2]
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))

I want to obtain a MultiIndex object, which would be an array of tuples containing the indexes of the pairs of records that are similar enough not to be blocked.

[MultiIndex([(     0,    0),
             (     0,    22159),
             (     0,    67902),
             (     0,    67903),
             (     1,    1),
             (     1,    1473),
             (     1,    5980),
             (     1,    123347),
             (     2,    2),
             ...

Here’s the code for the similarity function:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Here’s an example of the dataframe I have as input:

   name
0  Amazon
1  Walmart
2  Apple
3  Amazon.com
4  Walmart Inc.

I would like the resulting MultiIndex to contain links between 0 and 3, between 1 and 4, and all the self-pairs (0 and 0, 1 and 1, etc.).

Asked By: Maurizio Brini


Answers:

You are using the .append method of list; according to the PythonWiki that method is O(1), but "individual actions may take surprisingly long, depending on the history of the container". You might use collections.deque, which does not have such quirks: just add import collections and do

indici1=collections.deque()
indici2=collections.deque()
...
indici = [list(indici1), list(indici2)]

If that does not help enough, you would need to look at the similar function itself for possible improvements.

Answered By: Daweo

We asked you to reveal the similar() function but you’ve declined to do so.

You wrote

        for i in range(0, 173033):
            ...
            for j in range(i, 173033):
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:

So you plan to call that mysterious similar 14_970_296_061 (about 15 billion) times,
since your inner loop starts at j = i rather than at 0.
I’m guessing that’s going to take some little while, maybe a week?
We describe this as O(n^2) quadratic performance.

Here is the way out.

First, sort your dataset. It will only cost O(n log n), cheap!

Next, find something in the similar loss function that allows you to
simplify the problem, or design a new related loss function which allows that.
Spoiler alert — we can only do comparisons against local neighbors,
not against far flung entries more than 100,000 positions away.
I will focus on the five example words.

It seems likely that your mysterious function reports
a large value for similar("Walmart", "Walmart Inc.")
and a small value when comparing those strings against "Apple".
So let’s adopt a new cost function rule.
If the initial character of the two strings is not identical,
the similarity is immediately reported as 0.

So now we’re faced with comparing Apple against the two Amazons,
and Walmart against Walmart Inc.
We have partitioned the problem into "A" and "W" subsets.
The quadratic monster is starting to die.
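
Here is a minimal sketch of that partitioning idea, assuming the question’s similar() helper and its 0.5 threshold. The function name partitioned_pairs and the first-character blocking key are illustrative choices, not the answerer’s actual code:

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def partitioned_pairs(df, threshold=0.5):
    """Compare names only within groups sharing the same first character."""
    groups = defaultdict(list)
    for idx, name in df['name'].items():
        groups[name[:1].upper()].append((idx, name))  # blocking key: first letter

    indici1, indici2 = [], []
    for members in groups.values():
        for idx, _ in members:            # self-pairs are trivially linked
            indici1.append(idx)
            indici2.append(idx)
        for (i, a), (j, b) in combinations(members, 2):
            if similar(a, b) > threshold:
                indici1.append(i)
                indici2.append(j)
    return pd.MultiIndex.from_arrays([indici1, indici2], names=('first', 'second'))

Swapping the blocking key for name[:3], or for a phonetic key, gives the more aggressive variants discussed below.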

Minor side note: Since your similarity function likely is
symmetric, it suffices to examine just half the square,
where i <= j, which your range(i, ...) inner loop already does.

A practical implementation for a dataset of this size
is going to need to be more aggressive, perhaps insisting
that the initial 2 or 3 letters be identical.
Perhaps you can run a hyphenation algorithm and look
at identical syllables.
Or use Metaphone
or Soundex.

The simplest possible thing you could do is define a
window W of maybe 10. When examining the sorted dataset,
we arbitrarily declare that similarity between i, j entries
shall be zero if abs(i - j) > W.
This may impact accuracy, but it’s great for performance:
it lets you prune the search space before you even call the function.
We went from O(n^2) to O(n * W), which is linear in n for a fixed window
(on top of the O(n log n) sort).
Perfect accuracy is hardly relevant if you never wait
long enough for the code to produce an answer.
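
Here is one way that window idea might look in code. It is only a sketch, assuming the question’s similar() helper; windowed_pairs, window, and threshold are illustrative names and values, not anything from the question:

import pandas as pd
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def windowed_pairs(df, window=10, threshold=0.5):
    """Compare each name only against the next `window` names in sorted order."""
    order = df['name'].sort_values().index.to_list()  # index labels, sorted by name
    names = df.loc[order, 'name'].to_list()
    indici1, indici2 = [], []
    for pos, i in enumerate(order):
        indici1.append(i)  # the self-pair (i, i) is trivially linked
        indici2.append(i)
        for offset in range(1, window + 1):
            if pos + offset >= len(order):
                break
            if similar(names[pos], names[pos + offset]) > threshold:
                indici1.append(i)
                indici2.append(order[pos + offset])
    # roughly n * window comparisons instead of n * (n - 1) / 2
    return pd.MultiIndex.from_arrays([indici1, indici2], names=('first', 'second'))

On the five example names this yields the (0, 3) and (1, 4) links plus the self-pairs (in sorted-name order rather than index order), at a cost of roughly n * W comparisons after the O(n log n) sort.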

Use these ideas to code up a solution, and
let us know
how you resolved the details.


EDIT

@Pierre D. advocates the use of
locality sensitive hashing
to avoid lost revenue due to {prefix1}{entity} and {prefix2}{entity}
being far apart in a sorted list yet close together IRL and in hash space.
Metaphone is one technique for inducing LSH collisions,
but it seldom has any effect on initial letter / sort order.
Das, et al.,
describe a MinHash technique in
"Google News Personalization: Scalable Online Collaborative Filtering".
The number of syllables in the business names you’re examining
will be significantly smaller than the size of the click-sets seen by Google.
Hashes give binary results of hit or miss,
rather than a smooth ranking or distance metric.
The need to promote neighbor collisions,
while steering clear of the birthday paradox,
makes it difficult to recommend tuning parameters
absent a snapshot of the production dataset.
Still, it is a technique you should keep in mind.
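
For what it’s worth, a toy character-shingle MinHash with banding might look like the sketch below. This is not the Das et al. scheme; shingles(), minhash_signature(), candidate_pairs(), and the num_hashes / band_size / k values are all illustrative guesses rather than tuned parameters:

import random
from collections import defaultdict

def shingles(name, k=3):
    """Character k-grams of a lowercased name."""
    name = name.lower()
    return {name[i:i + k] for i in range(max(1, len(name) - k + 1))}

def minhash_signature(shingle_set, seeds):
    """One min-hash value per seed; equal values hint at high Jaccard similarity."""
    return tuple(min(hash((seed, s)) for s in shingle_set) for seed in seeds)

def candidate_pairs(names, num_hashes=8, band_size=2, seed=0):
    rng = random.Random(seed)
    seeds = [rng.getrandbits(32) for _ in range(num_hashes)]
    buckets = defaultdict(list)
    for idx, name in enumerate(names):
        sig = minhash_signature(shingles(name), seeds)
        # banding: split the signature into bands; each band is a bucket key
        for b in range(0, num_hashes, band_size):
            buckets[(b, sig[b:b + band_size])].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs  # candidates to feed into the exact similar() check

Members of the same bucket become candidate pairs that still have to pass the exact similar() check; the hashing only shrinks the quadratic search space.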

Asking spacy to do a
named entity recognition
pre-processing pass over your data might also be
worth your while, in the hopes of normalizing
or discarding any troublesome "noise" prefixes.
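
If you want to experiment with that suggestion, a minimal sketch could look like this (assuming the en_core_web_sm model is installed; whether "Inc." actually gets stripped depends entirely on the model):

import spacy

# assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def org_only(name):
    """Keep only the spans spacy tags as ORG; fall back to the raw name."""
    doc = nlp(name)
    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    return " ".join(orgs) if orgs else name

print(org_only("Walmart Inc."))  # exact result depends on the model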

Answered By: J_H

As others have pointed out, an exact solution to your problem requires O(N^2) comparisons, which means it won’t scale well for very large datasets. Nonetheless, I think there’s still a lot of room for improvement.

Here are some strategies you can use to speed up your code:

  1. If your dataset contains many duplicate name values, you can use "memoization" to avoid re-computing the similar score for duplicate name pairs. Of course, caching all 172k^2 pairs would be devastatingly expensive, but if the data is pre-sorted by name, then lru_cache with 172k items should work just fine.

  2. Looking at the difflib documentation, it appears that you have the option of quickly filtering out "obvious" mismatches. If you expect most pairs to be "easy" to eliminate from consideration, then it makes sense to first call SequenceMatcher.quick_ratio() (or even real_quick_ratio()), followed by ratio() only if necessary.

  3. There will be some overhead in the ordinary control flow.

    • Calling df.loc many times in a for-loop might be a bit slow in comparison to simple iteration.
    • You can use itertools.combinations to avoid writing a nested for-loop yourself.
    • BTW, tqdm provides a convenient progress bar, which will give a better indication of true progress than the print statements in your code.

Lastly, I saw no need for the df_b parameter in your function above, so I didn’t include it in the code below. Here’s the full solution:

import math
import pandas as pd
from difflib import SequenceMatcher
from functools import lru_cache
from itertools import combinations
from tqdm import tqdm

@lru_cache(173_000)
def is_similar(a, b):
    matcher = SequenceMatcher(None, a, b)
    if matcher.quick_ratio() <= 0.5:
        return False
    return matcher.ratio() > 0.5

def link_index(df):
    # We initialize the index result pairs with [(0,0), (1,1), (2,2), ...]
    # because they are trivially "linked" and your problem statement
    # says you want them in the results.
    indici1 = df.index.tolist()
    indici2 = df.index.tolist()

    # Sort the names so that our lru_cache is effective,
    # even though it is limited to 173k entries.
    name_items = df['name'].sort_values().items()

    pairs = combinations(name_items, 2)
    num_pairs = math.comb(len(df), 2)
    for (i, i_name), (j, j_name) in tqdm(pairs, total=num_pairs):
        if is_similar(i_name, j_name):
            indici1.append(i)
            indici2.append(j)

    indici = [indici1, indici2]
    links = pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
    return links.sortlevel([0, 1])[0]

Quick Test:

names = ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']
df = pd.DataFrame({'name': names})
link_index(df)

Output:

MultiIndex([(0, 0),
            (0, 3),
            (1, 1),
            (1, 4),
            (2, 2),
            (3, 3),
            (4, 4)],
           names=['first', 'second'])

Let me know if that speeds things up on your actual data!

Let’s set some realistic expectations.

Your original estimate was ~1 minute for 10 "iterations". That implies the total time would have been ~6 days:

print(math.comb(172033, 2) / (10*172033) / 60 / 24)

On the other hand, merely iterating through the full set of i,j combinations and doing absolutely nothing with them would take ~45 minutes on my machine. See for yourself:

sum(1 for _ in tqdm(combinations(range(172033), 2), total=math.comb(172033, 2)))

So the real solution will take longer than that. Now you’ve got some bounds on what the optimal solution will require: Somewhere between ~1 hour and ~6 days. Hopefully it’s closer to the former!

Answered By: Stuart Berg

In the end, this is what I came up with. After sorting the dataset alphabetically by the ‘name’ attribute, I use this custom index. Every row in the dataset is compared to its neighbours within a range of 100 rows.

class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        t0 = time.time()
        indici1 = []
        indici2 = []
        for i in range(1, 173034):
            if i % 500 == 0:
                print(i)  # progress indicator
            # Compare row i only against its neighbours in the sorted order,
            # clamping the 100-row window at both ends of the index.
            for j in range(max(1, i - 100), min(173034, i + 100)):
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                    indici1.append(i)
                    indici2.append(j)

        indici = [indici1, indici2]
        t1 = time.time()
        print(t1 - t0)
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))

Indexing the dataset takes about 20 minutes on my machine, which is a huge improvement over before. Thanks everyone for the help!

I’ll be looking into locality sensitive hashing suggested by Pierre D.

Answered By: Maurizio Brini