Filter DataFrame for most matches

Question:

I have a list (list_to_match = ['a','b','c','d']) and a dataframe like this one below:

Index  One  Two  Three  Four
1      a    b    d      c
2      b    b    d      d
3      a    b    d
4      c    b    c      d
5      a    b    c      g
6      a    b    c
7      a    s    c      f
8      a    f    c
9      a    b
10     a    b    t      d
11     a    b    g
100    a    b    c      d

My goal is to filter for the rows with the most matches with the list, position by position (e.g. position 1 in the list has to match column 1, position 2 column 2, etc.).
In this specific case, excluding row 100, rows 5 and 6 would be the ones selected, since they match 'a', 'b' and 'c'; but if row 100 were included, row 100 and any other rows matching all elements would be the ones selected.
Also, the list might change in length, e.g. list_to_match = ['a','b'].

Thanks for your help!

Asked By: Dario Bani


Answers:

You can iterate over the columns, dropping rows that don’t match the corresponding element in the list to match. With a little extra bookkeeping, we stop filtering when an additional filter operation would produce an empty DataFrame:

for colname, item_to_match in zip(df.columns, list_to_match):
    filter_result = df[df[colname] == item_to_match]
    if len(filter_result.index):
        df = filter_result
    else:
        break  # a further filter would produce an empty DataFrame

This outputs:

      One Two Three Four
Index
5       a   b     c    g
6       a   b     c  NaN
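
For reference, here is a self-contained sketch of that loop on a subset of the question's data (rows 1 to 6 only; the blanks in the Four column are assumed to be missing values, which is my reading of the question's gaps):

```python
import pandas as pd

# Rebuild rows 1-6 of the question's DataFrame; the gaps in the Four
# column are assumed to be missing values (None).
df = pd.DataFrame(
    {
        "One":   ["a", "b", "a", "c", "a", "a"],
        "Two":   ["b", "b", "b", "b", "b", "b"],
        "Three": ["d", "d", "d", "c", "c", "c"],
        "Four":  ["c", "d", None, "d", "g", None],
    },
    index=[1, 2, 3, 4, 5, 6],
)
list_to_match = ["a", "b", "c", "d"]

# Filter column by column; stop as soon as the next filter would
# leave an empty DataFrame.
for colname, item_to_match in zip(df.columns, list_to_match):
    filter_result = df[df[colname] == item_to_match]
    if len(filter_result.index):
        df = filter_result
    else:
        break

print(df.index.tolist())  # -> [5, 6]
```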
Answered By: BrokenBenchmark

I would use:

list_to_match = ['a','b','c','d']

# compute a mask of identical values
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)

# get the rows with max matches
out = df[mask2.eq(mask2.max())]
# or
# out = df.loc[mask2.nlargest(1, keep='all').index]

print(out)

Output (ignoring the input row 100):

      One Two Three  Four
Index                    
5       a   b     c     g
6       a   b     c  None
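
To see why the cummin step matters, here is a small sketch on hypothetical two-row data: cummin(axis=1) zeroes out any match that comes after the first mismatch, so only a matching prefix is counted, as the question requires:

```python
import pandas as pd

list_to_match = ["a", "b", "c", "d"]
# Hypothetical rows: the first matches in positions 1, 3 and 4 but not 2;
# the second matches the prefix a, b, c.
df = pd.DataFrame(
    {"One": ["a", "a"], "Two": ["x", "b"], "Three": ["c", "c"], "Four": ["d", "g"]},
    index=[7, 5],
)

mask = df.iloc[:, : len(list_to_match)].eq(list_to_match)
# cummin keeps True only while every earlier column also matched,
# so the out-of-order matches in row 7 are not counted
scores = mask.cummin(axis=1).sum(axis=1)
print(scores.tolist())  # -> [1, 3]
```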
Answered By: mozway

Here is my approach; the steps are explained in the comments below.

import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine



data = {'One':  ['a', 'a', 'a', 'a'], 
        'Two':  ['b', 'b', 'b', 'b'],
        'Three':  ['c', 'c', 'y', 'c'], 
        'Four': ['g', 'g', 'z', 'd']}

dataframe_ = pd.DataFrame(data)


#encoding Letters into numerical values so we can compute the cosine similarities
dataframe_[:] = dataframe_.to_numpy().astype('<U1').view(np.uint32)-64

#Our input data which we are going to compare with other rows
input_data = np.array(['a', 'b', 'c', 'd'])

#encode input data into numerical values
input_data = input_data.astype('<U1').view(np.uint32)-64

#compute cosine similarity for each row
dataframe_out = dataframe_.apply(lambda row: 1 - cosine(row, input_data), axis=1)
print(dataframe_out)

output:

0    0.999343
1    0.999343
2    0.973916
3    1.000000

Filtering rows based on their cosine similarities:

df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered)
0  0.999343
1  0.999343
2       NaN
3  1.000000

From here on you can easily find the rows with non-NaN values by their indexes.
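
As a minimal sketch of that last step (reusing the similarity values printed above rather than recomputing them), dropna() keeps only the rows that cleared the threshold and .index gives their labels:

```python
import pandas as pd

# Similarity values taken from the printed output above.
dataframe_out = pd.Series([0.999343, 0.999343, 0.973916, 1.0])

# NaN out the rows below the threshold, then collect the surviving indexes.
df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered.dropna().index.tolist())  # -> [0, 1, 3]
```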

Answered By: Ali