Compare two columns of a pandas DataFrame with a list of strings

Question:

This is my dataframe:

import pandas as pd
df = pd.DataFrame({'a': ['axy a', 'xyz b'], 'b': ['obj e', 'oaw r']})

and I have a list of strings:

s1 = 'lorem obj e'
s2 = 'lorem obj e lorem axy a'
s3 = 'lorem xyz b lorem oaw r'
s4 = 'lorem lorem oaw r'
s5 = 'lorem lorem axy a lorem obj e'
s_all = [s1, s2, s3, s4, s5]

Now I want to take every row and check whether both column values of that row are present in any of the strings in s_all. For example, for the first row I take 'axy a' and 'obj e' and check whether both of them appear in a string of s_all. Both appear in s2 and s5.

The outcome that I want looks like this:

       a      b                              c
0  axy a  obj e        lorem obj e lorem axy a
1  axy a  obj e  lorem lorem axy a lorem obj e
2  xyz b  oaw r        lorem xyz b lorem oaw r

Here is my attempt, but it didn’t work:

l = []
for sentence in s_all:
    for i in range(len(df)):
        if df.a.values[i] in sentence and df.b.values[i] in sentence:
            l.append(sentence)
        else:
            l.append(np.nan)

I tried to append the results to a list and then use that list to create the c column that I want, but it didn’t work.

Asked By: Amir

||

Answers:

You can write a small helper function and apply it row by row to your df:

def func(row):
    out = []
    a, b = row 
    for s in s_all:
        if all([a in s, b in s]):
            out.append(s)
    return out

# If you have more than two columns, or don't know how many, here is a more general approach.
# Apart from that, it is the same function as above.
def func(row):
    out = [] 
    for s in s_all:
        if all([string in s for string in row.tolist()]):
            out.append(s)
    return out

df['c'] = df.apply(func, axis=1)

Or as a one-liner with a lambda function:

df['c'] = df.apply(lambda row: [s for s in s_all if all(string in s for string in row.tolist())], axis=1)

The function returns a list of matching strings for each row.
To give each list element its own row, use explode:

df = df.explode(column='c')
print(df)

Output:

       a      b                              c
0  axy a  obj e        lorem obj e lorem axy a
0  axy a  obj e  lorem lorem axy a lorem obj e
1  xyz b  oaw r        lorem xyz b lorem oaw r
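
Note that the exploded frame keeps the original row labels (0, 0, 1), whereas the desired output in the question is numbered 0, 1, 2. As a minimal sketch, assuming pandas 1.1 or later (where explode accepts ignore_index), the rows can be relabelled in the same step. The variable name out is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': ['axy a', 'xyz b'], 'b': ['obj e', 'oaw r']})
s_all = [
    'lorem obj e',
    'lorem obj e lorem axy a',
    'lorem xyz b lorem oaw r',
    'lorem lorem oaw r',
    'lorem lorem axy a lorem obj e',
]

# Collect, per row, every string that contains both column values.
df['c'] = df.apply(
    lambda row: [s for s in s_all if row['a'] in s and row['b'] in s],
    axis=1,
)

# ignore_index=True relabels the exploded rows 0..n-1.
out = df.explode('c', ignore_index=True)
print(out)
```

On older pandas versions, df.explode('c').reset_index(drop=True) should give the same relabelling.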
Answered By: Rabinzel

Because the patterns in a and b can occur in several of the reference strings, the values of a and b have to be repeated for each match as well. This is done by appending to l_a and l_b, from which a new dataframe df_new is constructed. A modified version of your for loop will do:

l = []
l_a = []
l_b = []
for i in range(len(df)):
    for sentence in s_all:
        if df.a.values[i] in sentence and df.b.values[i] in sentence:
            l.append(sentence)
            l_a.append(df.a.values[i])
            l_b.append(df.b.values[i])

df_new = pd.DataFrame({'a' : l_a, 'b' : l_b, 'c' : l})

This yields

       a      b                              c
0  axy a  obj e        lorem obj e lorem axy a
1  axy a  obj e  lorem lorem axy a lorem obj e
2  xyz b  oaw r        lorem xyz b lorem oaw r
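
The same loop can also be written without positional indexing and without three parallel lists. The following is only a sketch of that idea, not a different algorithm: it pairs the two columns with zip and collects one dict per match (the name rows is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': ['axy a', 'xyz b'], 'b': ['obj e', 'oaw r']})
s_all = [
    'lorem obj e',
    'lorem obj e lorem axy a',
    'lorem xyz b lorem oaw r',
    'lorem lorem oaw r',
    'lorem lorem axy a lorem obj e',
]

# One dict per (row, sentence) pair in which both values occur.
rows = [
    {'a': a, 'b': b, 'c': sentence}
    for a, b in zip(df['a'], df['b'])
    for sentence in s_all
    if a in sentence and b in sentence
]

# Passing columns keeps a, b, c even if there are no matches at all.
df_new = pd.DataFrame(rows, columns=['a', 'b', 'c'])
print(df_new)
```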
Answered By: 7shoe

You can create a new Series using apply and explode, and concat it with your DataFrame:

match_series = df.apply(lambda row: [s for s in s_all if row['a'] in s and row['b'] in s], axis=1).explode()
pd.concat([df, match_series], axis=1)

Output

       a      b                              0
0  axy a  obj e        lorem obj e lorem axy a
0  axy a  obj e  lorem lorem axy a lorem obj e
1  xyz b  oaw r        lorem xyz b lorem oaw r
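
In the output above the new column is labelled 0, because the exploded Series has no name. A possible variation, shown here only as a sketch, is to name the Series c and attach it with DataFrame.join, which aligns on the index and repeats the left-hand rows for each match. The names matches and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': ['axy a', 'xyz b'], 'b': ['obj e', 'oaw r']})
s_all = [
    'lorem obj e',
    'lorem obj e lorem axy a',
    'lorem xyz b lorem oaw r',
    'lorem lorem oaw r',
    'lorem lorem axy a lorem obj e',
]

# Series of lists: for each row, the strings containing both values.
matches = df.apply(
    lambda row: [s for s in s_all if row['a'] in s and row['b'] in s],
    axis=1,
)

# Name the exploded Series 'c', join it on the index, then renumber the rows.
result = df.join(matches.explode().rename('c')).reset_index(drop=True)
print(result)
```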
Answered By: Mortz