Filter for rows if any value in a list of substrings is contained in any column in a dataframe

Question:

Suppose I have a dataframe df as:

df = pd.DataFrame({'Index': [1, 2, 3, 4, 5],
                   'Name': ['A', 'B', 100, 'C', 'D'],
                   'col1': [np.nan, 'bbby', 'cccy', 'dddy', 'EEEEE'],
                   'col2': ['water', np.nan, 'WATER', 'soil', 'cold air'],
                   'col3': ['watermelone', 'hot AIR', 'air conditioner', 'drink', 50000],
                  'Results': [1000, 2000, 3000, 4000, 5000]})


Out

Index  Name  col1     col2         col3           Results
    1  A     NaN    water       watermelone        1000
    2  B     bbbY    NaN         hot AIR           2000
    3  100   cccY    water       air conditioner   3000
    4  C     dddf    soil        drink             4000
    5  D     EEEEE   cold air    50000             5000

I have a list: matches = ['wat','air']

How can I choose all rows with col1 or col2 or col3 containing i in matches.

Expected output:

Index  Name  col1     col2         col3           Results
    1  A     NaN     water       watermelone       1000
    2  B     bbbY    NaN         hot AIR           2000
    3  100   cccY    water       air conditioner   3000

    5  D     EEEEE   cold air    50000              5000
Asked By: Joe

||

Answers:

You can use .T to transpose the dataframe and str.contains to check the values column-wise and then transpose back (also str.contains can have multiple values passed to if separated with |, which is why I change the list to a string with matches = '|'.join(matches)).

The benefit of transposing the dataframe is that you can use column-wise pandas method instead of looping through rows or a long lambda x: list comprehension. This technique should have good performance compared to a lambda x with axis=1 answer:

# df = df.set_index('Index')
matches = ['wat','air']
matches = '|'.join(matches)
df = df.reset_index(drop=True).T.fillna('')
df = df.T[[df[col].str.lower().str.contains(matches).values.any() for col in df.columns]]
df
Out[1]: 
  Name   col1      col2             col3
0    A            water      watermelone
1    B   bbbY                    hot AIR
2    B   cccY     water  air conditioner
4    D  EEEEE  cold air              eat
Answered By: David Erickson

Try this as well:

df = df[df['col1'].str.contains('|'.join(matches))|df['col2'].str.contains('|'.join(matches))|df['col3'].str.contains('|'.join(matches))]

Prints:

  Name   col1      col2             col3
1    A   aadY     water      watermelone
2    B   bbbY       air          hot AIR
3    B   cccY     water  air conditioner
5    D  EEEEE  cold air              eat
Answered By: sharathnatraj

Here is how I would do it:

df.loc[(df.apply(lambda x: any(m in str(v).lower() 
                               for v in x.values 
                               for m in matches), axis=1))]

Here is a efficiency comparison test with @David Erikson answer on a 50,000 row dataframe on a 15 core cpu setup:

test_df = pd.concat([df]*10000)
test_df.reset_index(drop=True, inplace=True)
%timeit result_df = test_df.loc[(test_df.apply(lambda x: any(m in str(v).lower() for v in x.values for m in matches), axis=1))]

455 ms ± 5.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

matchess = '|'.join(matches)
%timeit test_df = test_df.reset_index(drop=True).T.fillna('')

2.97 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_df_T = test_df_T.applymap(str)

7.94 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_df_2 = test_df.T[[test_df[col].str.lower().str.contains(matchess).values.any() for col in test_df.columns]]

28 s ± 1.6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

So just the last line of the second method is more than 56 times slower. (all together ~80 times slower)

Answered By: Ehsan Fathi
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.