Check if Pandas DataFrame cell contains certain string

Question:

Suppose I have the following Pandas DataFrame:

     a               b
0  NAN  BABA UN EQUITY
1  NAN            2018
2  NAN            2017
3  NAN            2016
4  NAN             NAN
5  NAN   700 HK EQUITY
6  NAN            2018
7  NAN            2017
8  NAN            2016
9  NAN             NAN

For each cell in column b, I want to check if it contains the string EQUITY. If it does, I want to fill column a in the following rows with that string, stopping at the next row where b is NAN, to get the edited DataFrame as follows:

                a               b
0             NAN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NAN             NAN
5             NAN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NAN             NAN

My actual DataFrame is much larger than the above, but the format is similar. I am having trouble figuring out how to check whether a cell contains EQUITY. It seems that str.contains is what I should be using, but it’s not clear to me how to do that.

Asked By: turtle101

Answers:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['NAN'] * 10,
                   'b': ['BABA UN EQUITY', '2018', '2017', '2016', 'NAN',
                         '700 HK EQUITY', '2018', '2017', '2016', 'NAN']})

# Make sure that all NaN values are `np.nan` not `'NAN'` (strings)
df = df.replace('NAN', np.nan)
mask = df['b'].str.contains(r'EQUITY', na=True)
df.loc[mask, 'a'] = df['b']
df['a'] = df['a'].ffill()
df.loc[mask, 'a'] = np.nan

yields

                a               b
0             NaN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NaN             NaN
5             NaN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NaN             NaN

One slightly tricky bit above is how mask is defined. Notice that str.contains
returns a Series which contains not only True and False values, but also NaN:

In [114]: df['b'].str.contains(r'EQUITY')
Out[114]: 
0     True
1    False
2    False
3    False
4      NaN
5     True
6    False
7    False
8    False
9      NaN
Name: b, dtype: object

str.contains(..., na=True) is used to treat the NaNs as True:

In [116]: df['b'].str.contains(r'EQUITY', na=True)
Out[116]: 
0     True
1    False
2    False
3    False
4     True
5     True
6    False
7    False
8    False
9     True
Name: b, dtype: bool

Once you have mask, the idea is simple: copy the values from b into a wherever mask is True:

df.loc[mask, 'a'] = df['b']

Forward-fill the NaN values in a:

df['a'] = df['a'].ffill()

Replace the values in a with NaN wherever mask is True:

df.loc[mask, 'a'] = np.nan
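
Tracing the three steps on a shortened version of the example column makes the intermediate states visible (a sketch; the inline comments show the state of a after each step):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan] * 5,
                   'b': ['BABA UN EQUITY', '2018', '2017', '2016', np.nan]})

# True where b matches EQUITY, and also where b is NaN (na=True)
mask = df['b'].str.contains(r'EQUITY', na=True)

df.loc[mask, 'a'] = df['b']   # a: ['BABA UN EQUITY', NaN, NaN, NaN, NaN]
df['a'] = df['a'].ffill()     # a: ['BABA UN EQUITY'] * 5
df.loc[mask, 'a'] = np.nan    # a: [NaN, 'BABA UN EQUITY', ..., NaN]

print(df)
```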
Answered By: unutbu

1. List comprehension is faster than str.contains

Even though they’re vectorized, Pandas string methods are not optimized, so falling back to Python loops is often much faster, even for large dataframes.

For example, the following two are equivalent (yet the list comprehension is approx. 3 times faster no matter the length of the column):

msk1 = df['b'].str.contains(r'EQUITY', na=True)

msk = [s != s or 'EQUITY' in s for s in df['b'].tolist()]

all(msk == msk1)   # True

That said, str.contains is (arguably) much more readable; besides, both versions are very fast and unlikely to be a performance bottleneck in most code.
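
The claim is easy to measure yourself with timeit (a sketch; the exact speedup depends on the pandas version and the data):

```python
import timeit

import numpy as np
import pandas as pd

# A larger column for a more meaningful measurement (10,000 rows)
df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', '2017', '2016', np.nan] * 2000})

t_str = timeit.timeit(
    lambda: df['b'].str.contains(r'EQUITY', na=True), number=10)
t_list = timeit.timeit(
    lambda: [s != s or 'EQUITY' in s for s in df['b'].tolist()], number=10)

print(f'str.contains:       {t_str:.4f}s')
print(f'list comprehension: {t_list:.4f}s')
```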

2. For multiple strings, use "|".join

To check whether any of a list of strings occurs in each row of a column, join them with a | separator and call str.contains:

lst = ['EQUITY', '16', '19', '20']
msk = df['b'].str.contains(r'|'.join(lst), na=True)
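
Note that str.contains treats the joined pattern as a regex, so if any of the substrings may contain regex metacharacters (a dot, +, parentheses, etc.), escape them first with re.escape (a sketch with made-up data):

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'b': ['A.B EQUITY', 'AXB', np.nan]})

lst = ['A.B', 'EQUITY']                   # 'A.B' contains a regex dot
pattern = '|'.join(map(re.escape, lst))   # 'A\\.B|EQUITY'
msk = df['b'].str.contains(pattern, na=True)
print(msk.tolist())   # [True, False, True]
```

Without re.escape, the unescaped dot in 'A.B' would also match 'AXB'.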

3. Filtering can be done with where() and mask()

The final desired output may be reached by masking certain values in column b using the boolean mask msk. The first step is to remove the non-matched values (the numbers in the OP's example) using where(); then propagate the values with ffill(); finally, remove the values where the condition is True using mask().

msk = df['b'].str.contains(r'EQUITY', na=True)
df['a'] = df['b'].where(msk).ffill().mask(msk)
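
The chained one-liner can be unpacked to inspect each intermediate result (a sketch on a short column; the comments show each step's values):

```python
import numpy as np
import pandas as pd

b = pd.Series(['BABA UN EQUITY', '2018', '2017', np.nan])
msk = b.str.contains(r'EQUITY', na=True)   # [True, False, False, True]

step1 = b.where(msk)    # keep matches/NaNs: ['BABA UN EQUITY', NaN, NaN, NaN]
step2 = step1.ffill()   # propagate forward: ['BABA UN EQUITY'] * 4
a = step2.mask(msk)     # blank out matches: [NaN, 'BABA UN EQUITY', 'BABA UN EQUITY', NaN]

print(a.tolist())
```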

Answered By: cottontail