Python pandas cross check substring and contain relations between columns

Question:

A B C
catastrop animal True
ani count True
rows cata False
Second rownumbers False
counter counters True
column inner False
time
strong
kind
membrane

A and B contain both string and substring, I want to check if column A has a value ‘counter’, and that value is either having a substring in column B ‘count’ or is a substring of column B ‘counters’. If any of this satisfied, return True. (Something like combination of .isin and .str.contains). C is the output, which state whether the value of A is substring of B or A contains B. Like catastrop contain cata, so True, and ani is substring of animal so True

The initial code came into my mind is

list1=a1['A'].tolist()
output1=[]
for i in list1:
    output1.append(any(a1['B'].str.contains(i,regex=False)))

and do an or operation with column B

But if I doing it reversely, like

list2=a1['B'].tolist()
output2=[]
for i in list2:
    output2.append(any(a1['A'].str.contains(i,regex=False)))

list 2 would contain verify result about column B, rather than column A.

How could I write this code?

Asked By: XJTLU media

||

Answers:

One way to achieve this is as follows:

  • For a_contains_b, use Series.str.contains with a string consisting of a list of values from column B, joined with | delimiter to create alternatives (so: 'animal|count|cata|rownumbers|counters|inner|time|strong|kind|membrane').
  • For b_contains_a, we want to use Series.str.extractall instead, since we need to feed the result back to Series.isin to know which rows from column A have been matched. I.e. the result of extractall (+ [0].tolist()) will be ['ani', 'counter'], which we use as input for df.A.isin.
  • Finally, we use both boolean series as input for the new column. Use |, since we need True if either series contains True. Wrap the result in brackets and apply Series.where combined with Series.notna to overwrite all False values in rows that have NaN values in column A.
import pandas as pd
import numpy as np

data = {'A': {0: 'catastrop', 1: 'ani', 2: 'rows', 3: 'Second', 4: 'counter', 
              5: 'column', 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan}, 
        'B': {0: 'animal', 1: 'count', 2: 'cata', 3: 'rownumbers', 
              4: 'counters', 5: 'inner', 6: 'time', 7: 'strong', 8: 'kind', 
              9: 'membrane'}, 
        'C': {0: True, 1: True, 2: False, 3: False, 4: True, 5: False, 
              6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan}}

df = pd.DataFrame(data)

a_contains_b = df.A.str.contains('|'.join(df.B.dropna().tolist()))

b_contains_a = df.A.isin(df.B.str.extractall(
    '('+'|'.join(df.A.dropna().tolist()) + ')')[0].tolist())

df['D'] = (b_contains_a | a_contains_b).where(df.A.notna())

print(df)

           A           B      C      D
0  catastrop      animal   True   True
1        ani       count   True   True
2       rows        cata  False  False
3     Second  rownumbers  False  False
4    counter    counters   True   True
5     column       inner  False  False
6        NaN        time    NaN    NaN
7        NaN      strong    NaN    NaN
8        NaN        kind    NaN    NaN
9        NaN    membrane    NaN    NaN

N.B. If you want the matching to be case-insensitive, consider using Series.str.lower. In the case of str.contains you can also make use of the parameter case, and set it to False.

Answered By: ouroboros1
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.