Pandas: filter rows where one column contains another column
Question:
How can I filter rows where the value in one column contains the value in another column?
For example, given a DataFrame with two columns A and B, can we keep only the rows where B contains A? The check should be row by row: B must contain the A value from the same row, not just any A value from the whole DataFrame.
A B
'lol' 'lolec'
'ram' 'rambo'
'ki' 'pio'
Result:
A B
'lol' 'lolec'
'ram' 'rambo'
Answers:
You can use boolean indexing with a mask created by apply and the in operator if you need to compare columns A and B row by row:
# if necessary, strip the ' characters from all values
df = df.apply(lambda x: x.str.strip("'"))
#df = df.applymap(lambda x: x.strip("'"))
print (df.apply(lambda x: x.A in x.B, axis=1))
0 True
1 True
2 False
dtype: bool
df = df[df.apply(lambda x: x.A in x.B, axis=1)]
print (df)
A B
0 lol lolec
1 ram rambo
The difference between the solutions shows up when the input DataFrame is changed so that the matching values sit in different rows:
print (df)
A B
0 lol pio
1 ram rambo
2 ki lolec
print (df[df.apply(lambda x: x.A in x.B, axis=1)])
A B
1 ram rambo
print (df[df['B'].str.contains("|".join(df['A']))])
A B
1 ram rambo
2 ki lolec
To improve performance, use a list comprehension:
df = df[[a in b for a, b in zip(df.A, df.B)]]
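The list comprehension above can be tried on its own with a sketch like the following. It assumes the sample data from the question with the quotes already stripped; the names mask and result are introduced here only for illustration.

```python
import pandas as pd

# Sample data from the question (quotes already stripped).
df = pd.DataFrame({"A": ["lol", "ram", "ki"], "B": ["lolec", "rambo", "pio"]})

# Row-wise substring test: is the A value contained in the B value of the same row?
mask = [a in b for a, b in zip(df["A"], df["B"])]

# Boolean indexing with a plain list of booleans keeps only the matching rows.
result = df[mask]
print(result)
```

This version avoids calling a Python function once per row through apply with axis=1, which is the likely reason it runs faster on larger frames.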
You can use str.contains to match any of the substrings by joining the values of the other Series with the regex | character, which means OR. Note that this checks each B against every value in A, not only the A value in the same row:
df[df['B'].str.contains("|".join(df['A']))]
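Because the joined string is interpreted as a regular expression, values in A that contain characters such as . or ( would change the pattern. A minimal sketch of one way to guard against that, assuming literal substring matching is what is wanted, is to escape each value with re.escape first. The names pattern and any_match are illustrative only.

```python
import re
import pandas as pd

# Changed input from the first answer: the matching values sit in different rows.
df = pd.DataFrame({"A": ["lol", "ram", "ki"], "B": ["pio", "rambo", "lolec"]})

# Escape each value so regex metacharacters in A are matched literally,
# then join with | to build an "any of these substrings" pattern.
pattern = "|".join(re.escape(a) for a in df["A"])

# Keeps every row whose B contains ANY value from column A (not a row-wise check).
any_match = df[df["B"].str.contains(pattern)]
print(any_match)
```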