List search in pandas text column

Question:

I have a dataset with two columns: date and text. The text column contains unstructured information. I have a list of city names to search for in a text column.

I need to get two sets of data:

list_city = [New York, Los Angeles, Chicago]

  1. When all records from the list with a text message match with the dataframe lines

Sample example:

df_1
data         text
06-02-2022   New York, Los Angeles, Chicago, Phoenix
05-02-2022   New York, Houston, Phoenix
04-02-2022   San Antonio, San Diego, Jacksonville

Need result df_1_res:

df_1_res
data         text
06-02-2022   New York, Los Angeles, Chicago, Phoenix

I tried this design, it works, but it doesn’t look very nice:

df_1_res= df_1.loc[df_1["text"].str.contains(list_city[0]) & df_1["text"].str.contains(list_city[1]) & df_1["text"].str.contains(list_city[2])]
  1. When at least one value from the list matches the text in the dataframe lines

Sample example:

df_2
data         text
06-02-2022   New York, Los Angeles, Chicago, Phoenix
05-02-2022   New York, Houston, Phoenix
04-02-2022   San Antonio, San Diego, Jacksonville

Need result df_2_res:

df_2_res
data         text
06-02-2022   New York, Los Angeles, Chicago, Phoenix
05-02-2022   New York, Houston, Phoenix

I tried this design, it works, but it doesn’t look very nice:

df_2_res= df_2.loc[df_1["text"].str.contains(list_city[0]) | df_2["text"].str.contains(list_city[1]) | df_2["text"].str.contains(list_city[2])]

How can it be improved? Since it is planned to change the number of cities in the filtering list.

Asked By: kostya ivanov

||

Answers:

here is one way to do it

For case # 1 : AND Condition

# use re.IGNORECASE to make findall case insensitive

import re
(df_1.loc[df_1['text'].str
        .findall(r'|'.join(list_city),  re.IGNORECASE)
        .apply(lambda x: len(x)).eq(len(list_city))])

data    text
0   06-02-2022  New York, Los Angeles, Chicago, Phoenix

CASE #2 : OR Condition


#create an OR condition using join
# filter using loc

df_2.loc[df_1['text'].str.contains(r'|'.join(list_city))]
    data        text
0   06-02-2022  New York, Los Angeles, Chicago, Phoenix
1   05-02-2022  New York, Houston, Phoenix
Answered By: Naveed

Try using the isin() function
check out how to use     isin function

Output:

Output of the code

Answered By: Adeyemi Michael

Solution with pandas.Series.str.count

list_city = ['New York', 'Los Angeles', 'Chicago']
print('nfirst task - ALL')
df1 = df[df['text'].str.count(r'|'.join(list_city)).eq(len(list_city))]
print(df1)

print('nsecond task - ANY')
df1 = df[df['text'].str.count(r'|'.join(list_city)).gt(0)]
print(df1)
first task - ALL
         data                                     text
0  06-02-2022  New York, Los Angeles, Chicago, Phoenix

second task - ANY
         data                                     text
0  06-02-2022  New York, Los Angeles, Chicago, Phoenix
1  05-02-2022               New York, Houston, Phoenix
Answered By: Алексей Р
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.