Search for terms in list within dataframe column, add terms found to new column

Question:

Example dataframe:

import pandas as pd

data = pd.DataFrame({'Name': ['Nick', 'Matthew', 'Paul'],
                     'Text': ["Lived in Norway, England, Spain and Germany with his car",
                              "Used his bikes in England. Loved his bike",
                              "Lived in Alaska"]})

Example list:

example_list = ["England", "Bike"]

What I need

I want to create a new column, called x: if a term from example_list is found as a string or substring in data.Text (case-insensitive), the word in which it was found is added to the new column.

Output

So in row 1, the word England was found and returned, and bike was found and returned, as well as bikes (of which bike is a substring).

Progress so far:

With the following code I have managed to return the terms that match regardless of case. However, it returns only the matched term, not the whole word that contains it. For example, if I search for "bike" and the text contains "bikes", I want it to return "bikes".

import re

pattern = fr'({"|".join(example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(", ")

Asked By: Nicholas

Answers:

I think I might have found a solution for your pattern there:

import re
# Allow any run of letters before and after each term, so the whole word is returned.
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")

Basically, I extend the pattern to optionally allow letters before and after each term. You don't explicitly mention matching letters before the term, so the leading [a-zA-Z]* may have to be omitted.
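
To make the effect of the leading and trailing [a-zA-Z]* concrete, here is a minimal self-contained sketch on the question's data. The helper name build_pattern and its allow_prefix flag are hypothetical, introduced only for illustration. The call to re.escape is an added precaution that is not part of the pattern above; it matters only if a search term contains regex metacharacters.

```python
import re

import pandas as pd

data = pd.DataFrame({
    'Name': ['Nick', 'Matthew', 'Paul'],
    'Text': ["Lived in Norway, England, Spain and Germany with his car",
             "Used his bikes in England. Loved his bike",
             "Lived in Alaska"],
})
example_list = ["England", "Bike"]


def build_pattern(terms, allow_prefix=True):
    """Hypothetical helper: match each term plus any adjacent letters."""
    # With allow_prefix=False the term must start at a word boundary,
    # so "bikes" still matches "bike" but "motorbike" does not.
    prefix = "[a-zA-Z]*" if allow_prefix else r"\b"
    # re.escape guards against terms containing characters such as "." or "+".
    return "|".join(f"{prefix}{re.escape(t)}[a-zA-Z]*" for t in terms)


pattern = build_pattern(example_list)
# No capturing group is used, so findall returns the full matched words.
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
print(data['x'].tolist())
# ['England', 'bikes,England,bike', '']
```

Passing allow_prefix=False is one way to express the "omit the letters before" variant: the term may then be extended only to the right.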

As output I get the following values in x:

Nick: England
Matthew: bikes,England,bike
Paul: (empty string)

I’m not sure which format you want for the x column. In your code you join the matches with commas (which I followed here), but in the picture you only have a list of the values. If you specify this, I can update my solution.
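
The two formats differ only in the final step, so either is easy to produce. A short sketch, assuming the same kind of pattern as above (here with re.escape added as a precaution); the column names x_list and x_joined are hypothetical and chosen only for this example.

```python
import re

import pandas as pd

data = pd.DataFrame({'Text': ["Used his bikes in England. Loved his bike",
                              "Lived in Alaska"]})
example_list = ["England", "Bike"]
pattern = "|".join(f"[a-zA-Z]*{re.escape(term)}[a-zA-Z]*" for term in example_list)

# Option 1: keep the raw result of findall, one Python list per row.
data['x_list'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE)

# Option 2: collapse each list into a single comma-separated string.
data['x_joined'] = data['x_list'].str.join(", ")

print(data[['x_list', 'x_joined']])
```

Rows without a match end up with an empty list in x_list and an empty string in x_joined.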

Answered By: Christian