Python Pandas Extract text between words and symbols with regex

Question:

I am trying to extract certain words between some text and symbol using regex in Python Pandas. The problem is that sometimes there are non-characters after the word I want to extract, and sometimes there is nothing.

Here is the input table.

enter image description here

Here is the expected output.

enter image description here

I’ve tried this str.extract(r'[a-zA-ZW]+s+Reason to buy:([a-zA-Zs]+)[a-zA-ZW]+s+’) but does not work.

Any advice is appreciated.

Thanks.

Asked By: user13617491

||

Answers:

You can use a lookaround to extract your reason to buy:

(?<=Reason to buy: )([^<]+)

And use it in your Python code as follows:

import pandas as pd
import re

df = pd.DataFrame([
    [1, 'Team: Asian<>Reason to buy: leisure'             ],
    [2, 'Team: SouthAmerica<>Reason to buy: educational<>'],
    [3, 'Team: Australia<>Reason to buy: commercial'      ],
    [4, 'Team: North America<>Reason to buy: leisure<>'   ],
    [5, 'Team: Europe<>Reason to buy: leisure<>'          ],
], columns = ['team_id', 'reasons'])

pattern = r'(?<=Reason to buy: )([A-Za-z]+)'

df['reasons'] = df['reasons'].str.extract(pattern)

Output:

   team_id      reasons
0        1      leisure
1        2  educational
2        3   commercial
3        4      leisure
4        5      leisure

Check the Regex demo and the Python demo.

Answered By: lemon

You just need to capture the part matching any zero or more chars up to <> or end of string:

df['reasons'] = df['reasons'].str.extract(r"Reason to buy:s*(.*?)(?=<>|$)", expand=False)

See the regex demo.

Details:

  • Reason to buy: – a string
  • s* – zero or more whitespace
  • (.*?) – a capturing group with ID 1 that matches zero or more chars other than line break chars as few as possible
  • (?=<>|$) – a positive lookahead that requires a <> string or end of string immediately to the right of the current location.

Note that Series.str.extract requires at least one capturing group in the pattern to actually return a value, so you can even use a non-capturing group instead of the positive lookahead:

df['reasons'] = df['reasons'].str.extract(r"Reason to buy:s*(.*?)(?:<>|$)", expand=False)