pandas regex look ahead and behind from a 1st occurrence of character

Question:

I have Python strings like the ones below:

"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"

I would like to do the following:

a) Extract the characters that appear before and after the 1st dot

b) The keywords that I want are always found after the last _ symbol

For example, for the 2nd input string I would like to get only PQRST.GHI as output. That is the part after the last _ and before the 1st dot, plus the keyword after the 1st dot.

So, I tried the below

for s in strings:
    after_part = s.split('.')[1]                # text after the 1st dot
    before_part = s.split('.')[0]               # text before the 1st dot
    before_part = before_part.split('_')[-1]    # keep only what follows the last _
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)

Though this works, it is definitely not a nice or elegant way to do it.

Is there a better way to write this, for example with a regex?

I expect my output to look like the following. As you can see, we get the keywords before and after the 1st dot character:

GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
Asked By: The Great

||

Answers:

You can do this with str.extract (try the pattern here):

df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)

Output:

0    ABCDEF.GHI
1     PQRST.GHI
2     JKLMN.OPQ
3       WXY.TUV
Name: text, dtype: object
Answered By: Quang Hoang
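
As a quick check of this pattern outside pandas, here is a minimal sketch using only the standard re module. Series.str.extract returns the capture group of the first match, which is also what re.search does, so the two should agree. The variable names (escaped, unescaped, good, bad) are illustrative only. The sketch also shows why the dot between the two keywords has to be escaped: an unescaped . matches any character, including _, so the match can start at the first underscore instead of the last one.

```python
import re

strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
]

# \. matches only a literal dot, so the group must contain "keyword.keyword"
escaped = re.compile(r"_([^._]+\.[^.]+)")
# . matches any character (including "_"), so the first underscore already matches
unescaped = re.compile(r"_([^._]+.[^.]+)")

good = [escaped.search(s).group(1) for s in strings]
bad = [unescaped.search(s).group(1) for s in strings]

print(good)  # ['ABCDEF.GHI', 'PQRST.GHI']
print(bad)   # long chunks starting right after the first underscore
```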

Try (regex101):

import re

strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]

pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")

for s in strings:
    print(pat.search(s).group(1))

Prints:

ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
Answered By: Andrej Kesely
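
One thing to watch with the loop above: pat.search(s) returns None when a string does not match, and calling .group(1) on None raises AttributeError. A minimal sketch of a guard follows, assuming that returning None for non-matching names is acceptable. The helper name extract_keyword is hypothetical and not part of the answer above.

```python
import re
from typing import Optional

# Same idea as the pattern above, with the dot escaped so it is a literal dot
pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")

def extract_keyword(name: str) -> Optional[str]:
    """Return the 'before.after' keyword, or None if the name does not match."""
    m = pat.search(name)
    return m.group(1) if m else None

print(extract_keyword("1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx"))  # ABCDEF.GHI
print(extract_keyword("1234_4534_41247612_2462184_2131_GHI.xlsx"))         # GHI.xlsx
print(extract_keyword("no_dots_here"))                                     # None
```

Note that for a name with a single dot, such as the first string in the question, this pattern keeps the file extension (GHI.xlsx) rather than returning GHI alone.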

You can also do it with rsplit(). Specify maxsplit, so that you don’t split more than you need to (for efficiency):

[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']

If some strings have fewer than two dots and each returned string should still contain one dot, use a conditional expression that strips the last extension only when the part after the last _ contains more than one dot:

[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x 
 for s in strings
 for x in [s.rsplit('_', maxsplit=1)[1]]]

# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
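
The same logic can be folded into a small helper, sketched here under the assumption that a name without any underscore should be treated as if the whole name were the tail. The function name keyword is hypothetical. It uses str.rpartition instead of rsplit(...)[1], because rsplit('_', maxsplit=1)[1] raises IndexError when the string has no underscore.

```python
def keyword(filename: str) -> str:
    # Text after the last underscore (the whole name if there is no underscore)
    tail = filename.rpartition("_")[2]
    # Strip the final extension only when more than one dot is present,
    # so a single-dot tail such as "GHI.xlsx" is returned unchanged
    return tail.rsplit(".", maxsplit=1)[0] if tail.count(".") > 1 else tail

strings = [
    "1234_4534_41247612_2462184_2131_GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]

print([keyword(s) for s in strings])
# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
```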