pandas regex look ahead and behind from a 1st occurrence of character
Question:
I have python strings like below
"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
I would like to do the below
a) extract characters that appear before and after 1st dot
b) The keywords that I want are always found after the last _
symbol
For ex: If you look at 2nd input string, I would like to get only PQRST.GHI
as output. It is after last _
and before 1st .
and we also get keyword after 1st .
So, I tried the below
for s in strings:
after_part = (s.split('.')[1])
before_part = (s.split('.')[0])
before_part = qnd_part.split('_')[-1]
expected_keyword = before_part + "." + after_part
print(expected_keyword)
Though this works, this is definitely not nice and elegant way to write a regex.
Is there any other better way to write this?
I expect my output to be like as below. As you can see that we get keywords before and after 1st dot
character
GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
Answers:
You can do (try the pattern here )
df['text'].str.extract('_([^._]+.[^.]+)',expand=False)
Output:
0 ABCDEF.GHI
1 PQRST.GHI
2 JKLMN.OPQ
3 WXY.TUV
Name: text, dtype: object
Try (regex101):
import re
strings = [
"1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]
pat = re.compile(r"[^.]+_([^.]+.[^.]+)")
for s in strings:
print(pat.search(s).group(1))
Prints:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
You can also do it with rsplit()
. Specify maxsplit
, so that you don’t split more than you need to (for efficiency):
[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
If there are strings with less than 2 dots and each returned string should have one dot in it, then add a ternary operator that splits (or not) depending on the number of dots in the string.
[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x
for s in strings
for x in [s.rsplit('_', maxsplit=1)[1]]]
# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
I have python strings like below
"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
I would like to do the below
a) extract characters that appear before and after 1st dot
b) The keywords that I want are always found after the last _
symbol
For ex: If you look at 2nd input string, I would like to get only PQRST.GHI
as output. It is after last _
and before 1st .
and we also get keyword after 1st .
So, I tried the below
for s in strings:
after_part = (s.split('.')[1])
before_part = (s.split('.')[0])
before_part = qnd_part.split('_')[-1]
expected_keyword = before_part + "." + after_part
print(expected_keyword)
Though this works, this is definitely not nice and elegant way to write a regex.
Is there any other better way to write this?
I expect my output to be like as below. As you can see that we get keywords before and after 1st dot
character
GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
You can do (try the pattern here )
df['text'].str.extract('_([^._]+.[^.]+)',expand=False)
Output:
0 ABCDEF.GHI
1 PQRST.GHI
2 JKLMN.OPQ
3 WXY.TUV
Name: text, dtype: object
Try (regex101):
import re
strings = [
"1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]
pat = re.compile(r"[^.]+_([^.]+.[^.]+)")
for s in strings:
print(pat.search(s).group(1))
Prints:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
You can also do it with rsplit()
. Specify maxsplit
, so that you don’t split more than you need to (for efficiency):
[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
If there are strings with less than 2 dots and each returned string should have one dot in it, then add a ternary operator that splits (or not) depending on the number of dots in the string.
[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x
for s in strings
for x in [s.rsplit('_', maxsplit=1)[1]]]
# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']