Cannot seem to figure out this regex involving forward slash
Question:
I am trying to capture instances in my dataframe where a string has the following format:
/random a/random b/random c/capture this/random again/random/random
Where a string is preceded by four instances of /, and more than two / appear after it, I would like the string captured and returned in a different column. If it is not applicable to that row, return None.
In this instance, capture this should be captured and placed into a new column.
This is what I tried:
def extract_special_string(df, column):
    df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=/{4})[^/]+(?=/[^/]{2,})', x).group(0) if re.search(r'(?<=/{4})[^/]+(?=/[^/]{2,})', x) else None)

extract_special_string(df, 'column')
However nothing is being captured. Can anybody help with this regex? Thanks.
Answers:
You can use
df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)
See the regex demo
Details:
^ – start of string
(?:[^/]*/){4} – four occurrences of zero or more chars other than /, each followed by a / char
([^/]+) – capturing group 1: one or more chars other than a / char
(?:/[^/]*){2} – two occurrences of a / char followed by zero or more chars other than /.
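To see why the attempt in the question returned nothing, it may help to run both patterns with plain re. The sketch below assumes only the sample string from the question; the helper name capture_after_four_slashes is made up for illustration. The lookbehind (?<=/{4}) asks for four / characters in a row directly before the match, which the sample never contains, whereas the anchored pattern counts four separate segments that each end in /.

```python
import re

s = '/random a/random b/random c/capture this/random again/random/random'

# Pattern from the question: the lookbehind needs the literal text '////'
# immediately before the match, so it never fires on the sample string.
QUESTION_PATTERN = re.compile(r'(?<=/{4})[^/]+(?=/[^/]{2,})')
print(QUESTION_PATTERN.search(s))  # None

# It only matches input that really has four consecutive slashes.
print(QUESTION_PATTERN.search('////x/yz').group(0))  # x

# Pattern from the answer: four "segment + slash" units, then the capture,
# then at least two more slashes.
ANCHORED = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')

def capture_after_four_slashes(text):
    """Hypothetical helper: return group 1, or None when the row does not fit."""
    m = ANCHORED.match(text)
    return m.group(1) if m else None

print(capture_after_four_slashes(s))             # capture this
print(capture_after_four_slashes('/a/b/c/d/e'))  # None (only one '/' follows)
```

With pandas, str.extract marks rows that do not match as missing (NaN) rather than None, so a follow-up step may be needed if a literal None is required.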
An alternative regex approach would be to use non-greedy quantifiers.
import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1)) # 'capture this'
/(?:.*?/){3} – match the part before capture this, matching any characters, but as few as possible, between each pair of /s (a non-capturing group is used to ignore the contents)
(.*?) – capture capture this (since this is a capturing group, we can fetch the contents from <match_object>.group(1))
(?:/.*?){2,} – same as the first part, match as few characters as possible between each pair of /s
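The two answers agree on the sample string but can differ on edge cases, because . is allowed to match / and (.*?) is allowed to match nothing. The comparison below is a small sketch under those observations; first_group and the extra sample strings are illustrative and not part of either answer.

```python
import re

ANCHORED = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')
LAZY = re.compile(r'/(?:.*?/){3}(.*?)(?:/.*?){2,}')

def first_group(pattern, text):
    """Return group 1 of a match at the start of text, or None."""
    m = pattern.match(text)
    return m.group(1) if m else None

samples = [
    '/random a/random b/random c/capture this/random again/random/random',
    '/a/b/c/d/e',    # only one '/' after the fifth segment
    '/a/b/c//e/f',   # nothing between the fourth and fifth '/'
]

for text in samples:
    print(repr(text),
          repr(first_group(ANCHORED, text)),
          repr(first_group(LAZY, text)))
```

On the last sample the [^/]+ version yields None, while the non-greedy version matches and returns an empty string, so the former is likely the safer choice when empty segments should count as "not applicable".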