Cannot seem to figure out this regex involving forward slash

Question:

I am trying to capture instances in my dataframe where a string has the following format:

/random a/random b/random c/capture this/random again/random/random

When a string is preceded by four / characters and followed by more than two / characters, I would like it captured and returned in a new column. If a row does not match, the result should be None.

In this example, "capture this" should be captured and placed into the new column.

This is what I tried:

def extract_special_string(df, column):
    df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=/{4})[^/]+(?=/[^/]{2,})', x).group(0) if re.search(r'(?<=/{4})[^/]+(?=/[^/]{2,})', x) else None)

extract_special_string(df, 'column')

However, nothing is being captured. Can anybody help with this regex? Thanks.

Asked By: work_python

Answers:

You can use Series.str.extract with a pattern anchored at the start of the string:

df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)

Details:

  • ^ – start of string
  • (?:[^/]*/){4} – four occurrences of any zero or more chars other than / and then a / char
  • ([^/]+) – Capturing group 1: one or more chars other than a / char
  • (?:/[^/]*){2} – two occurrences of a / char and then any zero or more chars other than /.
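
To see why the original attempt captures nothing while the anchored pattern works, here is a minimal sketch using only the standard re module. The helper name extract_after_four_slashes and the second test string are illustrative assumptions, not part of the question. The lookbehind (?<=/{4}) asserts four consecutive slashes ("////") directly before the match, which never occurs in the sample string, whereas the anchored pattern counts slash-delimited segments.

```python
import re

s = '/random a/random b/random c/capture this/random again/random/random'

# The question's pattern: the lookbehind requires "////" immediately
# before the match, so it finds nothing in this string.
attempt = re.compile(r'(?<=/{4})[^/]+(?=/[^/]{2,})')
print(attempt.search(s))  # None

# The anchored pattern from this answer: skip four "segment + /" units,
# capture the next segment, then require at least two more "/segment" units.
anchored = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')

def extract_after_four_slashes(text):
    """Hypothetical helper: return the captured segment, or None."""
    m = anchored.search(text)
    return m.group(1) if m else None

print(extract_after_four_slashes(s))             # capture this
print(extract_after_four_slashes('/a/b/c/d/e'))  # None (only one / follows "d")
```

In pandas, str.extract with expand=False returns the first capture group as a Series, and rows that do not match come back as NaN rather than None.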
Answered By: Wiktor Stribiżew

An alternative regex approach would be to use non-greedy quantifiers.

import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1))  # 'capture this'

  • /(?:.*?/){3} – matches everything before capture this: the leading / and then three runs of as few characters as possible, each ending in a / (the group is non-capturing, so its contents are discarded)
  • (.*?) – captures capture this; because this is a capturing group, its contents are available from <match_object>.group(1)
  • (?:/.*?){2,} – mirrors the first part: two or more occurrences of a / followed by as few characters as possible
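
The non-greedy pattern can be wrapped so that non-matching strings yield None, as the question asks. This is only a sketch: the helper name capture_lazy and the extra test strings are assumptions made for illustration. It also shows one edge case worth knowing about: because (.*?) may match zero characters, an empty segment is captured as an empty string instead of being rejected.

```python
import re

# Same non-greedy pattern as above, compiled once for reuse.
lazy = re.compile(r'/(?:.*?/){3}(.*?)(?:/.*?){2,}')

def capture_lazy(text):
    """Hypothetical helper: return group 1 of the lazy pattern, or None."""
    m = lazy.match(text)
    return m.group(1) if m else None

sample = '/random a/random b/random c/capture this/random again/random/random'
print(capture_lazy(sample))               # capture this
print(capture_lazy('/a/b/c/d/e'))         # None (too few slashes overall)
print(repr(capture_lazy('/a/b/c//e/f')))  # '' (empty segment is still captured)
```

If empty segments should not count as a match, changing the capturing group to require at least one non-slash character, as ([^/]+) does in the other answer, is one option.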
Answered By: Fractalism