Regex match returning None

Question:

I know similar questions have already been asked on the platform, but I checked them and did not find the help I needed.

I have some strings such as:

path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"

path = "activewear/2356_15890_9397775? povid=ApparelNavpopular data structure you to be informed when a regression"

I have a function:

def extract_id(path):
    pattern = re.compile(r"([0-9]+(_[0-9]+)+)", re.IGNORECASE)
    return pattern.match(path)

The expected results are 5438_133195_9917949_1218833 and 2356_15890_9397775. I tested the function online, and it seems to produce the expected result, but it’s returning None in my app. What am I doing wrong?
Thanks.

Asked By: Ktrel


Answers:

match only matches at the beginning of the string (and fullmatch only matches the entire string). What you want is search, which scans the string for the first place where the pattern matches. search returns a match object, so you have to use group to retrieve the matched text. You don’t need re.IGNORECASE if you are looking for characters that don’t have a case. You should also compile your regex only once: recompiling a pattern that never changes, every time a function is called, is not optimal.
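A minimal sketch of the difference between match and search, using the pattern from the question on a shortened path. The shortened path and the names at_start and anywhere are made up for illustration:

```python
import re

# The question's pattern, compiled once at module level.
pattern = re.compile(r"([0-9]+(_[0-9]+)+)")

# Shortened, illustrative path (not the exact string from the question).
path = "activewear/2356_15890_9397775?povid=ApparelNav"

at_start = pattern.match(path)   # only tries position 0, where "a" is not a digit
anywhere = pattern.search(path)  # scans forward until the pattern fits

print(at_start)           # None
print(anywhere.group(1))  # 2356_15890_9397775
```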

You could simplify your expression to ((\d+_?)+)\?, which finds a repeating sequence of one or more digits, each optionally followed by an underscore, that is ultimately ended by a literal question mark.

example:

import re

#do this once
pathid = re.compile(r'((\d+_?)+)\?')

def extract_id(path:str) -> str:
    if m := pathid.search(path): #make sure there is a match
        return m.group(1)        #return match from group 1 `((\d+_?)+)`
    return None                  #no match

#use
path   = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)

#proof
print(result) #5438_133195_9917949_1218833

python regex docs

Your id comes after the last / and before the ?. The below solution will likely be much faster. This doesn’t search by pattern, it prunes by position.

def extract_id(path:str) -> str:
    #right of the last / to left of the ?
    return path.split('/')[-1].split('?')[0]

#use
path   = "thingsbefore/5438_133195_9917949_1218833?thingsafter"
result = extract_id(path)

#proof
print(result) #5438_133195_9917949_1218833
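
The positional approach assumes the path always contains the id between the last / and a ?. As a hedged sketch, the variant below adds a validation step so that unexpected input yields None instead of arbitrary text. The name extract_id_by_position and the rule that an id is at least two ASCII digit groups joined by underscores are assumptions made for illustration, not part of the original answer:

```python
from typing import Optional

def extract_id_by_position(path: str) -> Optional[str]:
    # Text after the last "/" (the whole string if there is no "/").
    tail = path.rpartition("/")[2]
    # Text before the first "?" in that tail (the whole tail if there is no "?").
    candidate = tail.partition("?")[0]
    # Assumed id shape: two or more ASCII digit groups joined by underscores.
    parts = candidate.split("_")
    if len(parts) >= 2 and all(p.isascii() and p.isdigit() for p in parts):
        return candidate
    return None

print(extract_id_by_position("thingsbefore/5438_133195_9917949_1218833?thingsafter"))
print(extract_id_by_position("no id here"))
```

The first call prints the id and the second prints None, because the text contains no digit groups.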
Answered By: OneMadGypsy

You don’t need any capture groups; you can get the whole match and return it with .group() using re.search:

\b\d+(?:_\d+)+\b
  • \b A word boundary
  • \d+ Match 1+ digits
  • (?:_\d+)+ Repeat 1+ times _ and 1+ digits
  • \b A word boundary

Regex demo

import re

path = "most popular data structure in OOP lists/5438_133195_9917949_1218833? povid=racking these benchmarks"
pattern = re.compile(r"\b\d+(?:_\d+)+\b")
def extract_id(path):
    return pattern.search(path).group()

print(extract_id(path))

Output

5438_133195_9917949_1218833
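
Note that search returns None when nothing matches, so calling .group() directly on its result raises AttributeError for a path without an id. A minimal sketch of a None-safe variant follows. The name extract_id_or_none and the sample paths are hypothetical:

```python
import re
from typing import Optional

# Same pattern as above: digit groups joined by underscores, on word boundaries.
pattern = re.compile(r"\b\d+(?:_\d+)+\b")

def extract_id_or_none(path: str) -> Optional[str]:
    # search returns a match object, or None if the pattern is not found.
    match = pattern.search(path)
    return match.group() if match else None

print(extract_id_or_none("activewear/2356_15890_9397775?povid=ApparelNav"))  # 2356_15890_9397775
print(extract_id_or_none("activewear/no-id-here"))                           # None
```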
Answered By: The fourth bird