Python Regex: Chain Multiple Substitutions In Function

Question:

I am looking to clean URLs in the else block of an if statement. Specifically, I want to strip the ? and all query parameters after it, as well as everything up to and including the first "/".

Example Input = ‘somesite.com/somepage?param=1&else=2’

Example Output = ‘somepage’

** All that is left is our page (no query params and no domain) **

Below is what I have so far (not working). I was tackling this piece by piece, and the code below is an attempt at stripping all query parameters. I’m not sure how I would chain both steps together.

def new_url_check(x):
    
    if 'some condition' in x:
        x = 'some random condition'           
            
    else:
        re.sub(r'^([^?]+)', '', x)
            
    return x
Asked By: Carson Whitley

Answers:

How about using re.search and capturing the part you want as a group?

re.search(r'\.com/([^?]*)', x).group(1)
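One thing to watch with this approach: re.search returns None when the pattern does not match, so calling .group(1) directly raises AttributeError on such input. A guarded version could look like the sketch below. The helper name page_from_url is made up for illustration, and the pattern assumes the domain ends in .com, as in the question's example.

```python
import re

# Hypothetical helper; assumes the host ends in ".com" as in the example input.
def page_from_url(x):
    # Escape the dot, then capture everything after ".com/" up to a "?" or "#".
    match = re.search(r'\.com/([^?#]*)', x)
    # re.search returns None on no match, so guard before calling .group().
    return match.group(1) if match else None

print(page_from_url('somesite.com/somepage?param=1&else=2'))  # somepage
```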
Answered By: Andrew Lien

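You’ll have to assign the result of your re.sub call back to x (or return it directly), because re.sub returns a new string and does not modify x. The regex should also remove everything up to the last forward slash. In addition, the # symbol has a special meaning in URLs (it starts the fragment), so everything from it onward should be stripped as well: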

x = re.sub(r'^.*/|[?#].*', '', x)
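Dropped into the function from the question, that line could be used as in the sketch below. The condition strings are the placeholders from the question, not real logic.

```python
import re

def new_url_check(x):
    if 'some condition' in x:
        x = 'some random condition'
    else:
        # One pattern, two alternatives: everything up to the last "/",
        # or everything from the first "?" or "#" onward.
        # re.sub returns a new string, so assign it back to x.
        x = re.sub(r'^.*/|[?#].*', '', x)
    return x

print(new_url_check('somesite.com/somepage?param=1&else=2'))  # somepage
```

Because both removals are alternatives in a single pattern, no separate chaining of re.sub calls is needed for this case.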

If you want to keep the first part of the path instead of the last, then:

x = re.sub(r'^.*?/+.*?/|[/?#].*', '', x)

This assumes that the input contains the host and that at least one forward slash precedes it (a scheme such as http:// or a leading /). So it will work for the following:

http://localhost/abc/def/ghi  => abc
/localhost/abc/def/ghi => abc
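A quick way to check that assumption is to run the pattern on inputs with and without a slash before the host. In this sketch the helper name first_path_segment is made up for illustration; the pattern is the one given above.

```python
import re

# Hypothetical helper wrapping the "keep the first path segment" pattern.
def first_path_segment(url):
    return re.sub(r'^.*?/+.*?/|[/?#].*', '', url)

print(first_path_segment('http://localhost/abc/def/ghi'))   # abc
print(first_path_segment('/localhost/abc/def/ghi'))         # abc
# No slash before the host: the host itself is what remains.
print(first_path_segment('somesite.com/somepage?param=1'))  # somesite.com
```

The last call shows why the assumption matters: with no slash before the host, the first alternative cannot match, so the host is kept and the path is stripped instead.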
Answered By: trincot