Regex negative lookahead string with special character python

Question:

It’s about content dimensions on a website. This link checker tool supports Python Regex. With the link checker I want to get information about just one content dimension.

I’d like to match all except the one with the string de_de (for the --no-follow-url option).

https://www.example.com/int_en
https://www.example.com/int_de
https://www.example.com/de_de  ##should not match or all others should match
https://www.example.com/be_de
https://www.example.com/fr_fr
https://www.example.com/gb_en
https://www.example.com/us_en
https://www.example.com/ch_de
https://www.example.com/ch_it
https://www.example.com/shop

I’m stuck somewhere in between these approaches:

https://www.example.com/\bde_de
https://www.example.com/[^de]{2,3}[^de]
https://www.example.com/[a-z]{2,3}_[^d][^e]
https://www.example.com/([a-z]{2,3}_)(?!^de$)
https://www.example.com/[a-z]{2,3}_
https://www.example.com/(?!^de_de$)

How can I use a negative lookahead to match a string with a special character (underscore)? Can I go with something like this:

(?!^de_de$)

I’m new to regex, any help or input is appreciated.

Asked By: Sevi S.

Answers:

You could try:

https://www.example.com/.+?(?<!de_de)\b

This matches:

https://www.example.com/shop

but not:

https://www.example.com/de_de


Explanation: here we use a negative lookbehind (?<!de_de) applied to a word boundary (\b). This means that we have to find a word boundary that is not preceded by “de_de”.
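
To make the lookbehind behaviour concrete, here is a minimal sketch that runs the pattern with re.search. The dots are escaped here, and the last URL is a hypothetical sub-page that does not appear in the question.

```python
import re

# Lookbehind pattern from this answer, with the dots escaped.
pattern = re.compile(r"https://www\.example\.com/.+?(?<!de_de)\b")

urls = [
    "https://www.example.com/shop",
    "https://www.example.com/de_de",
    "https://www.example.com/be_de",
    "https://www.example.com/de_de/products",  # hypothetical sub-page
]

# True when the pattern is found somewhere in the URL.
results = {url: bool(pattern.search(url)) for url in urls}
for url, matched in results.items():
    print(url, "=>", "MATCHED" if matched else "NOT MATCHED")
```

Only the bare /de_de URL is rejected. The hypothetical /de_de/products still matches, because the lazy .+? can run past the slash to a later word boundary that is not preceded by de_de. If sub-pages should be excluded too, this pattern alone is probably not enough.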

Answered By: gil.fernandes

Use the following regex:

https://www\.example\.com/(?!de_de(?:/|$))[a-z_]+

If you also want to match http, add s? after http in the pattern: https?://www\.example\.com/(?!de_de(?:/|$))[a-z_]+

Note that you should escape the dots so they match literal dots in the string. The (?!de_de(?:/|$))[a-z_]+ part matches one or more letters or underscores (see [a-z_]+), unless the text at that position is de_de followed by / or the end of the string.
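
The effect of the (?:/|$) part inside the lookahead is easiest to see on a few edge cases. The sketch below uses the https? variant. Every URL except the first is a hypothetical variation that does not appear in the question.

```python
import re

# Lookahead pattern with escaped dots and an optional "s" in the scheme.
rx = re.compile(r"https?://www\.example\.com/(?!de_de(?:/|$))[a-z_]+")

urls = [
    "https://www.example.com/de_de",       # from the question
    "https://www.example.com/de_de/",      # hypothetical: trailing slash
    "https://www.example.com/de_de/shop",  # hypothetical: sub-page
    "https://www.example.com/de_de_old",   # hypothetical: longer segment
    "https://www.example.com/de_en",
    "http://www.example.com/fr_fr",        # plain http
]

# True when the pattern is found somewhere in the URL.
results = {url: rx.search(url) is not None for url in urls}
for url, matched in results.items():
    print(url, "=>", "MATCHED" if matched else "NOT MATCHED")
```

The first three URLs are rejected because de_de is followed by / or the end of the string. The hypothetical /de_de_old still matches, since the lookahead only blocks de_de as a complete path segment.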

Python demo:

import re
ex = ["https://www.example.com/int_en","https://www.example.com/int_de","https://www.example.com/de_de","https://www.example.com/be_de","https://www.example.com/de_en","https://www.example.com/fr_en","https://www.example.com/fr_fr","https://www.example.com/gb_en","https://www.example.com/us_en","https://www.example.com/ch_de","https://www.example.com/ch_it"]
# Escape the dots so they match literal dots only.
rx = r"https://www\.example\.com/(?!de_de(?:/|$))[a-z_]+"
for s in ex:
    m = re.search(rx, s)
    if m:
        print("{} => MATCHED".format(s))
    else:
        print("{} => NOT MATCHED".format(s))

Output:

https://www.example.com/int_en => MATCHED
https://www.example.com/int_de => MATCHED
https://www.example.com/de_de => NOT MATCHED
https://www.example.com/be_de => MATCHED
https://www.example.com/de_en => MATCHED
https://www.example.com/fr_en => MATCHED
https://www.example.com/fr_fr => MATCHED
https://www.example.com/gb_en => MATCHED
https://www.example.com/us_en => MATCHED
https://www.example.com/ch_de => MATCHED
https://www.example.com/ch_it => MATCHED

Answered By: Wiktor Stribiżew

Basically, you want to exclude the German version of the website, so I would go with something like this:

import re
r = re.compile(r'(https?://|www\.)[^/]+/(?!de_de)\S+')

since it would also work for cases:
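
A quick way to see what this pattern accepts and rejects is to run it over a few URLs. This is only a sketch: the URLs below are hypothetical, chosen for illustration, and the backslashes in \. and \S are written out.

```python
import re

# Scheme-or-www prefix, any host, then a path that does not start with de_de.
r = re.compile(r"(https?://|www\.)[^/]+/(?!de_de)\S+")

urls = [
    "https://www.example.com/fr_fr",
    "www.example.com/fr_fr",               # no scheme
    "https://example.com/shop",            # different host
    "http://www.example.com/de_de",
    "https://www.example.com/de_de/shop",
    "https://www.example.com/de_de_old",   # also rejected: no boundary check
]

# True when the pattern is found somewhere in the URL.
results = {url: r.search(url) is not None for url in urls}
for url, matched in results.items():
    print(url, "=>", "MATCHED" if matched else "NOT MATCHED")
```

Because the lookahead has no / or end-of-string check after de_de, any first path segment that merely starts with de_de is rejected as well, including the hypothetical de_de_old.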

Answered By: ash17