Regex negative lookahead string with special character python
Question:
It’s about content dimensions on a website. This link checker tool supports Python Regex. With the link checker I want to get information about just one content dimension.
I’d like to match all URLs except the one with the string de_de (for the --no-follow-url option).
https://www.example.com/int_en
https://www.example.com/int_de
https://www.example.com/de_de ##should not match or all others should match
https://www.example.com/be_de
https://www.example.com/fr_fr
https://www.example.com/gb_en
https://www.example.com/us_en
https://www.example.com/ch_de
https://www.example.com/ch_it
https://www.example.com/shop
I’m stuck somewhere in between these approaches:
https://www.example.com/\bde_de
https://www.example.com/[^de]{2,3}[^de]
https://www.example.com/[a-z]{2,3}_[^d][^e]
https://www.example.com/([a-z]{2,3}_)(?!^de$)
https://www.example.com/[a-z]{2,3}_
https://www.example.com/(?!^de_de$)
How can I use a negative lookahead to match a string with a special character (underscore)? Can I go with something like (?!^de_de$)?
I’m new to regex; any help or input is appreciated.
Answers:
You could try:
https://www.example.com/.+?(?<!de_de)\b
This matches:
https://www.example.com/shop
but not:
https://www.example.com/de_de
Explanation: here we use a negative lookbehind, (?<!de_de), applied to a word boundary (\b). This means that we have to find a word boundary that is not preceded by “de_de”.
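A quick way to check this answer is to run the pattern against a few URLs. The sketch below is only an illustration: the dots are escaped here, and the /de_de/news URL is a hypothetical sub-page that does not appear in the question. It suggests one caveat: because the lazy .+? can keep expanding, the lookbehind only rejects URLs that end in de_de, so a deeper path under de_de still matches.

```python
import re

# Lookbehind variant from this answer; dots are escaped so they match literally.
pattern = re.compile(r"https://www\.example\.com/.+?(?<!de_de)\b")

urls = [
    "https://www.example.com/shop",
    "https://www.example.com/de_de",
    "https://www.example.com/de_de/news",  # hypothetical sub-page, not from the question
]

# Map each URL to True/False depending on whether the pattern finds a match.
results = {url: bool(pattern.search(url)) for url in urls}

for url, matched in results.items():
    print("{} => {}".format(url, "MATCHED" if matched else "NOT MATCHED"))
```

If sub-pages of de_de must also be excluded, a lookahead placed directly after the host is probably the safer choice.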
Use the following regex:
https://www\.example\.com/(?!de_de(?:/|$))[a-z_]+
If you also want to match http, make the s optional by adding ? after it: https?://www\.example\.com/(?!de_de(?:/|$))[a-z_]+
Note that you should escape the dots so that they match literal dots in the string. The (?!de_de(?:/|$))[a-z_]+ part matches one or more letters or underscores (see [a-z_]+), unless they are de_de followed by / or the end of the string.
import re

# Sample URLs to test against
ex = ["https://www.example.com/int_en", "https://www.example.com/int_de",
      "https://www.example.com/de_de", "https://www.example.com/be_de",
      "https://www.example.com/de_en", "https://www.example.com/fr_en",
      "https://www.example.com/fr_fr", "https://www.example.com/gb_en",
      "https://www.example.com/us_en", "https://www.example.com/ch_de",
      "https://www.example.com/ch_it"]
# Reject de_de only when it is followed by "/" or the end of the string
rx = r"https://www\.example\.com/(?!de_de(?:/|$))[a-z_]+"
for s in ex:
    m = re.search(rx, s)
    if m:
        print("{} => MATCHED".format(s))
    else:
        print("{} => NOT MATCHED".format(s))
Output:
https://www.example.com/int_en => MATCHED
https://www.example.com/int_de => MATCHED
https://www.example.com/de_de => NOT MATCHED
https://www.example.com/be_de => MATCHED
https://www.example.com/de_en => MATCHED
https://www.example.com/fr_en => MATCHED
https://www.example.com/fr_fr => MATCHED
https://www.example.com/gb_en => MATCHED
https://www.example.com/us_en => MATCHED
https://www.example.com/ch_de => MATCHED
https://www.example.com/ch_it => MATCHED
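For comparison, here is a small sketch of why the (?!^de_de$) attempt from the question does not exclude anything, next to the https? variant of the pattern above. The variable names and the /de_de/news and /de_den URLs are made up for illustration. The point is that ^ inside the lookahead can only match at the very start of the string (without re.MULTILINE), so in the middle of a URL the negative lookahead always succeeds.

```python
import re

# Attempt from the question: "^" cannot match in the middle of the string,
# so the inner pattern never matches and the negative lookahead always passes.
attempt = re.compile(r"https://www\.example\.com/(?!^de_de$)")

# Pattern from the answer, with escaped dots and an optional "s" in the scheme.
fixed = re.compile(r"https?://www\.example\.com/(?!de_de(?:/|$))[a-z_]+")

attempt_de = bool(attempt.search("https://www.example.com/de_de"))
fixed_de = bool(fixed.search("https://www.example.com/de_de"))
fixed_sub = bool(fixed.search("http://www.example.com/de_de/news"))  # hypothetical sub-page
fixed_den = bool(fixed.search("http://www.example.com/de_den"))      # hypothetical longer segment

print(attempt_de)  # True: the attempt does not exclude de_de
print(fixed_de)    # False
print(fixed_sub)   # False: de_de followed by "/" is rejected too
print(fixed_den)   # True: de_de is only rejected as a whole path segment
```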
Basically, you want to exclude the German versions of the website, so I would go with something like this:
import re
r = re.compile(r'(https?://|www\.)[^/]+/(?!de_de)\S+')
since it would also handle cases such as:
- https://example.com/de_de/news (links with further directories)
- http://www.example.com/ (http protocols)
- www.example.com/de-de (links lacking http/https prefixes)
- http://example.com/ (links lacking ‘www’ bit)
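To see how this more general pattern behaves on the cases listed, here is a rough check. The sample URLs and the variable names are assumptions for illustration only. Two things are worth knowing: the lookahead tests for de_de with an underscore, so a de-de path with a hyphen is not excluded, and \S+ needs at least one character after the slash, so a bare root URL does not match at all.

```python
import re

# General pattern from this answer, with the backslashes written out.
pattern = re.compile(r"(https?://|www\.)[^/]+/(?!de_de)\S+")

samples = [
    "https://example.com/de_de/news",  # further directories under de_de
    "http://www.example.com/de_de",    # http scheme
    "www.example.com/fr_fr",           # no scheme prefix
    "www.example.com/de-de",           # hyphen instead of underscore
    "http://example.com/",             # root URL with an empty path
]

# Map each sample to True/False depending on whether the pattern finds a match.
results = {url: bool(pattern.search(url)) for url in samples}

for url, matched in results.items():
    print("{} => {}".format(url, "MATCHED" if matched else "NOT MATCHED"))
```

If hyphenated locale paths also need to be excluded, the lookahead could be widened to (?!de[_-]de), but whether that is wanted depends on the site.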