Regular expression: match start or whitespace
Question:
Can a regular expression match whitespace or the start of a string?
I’m trying to replace the currency abbreviation GBP with a £ symbol. I could just match anything starting with GBP, but I’d like to be a bit more conservative and look for certain delimiters around it.
>>> import re
>>> text = u'GBP 5 Off when you spend GBP75.00'
>>> re.sub(ur'GBP([\W\d])', ur'£\g<1>', text) # matches GBP with any prefix
u'\xa3 5 Off when you spend \xa375.00'
>>> re.sub(ur'^GBP([\W\d])', ur'£\g<1>', text) # matches at start only
u'\xa3 5 Off when you spend GBP75.00'
>>> re.sub(ur'(\W)GBP([\W\d])', ur'\g<1>£\g<2>', text) # matches whitespace prefix only
u'GBP 5 Off when you spend \xa375.00'
Can I do both of the latter examples at the same time?
Answers:
\b is a word boundary: a zero-width position between a word character and a non-word character, such as whitespace, the beginning of a line, or a non-alphanumeric symbol (\bGBP\b).
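To see how the word boundary behaves on the sample text from the question, here is a minimal Python 3 sketch. The variable names are illustrative and not part of the answer. It suggests that \bGBP\b misses the GBP in GBP75.00, because "P" and "7" are both word characters and no boundary lies between them, while a boundary on the left side only catches both.

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# \bGBP\b needs a boundary on both sides, so "GBP75" is skipped:
# "P" and "7" are both word characters.
both_sides = re.sub(r'\bGBP\b', '£', text)

# Requiring the boundary only on the left also catches "GBP75".
left_only = re.sub(r'\bGBP', '£', text)

print(both_sides)  # £ 5 Off when you spend GBP75.00
print(left_only)   # £ 5 Off when you spend £75.00
```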
Yes, why not?
re.sub(u'^\W*GBP...
matches the start of the string, 0 or more non-word characters (whitespace included), then GBP…
edit: Oh, I think you want alternation, use the |:
re.sub(u'(^|\W)GBP...
You can always trim leading and trailing whitespace from the token before you search if it’s not a matching/grouping situation that requires the full line.
This replaces GBP if it’s preceded by the start of a string or a word boundary (which the start of a string already is), and after GBP comes a numeric value or a word boundary:
re.sub(ur'\bGBP(?=\b|\d)', u'£', text)
This removes the need for any unnecessary backreferencing by using a lookahead. Inclusive enough?
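A minimal Python 3 sketch of the lookahead approach follows. The extended sample text and variable names are assumptions added for illustration. Because the lookahead consumes nothing, the replacement needs no backreferences, and GBP embedded in a longer word is left alone.

```python
import re

# Sample text extended (hypothetically) with two cases that should NOT match.
text = 'GBP 5 Off when you spend GBP75.00, not GBPX or XGBP 5'

# \b on the left; on the right, either a word boundary or a digit,
# checked with a lookahead so nothing after GBP is consumed.
result = re.sub(r'\bGBP(?=\b|\d)', '£', text)

print(result)  # £ 5 Off when you spend £75.00, not GBPX or XGBP 5
```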
Use the OR “|” operator:
>>> re.sub(r'(^|\W)GBP([\W\d])', u'\g<1>£\g<2>', text)
u'\xa3 5 Off when you spend \xa375.00'
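Note that \W matches any non-word character, not only whitespace. The Python 3 sketch below contrasts (^|\W) with (^|\s) on a made-up sample string (the text and variable names are assumptions for illustration), in case only whitespace should count as a delimiter.

```python
import re

# Hypothetical sample with punctuation directly before GBP.
text = 'GBP 5, (GBP6) and EUR/GBP7'

# (^|\W) accepts any non-word character before GBP, including "(" and "/".
non_word = re.sub(r'(^|\W)GBP([\W\d])', r'\g<1>£\g<2>', text)

# (^|\s) accepts only the start of the string or a whitespace character.
whitespace = re.sub(r'(^|\s)GBP([\W\d])', r'\g<1>£\g<2>', text)

print(non_word)    # £ 5, (£6) and EUR/£7
print(whitespace)  # £ 5, (GBP6) and EUR/GBP7
```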
I think you’re looking for '(^|\W)GBP([\W\d])'
It works in Perl:
$text = 'GBP 5 off when you spend GBP75';
$text =~ s/(\W|^)GBP([\W\d])/$1\$$2/g;
printf "$text\n";
The output is:
$ 5 off when you spend $75
Note that I stipulated that the match should be global, to get all occurrences.
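For comparison, a rough Python 3 equivalent of the Perl snippet is sketched below (variable names are illustrative). Python's re.sub replaces every occurrence by default, so no global flag is needed; the count argument limits the number of replacements instead.

```python
import re

text = 'GBP 5 off when you spend GBP75'

# Default: all occurrences are replaced (like Perl's /g).
all_matches = re.sub(r'(\W|^)GBP([\W\d])', r'\1$\2', text)

# count=1 replaces only the first occurrence (like Perl without /g).
first_only = re.sub(r'(\W|^)GBP([\W\d])', r'\1$\2', text, count=1)

print(all_matches)  # $ 5 off when you spend $75
print(first_only)   # $ 5 off when you spend GBP75
```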
A left-hand whitespace boundary – a position in the string that is either a string start or right after a whitespace character – can be expressed with
(?<!\S) # A negative lookbehind requiring no non-whitespace char immediately to the left of the current position
(?<=\s|^) # A positive lookbehind requiring a whitespace or start of string immediately to the left of the current position
(?:\s|^) # A non-capturing group matching either a whitespace or start of string
(\s|^) # A capturing group matching either a whitespace or start of string
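As a sketch of how these variants differ in Python 3: the group-based forms consume the whitespace, so it has to be restored in the replacement. Also, to my knowledge Python's built-in re module requires fixed-width lookbehinds, so the (?<=\s|^) form is expected to be rejected there; the check below does not assume either outcome. The sample text and variable names are assumptions for illustration.

```python
import re

text = 'GBP 5 Off when you spend GBP75.00 (XGBP5 stays)'

# Capturing-group variant: the whitespace is consumed by group 1,
# so it is put back with \1 in the replacement.
result = re.sub(r'(\s|^)GBP([\W\d])', r'\1£\2', text)
print(result)  # £ 5 Off when you spend £75.00 (XGBP5 stays)

# Probe whether this regex engine accepts the variable-width lookbehind.
try:
    re.compile(r'(?<=\s|^)GBP')
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False
print(lookbehind_supported)
```

If the probe prints False, (?<!\S) is the lookbehind form to use with re.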
Python 3 demo:
import re
rx = r'(?<!\S)GBP([\W\d])'
text = 'GBP 5 Off when you spend GBP75.00'
print( re.sub(rx, r'£\1', text) )
# => £ 5 Off when you spend £75.00
Note that you may use \1 instead of \g<1> in the replacement pattern, since an unambiguous backreference is only needed when the group reference is followed by a digit.
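The case where the unambiguous form matters can be sketched as follows in Python 3. The input and variable names are made up for illustration: when a literal digit must follow the group, \g<1>0 means "group 1, then 0", whereas \10 is read as a reference to group 10.

```python
import re

text = 'GBP5'

# \g<1> followed by a literal "0": group 1, then the character 0.
explicit = re.sub(r'GBP(\d)', r'£\g<1>0', text)
print(explicit)  # £50

# r'£\10' is parsed as a reference to group 10, which this pattern lacks.
try:
    re.sub(r'GBP(\d)', r'£\10', text)
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = exc
print(ambiguous_error)
```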
BONUS: A right-hand whitespace boundary can be expressed with the following patterns:
(?!\S) # A negative lookahead requiring no non-whitespace char immediately to the right of the current position
(?=\s|$) # A positive lookahead requiring a whitespace or end of string immediately to the right of the current position
(?:\s|$) # A non-capturing group matching either a whitespace or end of string
(\s|$) # A capturing group matching either a whitespace or end of string
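A short Python 3 sketch of the right-hand boundary, using a hypothetical sample where the abbreviation trails the amount (the text and variable names are assumptions for illustration). Both lookahead forms should give the same result here, and "75GBP," is left untouched because a comma is a non-whitespace character.

```python
import re

# Hypothetical sample: GBP followed by a space, a comma, and end of string.
text = 'Pay 5GBP now, 75GBP, or 10GBP'

# Negative lookahead: no non-whitespace char may follow GBP.
negative = re.sub(r'GBP(?!\S)', '£', text)

# Positive lookahead: whitespace or end of string must follow GBP.
positive = re.sub(r'GBP(?=\s|$)', '£', text)

print(negative)  # Pay 5£ now, 75GBP, or 10£
print(positive)  # Pay 5£ now, 75GBP, or 10£
```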