Python Regular Expression Match All 5 Digit Numbers but None Larger
Question:
I'm attempting to match 5-digit coupon codes spread throughout an HTML web page, for example 53232, 21032, 40021, etc. I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches inside 6-, 7-, 8-, ... n-digit numbers. Can someone please suggest how I would modify this regular expression to match only 5-digit numbers?
Answers:
Without padding the string to handle the special cases of start and end of string, as in John La Rooy's answer, one can use negative lookahead and lookbehind to handle both cases with a single regular expression:
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
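As a small sketch of how the lookaround pattern behaves at the edges of the string and next to letters, the snippet below runs it on a made-up sample (the `sample` text and the `FIVE_DIGITS` name are assumptions for illustration, not part of the original answer):

```python
import re

# Exactly five digits, not preceded and not followed by another digit.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")

# Hypothetical sample: codes at both ends of the string, one glued to
# letters, and one six-digit number that must be skipped.
sample = "12345 abc67890def 123456 54321"

codes = FIVE_DIGITS.findall(sample)
print(codes)  # ['12345', '67890', '54321']
```

Because the lookarounds only check for neighbouring digits, a code glued to letters (as in abc67890def) is still reported; whether that is desirable depends on the page being scraped.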
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
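For the full-string case, one possible sketch uses re.fullmatch instead of the ^ and $ anchors. The helper name `is_five_digit_code` is hypothetical. One difference worth knowing: $ also matches just before a trailing newline, while re.fullmatch does not accept that newline.

```python
import re

def is_five_digit_code(text):
    """Return True only if the whole string is exactly five ASCII digits."""
    return re.fullmatch(r"[0-9]{5}", text) is not None

print(is_five_digit_code("12345"))    # True
print(is_five_digit_code("123456"))   # False
print(is_five_digit_code("12345\n"))  # False

# By contrast, the anchored pattern accepts a trailing newline.
print(re.match(r"^[0-9]{5}$", "12345\n") is not None)  # True
```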
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
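The match-then-filter approach could look roughly like this (a minimal sketch; the variable names are only illustrative):

```python
import re

s = "four digits 1234 five digits 56789 six digits 012345"

# Grab every maximal run of digits first...
digit_runs = re.findall(r"\d+", s)

# ...then keep only the runs that are exactly five characters long.
five_digit = [run for run in digit_runs if len(run) == 5]

print(digit_runs)  # ['1234', '56789', '012345']
print(five_digit)  # ['56789']
```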
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than mess with special cases:
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
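To see why the padding matters, here is a short sketch on a made-up string whose codes sit at the very start and very end (the string `edge_case` is an assumption chosen for illustration):

```python
import re

pattern = r"\D(\d{5})\D"

# Hypothetical input: a code at the very beginning and another at the very end.
edge_case = "12345 abc 67890"

# Without padding, neither code has a non-digit on both sides.
unpadded = re.findall(pattern, edge_case)

# With one space added on each side, both codes are found.
padded = re.findall(pattern, " " + edge_case + " ")

print(unpadded)  # []
print(padded)    # ['12345', '67890']
```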
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how Python treats line endings and whitespace there, though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
Note: There is a problem in using \D, since \D matches (and consumes) any character that is not a digit; use \b instead.
\b is important here because it matches a word boundary, that is, the empty string at the beginning or end of a word, without consuming any characters.
import re
input = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", input)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", input)
output : ['56789', '01234']
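\D cannot handle numbers separated by a single comma or other single character, because the separator consumed at the end of one match is no longer available to start the next match.
\b is the important part here: it matches the empty string, but only at the beginning or end of a word.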
More documentation: https://docs.python.org/2/library/re.html
More clarification on the usage of \D vs \b:
This example uses \D but it doesn't capture all the five-digit numbers.
This example uses \b while capturing all five-digit numbers.
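One caveat that may be worth checking when choosing \b: letters and underscores count as word characters, so \b does not see a boundary between them and a digit. The sketch below compares \b with the digit-only lookaround pattern on a made-up string (the `text` value is an assumption for illustration):

```python
import re

# Hypothetical input: codes glued to an underscore and to letters,
# plus two codes separated by a single comma.
text = "code_12345 abc67890 11111,22222"

# \b requires a word/non-word transition, so codes attached to
# letters or underscores are not reported.
with_boundary = re.findall(r"\b\d{5}\b", text)

# The lookarounds only forbid neighbouring digits.
with_lookaround = re.findall(r"(?<!\d)\d{5}(?!\d)", text)

print(with_boundary)    # ['11111', '22222']
print(with_lookaround)  # ['12345', '67890', '11111', '22222']
```

Which behaviour is right depends on whether codes embedded in identifiers should count as coupon codes.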
Cheers
I use a regex with a simpler expression:
re.findall(r"\d{5}", mystring)
It will search for 5 digits. But you have to be sure the string contains no other sequence of 5 digits, for example inside a longer number.
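A quick sketch of what the unanchored pattern does when a longer number is present (the `mystring` value here is a made-up example):

```python
import re

# Hypothetical input: one real five-digit code and one ten-digit number.
mystring = "code 12345, order 1234567890"

matches = re.findall(r"\d{5}", mystring)

# The ten-digit number is split into two five-digit chunks.
print(matches)  # ['12345', '12345', '67890']
```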