REGEX Match number in a line with a keyword
Question:
I tried many patterns but cannot get the correct result. I want to match only the floats on a line that has the keyword range at the beginning. My trouble is that range can be followed by a colon with or without whitespace around it, by whitespace alone, and so on.
My best try is to use two patterns:
#1. (?i)(?<=range[: ])[:a-zA-Z0-9.$ -]+
#2. [0-9.]+
First I run the regex with pattern #1, then I take the output of pattern #1 and run the regex one more time with pattern #2.
How can I do that in one single pattern? Thanks so much.
One more thing: my code is in Python.
Input: range: $0.82 --> Expected output: 0.82
Input: range:0.82 --> Expected output: 0.82
Input: range: 0.82 - 0.85 --> Expected output: 0.82, 0.85
Input: range : 0.82 - 0.85 --> Expected output: 0.82, 0.85
Input: range : 0.82 - 0.85 --> Expected output: 0.82, 0.85
Input: range 0.82 0.85 --> Expected output: 0.82, 0.85
Answers:
This seems to work for me, although there are probably more efficient ways of doing it:
    import re

    input_data = ['range: $0.82',
                  'range:0.82',
                  'range: 0.82 - 0.85',
                  'range : 0.82 - 0.85',
                  'range : 0.82 - 0.85',
                  'range 0.82 0.85']

    for line in input_data:
        # Groups 3 and 5 hold the first and the (optional) second number
        output = re.findall(r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?', line)
        a = output[0][2]
        b = output[0][4]
        print(f'Input: {line} --> Expected output: {a} , {b}')
OUTPUT:
Input: range: $0.82 --> Expected output: 0.82 ,
Input: range:0.82 --> Expected output: 0.82 ,
Input: range: 0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range : 0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range : 0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range 0.82 0.85 --> Expected output: 0.82 , 0.85
You could also add an if statement to check whether b is empty and control the output as required. However, I think the main thing you wanted was a single regex that extracts the two numbers in question (if available).
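The empty check mentioned above might look like the following sketch. It assumes the pattern is written with its escapes intact (`\s`, `\.`); the helper name `extract_range` is made up for illustration and is not part of the original answer.

```python
import re

# Same five-group pattern as in the answer, with the escapes written out.
PATTERN = re.compile(
    r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?'
)

def extract_range(line):
    """Hypothetical helper: return the numbers after 'range' as strings."""
    match = PATTERN.search(line)
    if match is None:
        return []
    first, second = match.group(3), match.group(5)
    # group(5) is None when the optional second number is absent
    return [first, second] if second else [first]

for line in ['range: $0.82', 'range : 0.82 - 0.85', 'no keyword here']:
    print(line, '->', extract_range(line))
```

Using `search` with an explicit `None` check also avoids the `IndexError` that `output[0]` would raise on a line that does not match at all.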
Regex statement explanation:
r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?'
First Group: (range)
- This puts 'range' into the first group.
Second Group: (\s*:?\s*[$]*)
- \s* matches zero or more whitespace characters
- :? matches an optional colon (:)
- [$]* matches zero or more dollar signs ($)
Third Group: ([0-9]*\.[0-9]*)
- [0-9]* matches zero or more digits
- \. matches a decimal point
- This is the group that holds the first number (0.82)
Fourth Group: (\s*-?\s*)
- \s* matches zero or more whitespace characters
- -? matches an optional hyphen
Fifth Group: ([0-9]*\.[0-9]*)?
- [0-9]* matches zero or more digits
- \. matches a decimal point
- The ? at the end makes the group optional.
- This is the group that holds the second number (0.85)
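The group-by-group breakdown can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is only a sketch of the same five-group pattern, assuming the escapes `\s` and `\.` as described above.

```python
import re

PATTERN = re.compile(r'''
    (range)              # group 1: the keyword
    (\s*:?\s*[$]*)       # group 2: optional whitespace, colon, dollar signs
    ([0-9]*\.[0-9]*)     # group 3: first number
    (\s*-?\s*)           # group 4: optional whitespace and hyphen
    ([0-9]*\.[0-9]*)?    # group 5: optional second number
''', re.VERBOSE)

m = PATTERN.search('range : 0.82 - 0.85')
print(m.group(3), m.group(5))
```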
You could avoid regex completely. Those lines are not difficult to parse.
    def parse(line):
        if not line.startswith('range'):
            return
        line = line.replace(':', ' ').replace('$', '')
        for token in line.split():
            try:
                yield float(token)
            except ValueError:
                continue

    input_data = ['range: $0.82',
                  'range:0.82',
                  'range: 0.82 - 0.85',
                  'range : 0.82 - 0.85',
                  'range : 0.82 - 0.85',
                  'range 0.82 0.85']

    r = [list(i) for i in map(parse, input_data)]
    print(r)
Output:
    [[0.82], [0.82], [0.82, 0.85], [0.82, 0.85], [0.82, 0.85], [0.82, 0.85]]
You could use this regex to extract your data:
^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?
Regex explanation:
^ : beginning of string
\s*range : asserts the string starts with range (possibly preceded by whitespace; if you don't want that, remove the \s*)
\D* : some number of non-digit characters
(\d+(?:\.\d+)?) : a number, captured in group 1
(?:\D*(\d+(?:\.\d+)?))? : an optional group of some non-digits followed by a number, captured in group 2
In Python:
    import re

    input_data = ['range: $0.82',
                  'range:0.82',
                  'range: 0.82 - 0.85',
                  'range : 0.82 - 0.85',
                  'range : 0.82 - 0.85',
                  'range 0.82 0.85']

    # findall returns one (group 1, group 2) tuple per match; [0] takes the only match
    results = [re.findall(r'^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?', d)[0] for d in input_data]
    print(results)
Output:
[
('0.82', ''),
('0.82', ''),
('0.82', '0.85'),
('0.82', '0.85'),
('0.82', '0.85'),
('0.82', '0.85')
]
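The tuples above still contain strings, with an empty string where the second number is missing. A minimal sketch of turning them into floats follows; the helper name `range_values` is hypothetical, and the pattern is assumed to be the one above with its escapes (`\s`, `\D`, `\d`, `\.`) written out.

```python
import re

PATTERN = re.compile(r'^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?')

def range_values(line):
    """Hypothetical helper: floats captured after a leading 'range'."""
    match = PATTERN.match(line)
    if match is None:
        return []
    # groups() yields None for the second number when it did not match
    return [float(g) for g in match.groups() if g is not None]

print(range_values('range: $0.82'))
print(range_values('range 0.82 0.85'))
print(range_values('total: 0.82'))
```

Unlike indexing `re.findall(...)[0]`, which raises `IndexError` when a line has no match, this returns an empty list for lines that do not start with range.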
If you can use the Python regex module from PyPI, which supports variable-length lookbehind, you can match multiple occurrences:
(?<=^range\b[\s:$\-\d.]*)\d+(?:\.\d+)?
Explanation
(?<= : positive lookbehind, assert that to the left is
^range\b : match range at the start of the string, followed by a word boundary
[\s:$\-\d.]* : optionally match all allowed chars that could be in between
) : close the lookbehind assertion
\d+(?:\.\d+)? : match 1+ digits with an optional decimal part
Example
import regex
strings = [
"range: $0.82",
"range:0.82",
"range: 0.82 - 0.85",
"range : 0.82 - 0.85",
"range : 0.82 - 0.85",
"range 0.82 0.85"
]
pattern = r"(?<=^range\b[\s:$\-\d.]*)\d+(?:\.\d+)?"
for s in strings:
    print(regex.findall(pattern, s))
Output
['0.82']
['0.82']
['0.82', '0.85']
['0.82', '0.85']
['0.82', '0.85']
['0.82', '0.85']
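If installing the third-party regex module is not an option, a similar result is possible with the standard re module, which only accepts fixed-width lookbehinds. The sketch below swaps the lookbehind for two steps: check the keyword with `re.match`, then collect numbers from the rest of the line. The helper name `numbers_after_range` is hypothetical. Note one difference in behaviour: this version does not restrict which characters may appear between the keyword and the numbers.

```python
import re

NUMBER = re.compile(r'\d+(?:\.\d+)?')

def numbers_after_range(line):
    """Hypothetical helper: all numbers on a line that starts with 'range'."""
    head = re.match(r'range\b', line)
    if head is None:
        return []
    # Scan only the text after the keyword
    return NUMBER.findall(line, head.end())

for s in ['range: $0.82', 'range : 0.82 - 0.85', 'arrange 0.5']:
    print(numbers_after_range(s))
```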