REGEX Match number in a line with a keyword

Question:

I tried many patterns but cannot get the correct result. I want to match only the floats on a line that has the keyword range at the beginning. My trouble is that range can be followed by a colon with varying whitespace around it (":", ": ", " : ", and so on), or by whitespace alone.

My best try is to use two patterns:

#1. (?i)(?<=range[: ])[:a-zA-Z0-9.$ -]+

#2. [0-9.]+

First I run the regex with pattern #1, then I take the output of pattern #1 and run the regex one more time with pattern #2.

How can I do that in one single pattern? Thanks so much

One more thing: my code is Python

Input:
range: $0.82
–> Expected output: 0.82

Input:
range:0.82
–> Expected output: 0.82

Input:
range: 0.82 - 0.85
–> Expected output: 0.82, 0.85

Input:
range : 0.82 - 0.85
–> Expected output: 0.82, 0.85

Input:
range   :  0.82 - 0.85
–> Expected output: 0.82, 0.85

Input:
range 0.82 0.85
–> Expected output: 0.82, 0.85

Asked By: Triho


Answers:

This seems to work for me – however – there are probably a number of more efficient ways of doing it:

import re

input_data = ['range: $0.82',
              'range:0.82',
              'range:  0.82 - 0.85',
              'range : 0.82 - 0.85',
              'range   :  0.82 - 0.85',
              'range 0.82   0.85']

for line in input_data:
    # Groups 3 and 5 capture the first and (optional) second number.
    output = re.findall(r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?', line)
    a = output[0][2]
    b = output[0][4]
    print(f'Input: {line} --> Expected output: {a} , {b}')

OUTPUT:

Input: range: $0.82 --> Expected output: 0.82 , 
Input: range:0.82 --> Expected output: 0.82 , 
Input: range:  0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range : 0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range   :  0.82 - 0.85 --> Expected output: 0.82 , 0.85
Input: range 0.82   0.85 --> Expected output: 0.82 , 0.85

You could also add some IF-statements to check to see if ‘b’ is empty, and control the output as required. However, I think the main thing that you wanted to achieve was a single REGEX statement that could extract the two numbers in question (if available).
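A minimal sketch of that check, assuming the same five-group pattern with its escapes written out. The helper name `extract_numbers` is hypothetical and only used for illustration. It uses `match` instead of `findall`, so the keyword has to sit at the very start of the line, and it keeps only the number groups that actually matched:

```python
import re

# Same five-group pattern as above; groups 3 and 5 hold the numbers.
PATTERN = re.compile(
    r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?'
)

def extract_numbers(line):
    """Return the numbers that follow 'range' as a list of strings."""
    match = PATTERN.match(line)  # match() only succeeds at the start of the line
    if match is None:
        return []
    # group(5) is None when the optional second number is absent.
    return [g for g in (match.group(3), match.group(5)) if g]

print(extract_numbers('range: $0.82'))
print(extract_numbers('range 0.82   0.85'))
```

A line that does not start with range, or that has no decimal number after it, yields an empty list.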

Regex statement explanation:

r'(range)(\s*:?\s*[$]*)([0-9]*\.[0-9]*)(\s*-?\s*)([0-9]*\.[0-9]*)?'

First Group: (range)

This puts ‘range‘ into the first group.

Second Group: (\s*:?\s*[$]*)

  • \s* matches zero or more whitespace characters
  • :? matches an optional colon (:)
  • [$]* matches zero or more dollar signs ($)

Third Group: ([0-9]*\.[0-9]*)

  • [0-9]* matches zero or more digits
  • \. matches a literal decimal point
  • this is the group that relates to the number (0.82)

Fourth Group: (\s*-?\s*)

  • \s* matches zero or more whitespace characters
  • -? matches an optional hyphen

Fifth Group: ([0-9]*\.[0-9]*)?

  • [0-9]* matches zero or more digits
  • \. matches a literal decimal point
  • The ? at the end suggests that the group is optional.
  • This is the group that holds the second number (0.85)
Answered By: ScottC

You could avoid regex completely. Those lines are not difficult to parse.

def parse(line):
    if not line.startswith('range'):
        return
    line = line.replace(':',' ').replace('$','')
    for token in line.split():
        try:
            yield float(token)
        except ValueError:
            continue
            

input_data = ['range: $0.82',
              'range:0.82',
              'range:  0.82 - 0.85',
              'range : 0.82 - 0.85',
              'range   :  0.82 - 0.85',
              'range 0.82   0.85']

r = [list(i) for i in map(parse, input_data)]
print(r)
[[0.82], [0.82], [0.82, 0.85], [0.82, 0.85], [0.82, 0.85], [0.82, 0.85]]
Answered By: alec_djinn

You could use this regex to extract your data:

^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?

Regex explanation:

  • ^ : beginning of string
  • \s*range : asserts the string starts with range (possibly preceded by whitespace; if you don't want that, remove the \s*)
  • \D* : some number of non-digit characters
  • (\d+(?:\.\d+)?) : a number, captured in group 1
  • (?:\D*(\d+(?:\.\d+)?))? : an optional group of some non-digits followed by a number, captured in group 2

In Python:

import re

input_data = ['range: $0.82',
              'range:0.82',
              'range:  0.82 - 0.85',
              'range : 0.82 - 0.85',
              'range   :  0.82 - 0.85',
              'range 0.82   0.85']
results = [re.findall(r'^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?', d)[0] for d in input_data]
print(results)

Output:

[
 ('0.82', ''),
 ('0.82', ''),
 ('0.82', '0.85'),
 ('0.82', '0.85'),
 ('0.82', '0.85'),
 ('0.82', '0.85')
]
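If a flat list of floats is preferred over tuples with an empty second slot, the unmatched group can be filtered out. This is a small sketch, assuming the same pattern with its escapes written out. The helper name `range_values` is hypothetical:

```python
import re

# Same two-group pattern as above, compiled once.
PATTERN = re.compile(r'^\s*range\D*(\d+(?:\.\d+)?)(?:\D*(\d+(?:\.\d+)?))?')

def range_values(line):
    """Return the one or two numbers after 'range' as floats."""
    match = PATTERN.match(line)
    if match is None:
        return []
    # groups() yields None for the optional second number when it is missing.
    return [float(g) for g in match.groups() if g is not None]

print(range_values('range:0.82'))
print(range_values('range   :  0.82 - 0.85'))
```

Lines that do not start with range return an empty list instead of raising an IndexError, which the `[0]` index on an empty `findall` result would do.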
Answered By: Nick

If you can make use of the Python regex module from PyPI, which supports variable-width lookbehind, then you can get multiple occurrences with a single pattern:

(?<=^range\b[\s:$\-\d.]*)\d+(?:\.\d+)?

Explanation

  • (?<= Positive lookbehind, assert that to the left is
    • ^range\b Match range at the start of the string, followed by a word boundary
    • [\s:$\-\d.]* Optionally match all allowed chars that could be in between
  • ) Close the lookbehind assertion
  • \d+(?:\.\d+)? Match 1+ digits with an optional decimal part


Example

import regex

strings = [
"range: $0.82",
"range:0.82",
"range:  0.82 - 0.85",
"range : 0.82 - 0.85",
"range   :  0.82 - 0.85",
"range 0.82   0.85"
]
pattern = r"(?<=^range\b[\s:$\-\d.]*)\d+(?:\.\d+)?"

for s in strings:
    print (regex.findall(pattern, s))

Output

['0.82']
['0.82']
['0.82', '0.85']
['0.82', '0.85']
['0.82', '0.85']
['0.82', '0.85']
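The standard-library `re` module rejects this pattern because it only accepts fixed-width lookbehinds. If installing the third-party module is not an option, a two-step check gives the same results for the sample inputs. This is a different technique from the single-pattern lookbehind, sketched here with a hypothetical helper name `find_numbers`:

```python
import re

# A number: one or more digits with an optional decimal part.
NUMBER = re.compile(r'\d+(?:\.\d+)?')

def find_numbers(line):
    """Return all numbers on the line, but only if it starts with 'range'."""
    # \b stops words such as 'ranger' from counting as the keyword.
    if not re.match(r'range\b', line):
        return []
    return NUMBER.findall(line)

print(find_numbers('range: $0.82'))
print(find_numbers('range:  0.82 - 0.85'))
```

Unlike the lookbehind version, this sketch does not restrict which characters may appear between the keyword and the numbers, so it picks up every number anywhere on a line that starts with range.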
Answered By: The fourth bird