How to normalize text with regex?

Question

How can I normalize text with regex, with some conditions?

Suppose we have a string like this:
One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1

And I want to normalize it like this:
one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1

Remove all dots and commas (except between digits, as described in the last rule).
Split numbers from letters unless the token starts with the letter ‘M’: T933 -> T 933
Make everything lowercase.
Do not split if there is a dot or comma between digits: 35.4 -> 35.4, and 9,3 -> 9.3 (if there is a comma between digits, replace it with a dot).

What I have so far is this:

def process(str, **kwargs):
    str = str.replace(',', '.')
    str = re.split(r'(-?\d*\.?\d+)', str)
    str = ' '.join(str)
    str.lower()
    return str

but there is no condition for numbers that start with the letter ‘M’, so they are split as well.
Also, for some reason I get some unnecessary spaces after processing the string.

Are there any ideas on how to do this with regex? Or with helper methods like replace, lower, join and so on?

Asked By: Dmiich

Answer 1

I can suggest a solution like

re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()

The outer re.sub is meant to remove dots or commas when not between digits:

[.,] – a comma or dot
(?!(?<=\d.)\d) – a negative lookahead that fails the match if there is a digit immediately to the right that is immediately preceded by a digit + any one char
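
To see the outer step in isolation, here is a minimal sketch that applies only the punctuation-removal pattern. The function name `strip_loose_punctuation` is made up for illustration and is not part of the answer.

```python
import re

def strip_loose_punctuation(text):
    # Hypothetical helper: drop '.' or ',' unless it sits directly
    # between two digits (e.g. the dot in "35.4" is kept).
    return re.sub(r'[.,](?!(?<=\d.)\d)', '', text)

print(strip_loose_punctuation('two, three 35.4. four 9,3'))
# two three 35.4 four 9,3
```

The trailing dot after 35.4 is removed because no digit follows it, while the comma in 9,3 survives because it has a digit on both sides.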

The inner re.sub replaces with a space the following pattern:

(?<=[^\W\d_])(?<![MmXx])(?=\d) – a location between a letter ([^\W\d_] matches any letter) and a digit (see (?=\d)), where the letter is not M or X (case insensitive, [MmXx] can be written as (?i:[mx]))
| – or
(?<=\d)(?=[^\W\d_]) – a location between a digit and a letter.
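
The inner step can also be tried on its own. The sketch below only inserts spaces at the letter/digit boundaries described above; the names `BOUNDARY` and `split_letters_and_digits` are made up for illustration. It relies on `re.sub` replacing zero-width matches, which behaves as shown on Python 3.7 and later.

```python
import re

# Zero-width positions: letter-to-digit (unless the letter is M or X),
# or digit-to-letter.
BOUNDARY = re.compile(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])')

def split_letters_and_digits(text):
    # Hypothetical helper: insert a space at every matched boundary.
    return BOUNDARY.sub(' ', text)

print(split_letters_and_digits('T933 three35.4 M2x13 aa88aa'))
# T 933 three 35.4 M2 x13 aa 88 aa
```

Note that M2x13 becomes M2 x13: the position between 2 and x is a digit-to-letter boundary, and only the letter-to-digit alternative has the M/X exception.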

See the Python demo:

import re
text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
print( re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower() )

Output:

one t 933 two three 35.4 four 9,3 8.5 five m2 x13 m4.3 x2.1 aa 88 aa
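
This output splits m2 x13 and keeps the comma in 9,3, whereas the question asked for m2x13 and says a comma between digits should become a dot. The following is a possible variant, not part of the original answer, sketched under two assumptions: a digit followed by x or X is treated as part of a dimension token and is not split, and commas between digits are converted to dots as the question's last rule states. The function name `normalize` is made up for illustration.

```python
import re

def normalize(text):
    # 1. Split letter/digit boundaries, except after M or X and
    #    (assumption) before x or X, so "M2x13" stays in one piece.
    text = re.sub(
        r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?![Xx])(?=[^\W\d_])',
        ' ', text)
    # 2. Assumption from the question's rules: "9,3" becomes "9.3".
    text = re.sub(r'(?<=\d),(?=\d)', '.', text)
    # 3. Remove any remaining dot or comma that is not between digits.
    text = re.sub(r'[.,](?!(?<=\d.)\d)', '', text)
    return text.lower()

text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1'
print(normalize(text))
# one t 933 two three 35.4 four 9.3 8.5 five m2x13 m4.3x2.1
```

A side effect of the extra (?![Xx]) check is that any digit followed by x is left unsplit, so input such as 5xyz would stay together; whether that is acceptable depends on the data.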

Answered By: Wiktor Stribiżew
