How to normalize text with regex?
Question:
How can I normalize text with regex, with some conditional rules?
If we have a string like this
One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1
and I want to normalize it like this
one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1
- Remove all dots and commas, except between digits (see the last rule).
- Split a number from the letters next to it, unless the token starts with the letter ‘M’: T933 -> T 933
- Convert everything to lowercase.
- Do not split if there is a dot or comma between digits: 35.4 -> 35.4, and 9,3 -> 9.3 (a comma between digits is replaced with a dot).
What I have been able to do is this:
import re

def process(str, **kwargs):
    str = str.replace(',', '.')
    str = re.split(r'(-?\d*\.?\d+)', str)
    str = ' '.join(str)
    str.lower()
    return str
But there is no condition for numbers that start with the letter ‘M’, so they are split as well.
Also, for some reason I get some unnecessary spaces after processing the string.
Are there any ideas on how to do this with regex? Or with the help of methods like replace, lower, join and so on?
Answers:
I can suggest a solution like
re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()
The outer re.sub is meant to remove dots or commas when not between digits:
- [.,] matches a comma or dot
- (?!(?<=\d.)\d) is a negative lookahead that fails the match if there is a digit immediately to the right that is itself immediately preceded by a digit + any one char (here, the dot or comma just matched)
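To see the punctuation pattern on its own, here is a minimal sketch that applies only this step to a few short strings. The name PUNCT and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Match a dot or comma unless it sits directly between two digits.
# (PUNCT is a hypothetical name used only for this illustration.)
PUNCT = re.compile(r'[.,](?!(?<=\d.)\d)')

for sample in ['two,', '35.4.', '9,3', 'a.5', '5.a']:
    # Separators with a digit on both sides survive; all others are removed.
    print(repr(sample), '->', repr(PUNCT.sub('', sample)))
```

Only a separator with a digit on both sides is kept, so the trailing dot in 35.4. is dropped while the inner one stays.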
The inner re.sub replaces the following pattern with a space:
- (?<=[^\W\d_])(?<![MmXx])(?=\d) matches a location between a letter ([^\W\d_] matches any letter) and a digit (see (?=\d)), where the letter is not M or X (case insensitive; [MmXx] can be written as (?i:[mx]))
- | means "or"
- (?<=\d)(?=[^\W\d_]) matches a location between a digit and a letter.
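The splitting pattern can also be tried in isolation. The sketch below is only an illustration: SPLIT and SPLIT_I are hypothetical names, and SPLIT_I uses the scoped inline flag spelling, which assumes Python 3.6 or later.

```python
import re

# Zero-width match at a letter|digit boundary (unless the letter is M or X)
# or at a digit|letter boundary.
SPLIT = re.compile(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])')

# Same pattern with the case-insensitive group spelling of [MmXx].
SPLIT_I = re.compile(r'(?<=[^\W\d_])(?<!(?i:[mx]))(?=\d)|(?<=\d)(?=[^\W\d_])')

for sample in ['T933', 'three35.4', 'M2x13', 'aa88aa']:
    # Both spellings are expected to insert the space at the same positions.
    print(sample, '->', SPLIT.sub(' ', sample), '|', SPLIT_I.sub(' ', sample))
```

Note that M2x13 becomes M2 x13: the digit-to-letter alternative still fires before the x, because only the letter-to-digit side checks for M or X.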
See the Python demo:
import re
text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
print( re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower() )
Output:
one t 933 two three 35.4 four 9,3 8.5 five m2 x13 m4.3 x2.1 aa 88 aa
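The output above still splits m2 x13 and keeps 9,3 with a comma, which differs from what the question asks for. If tokens beginning with M followed by a digit should stay intact and a comma between digits should become a dot, one possible token-by-token variant is sketched below. This is an assumption about the intended rules, not the method from the answer; normalize and the pattern names are hypothetical.

```python
import re

# Letter|digit or digit|letter boundary, with no exception for M/X here;
# the M exception is handled per token instead.
BOUNDARY = re.compile(r'(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])')
DECIMAL_COMMA = re.compile(r'(?<=\d),(?=\d)')    # comma between two digits
STRAY_PUNCT = re.compile(r'[.,](?!(?<=\d.)\d)')  # dot/comma not between digits

def normalize(text):
    out = []
    for token in text.split():
        token = DECIMAL_COMMA.sub('.', token)  # 9,3 -> 9.3
        token = STRAY_PUNCT.sub('', token)     # drop other dots and commas
        # Assumed rule: tokens like M2x13 (M followed by a digit) stay whole.
        if not re.match(r'[Mm]\d', token):
            token = BOUNDARY.sub(' ', token)
        if token:                              # skip tokens that became empty
            out.append(token.lower())
    return ' '.join(out)

text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1'
print(normalize(text))
# one t 933 two three 35.4 four 9.3 8.5 five m2x13 m4.3x2.1
```

Splitting on whitespace first and joining with single spaces also avoids the extra spaces that re.split with a capturing group followed by ' '.join tends to produce.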