Extracting float or int number and substring from a string
Question:
I’ve just learned regex in python3 and was trying to solve a problem.
The problem is something like this:
You are given a string where the first part is a float or integer number and the rest is a substring. You must split the number from the substring and return both as a list. The substring contains only the letters a-z and A-Z. The number can be negative.
For example:
- Input: 2.5ax
Output: ['2.5', 'ax']
- Input: -5bcf
Output: ['-5', 'bcf']
- Input: -69.67Gh
Output: ['-69.67', 'Gh']
and so on.
I made several attempts with regex to solve the problem.
1st attempt:
import re
i=input()
print(re.findall(r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$',i))
For the input -2.55xy, the expected output was ['-2.55', 'xy']
But the output came:
[('-2.55', '.55'), ('', '')]
2nd attempt:
My second attempt was similar to my first attempt just a little different:
import re
i=input()
print(re.findall(r'^(-?(\d+\.\d+)|\d+)|[a-zA-Z]+$',i))
For the same input -2.55xy, the output came as:
[('-2.55', '2.55'), ('', '')]
3rd attempt:
My next attempt was like that:
import re
i=input()
print(re.findall(r'^-?[1-9.]+|[a-z|A-Z]+$',i))
which matched the expected output for -2.55xy and also with the sample examples. But when the input is 2..5 or something like that, it considers that also as a float.
4th attempt:
import re
i=input()
value=re.findall(r"[a-zA-Z]+",i)
print([i.replace(value[0],""),value[0]])
which also matches the expected output but has the same problem as 3rd one that goes with it. Also, it doesn’t look like an effective way to do it.
Conclusion:
So I don’t know why my 1st and 2nd attempts aren’t working. The output comes as a list of tuples, which is maybe because of the groups, but I don’t know the exact reason or how to fix it. Maybe I didn’t understand the way the pattern works. Also, why didn’t the substring show in the output?
In the end, I want to know what’s the mistake in my code and how can I write better and more efficient code to solve the problem. Thank you and sorry for my bad English.
Answers:
The alternation | matches either the left part or the right part.
If the chars a-zA-Z are after the digit, you don’t need the alternation | and you can use 2 capture groups to get the matches in that order.
Then using re.findall will return a list of tuples for the capture group values.
(-?\d+(?:\.\d+)?)([a-zA-Z]+)
Explanation
- ( Capture group 1
- -?\d+ Match an optional - followed by 1+ digits
- (?:\.\d+)? Optionally match . and 1+ digits using a non capture group (so it is not outputted separately by re.findall)
- ) Close group 1
- ( Capture group 2
- [a-zA-Z]+ Match 1+ times a char a-z or A-Z
- ) Close group 2
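To see why the attempts in the question returned tuples with empty strings, here is a small sketch comparing how re.findall behaves with no groups, with capturing groups plus an alternation, and with the non capture group used above. The variable names are only for illustration.

```python
import re

text = "-2.55xy"

# No groups: findall returns the whole matches as plain strings.
no_groups = re.findall(r"-?\d+\.\d+|[a-zA-Z]+", text)

# Capturing groups: findall returns one tuple per match, one item per group.
# The second match ("xy") comes from the right-hand side of the alternation,
# where neither group took part, so both items are empty strings.
first_attempt = re.findall(r"^(-?\d+(\.\d+)?)|[a-zA-Z]+$", text)

# With the inner group made non-capturing and the letters in their own group,
# there is a single match and each tuple item is one of the wanted parts.
two_groups = re.findall(r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)", text)

print(no_groups)
print(first_attempt)
print(two_groups)
```

So the letters were matched in the first attempt, but they were outside every capture group, and findall only reports the groups once a pattern has any.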
import re
strings = [
    "2.5ax",
    "-5bcf",
    "-69.67Gh",
]
pattern = r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)"
for s in strings:
    print(re.findall(pattern, s))
Output
[('2.5', 'ax')]
[('-5', 'bcf')]
[('-69.67', 'Gh')]
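If you want a plain list like in the question and also want to reject input such as 2..5, one possible sketch is to use re.fullmatch with the same pattern, so the whole string has to fit. The function name split_number_and_letters and returning None for bad input are my own choices here, not part of the problem statement.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)")

def split_number_and_letters(s):
    """Return [number, letters] if the whole string fits the pattern, else None."""
    m = PATTERN.fullmatch(s)
    # groups() returns a tuple of the two capture groups; convert it to a list.
    return list(m.groups()) if m else None

for s in ["2.5ax", "-5bcf", "-69.67Gh", "2..5ax"]:
    print(s, split_number_and_letters(s))
```

Because fullmatch requires the entire string to match, there is no need for ^ and $ in the pattern.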
Lookahead and lookbehind with re.split sometimes simplify things:
- (?<=\d) lookbehind: the position is preceded by a digit
- (?=[a-zA-Z]) lookahead: the position is followed by a letter
That is, split at the position between the digit and the letter.
import re
strings = [
    "2.5ax",
    "-5bcf",
    "-69.67Gh",
]
for s in strings:
    print(re.split(r'(?<=\d)(?=[a-zA-Z])', s))
['2.5', 'ax']
['-5', 'bcf']
['-69.67', 'Gh']
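Note that splitting this way only finds the boundary and does not check the number itself, so 2..5ax would come back as ['2..5', 'ax']. A rough sketch of one way to add a check is below. The helper name split_checked is hypothetical, and float() is used as a loose validity test, which is an assumption on my part: it accepts some forms the problem may not want (for example a leading + or underscores between digits).

```python
import re

def split_checked(s):
    """Split between the last leading digit and the letters, then validate both parts."""
    # maxsplit=1 keeps everything after the first digit/letter boundary together.
    parts = re.split(r"(?<=\d)(?=[a-zA-Z])", s, maxsplit=1)
    if len(parts) != 2:
        return None
    number, letters = parts
    try:
        float(number)  # loose check; raises ValueError for text like "2..5"
    except ValueError:
        return None
    # The rest must consist only of ASCII letters.
    if not re.fullmatch(r"[a-zA-Z]+", letters):
        return None
    return [number, letters]

for s in ["2.5ax", "-69.67Gh", "2..5ax", "abc"]:
    print(s, split_checked(s))
```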