Extract last sequence of digits from string along with everything that precedes it
Question:
Consider the following string:
AB01CD03
What I want to do is break it down into two tokens, namely AB01CD and 03.
In my string the number of digits following the last alpha character is unknown. There is always a sequence of digits at the end of the string.
Now, I can do this:
import re
S = 'AB01CD03'
v, = re.findall(r'(\d+)$', S)
assert v == '03'
…and because I now know the length of v I can deduce how to acquire the preamble using a slice – e.g.,
preamble = S[:-len(v)]
assert preamble == 'AB01CD'
Bearing in mind that the preamble may contain digits, what I’m looking for is a single RE that will reveal the two separate tokens – i.e.,
a, b = re.findall(MAGIC_EXPRESSION, S)
Is this possible?
Answers:
Yes, like this:
import re
s = 'AB01CD03'
m = re.match(r'^(.+?)(\d+)$', s)
print(m.group(1), m.group(2))
This works because the group (.+?) is not greedy, so the second group (\d+) is allowed to match all the digits at the end. The anchors ^ and $ ensure the groups sit at the start and end of the string respectively.
Result:
AB01CD 03
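To illustrate the point about greediness, here is a small sketch that runs the same pattern with a lazy and a greedy first group. The greedy variant is not part of the answer; it is included only for contrast.

```python
import re

s = 'AB01CD03'

# Lazy first group: it takes as few characters as possible,
# so the digit group ends up with the whole trailing run.
lazy = re.match(r'^(.+?)(\d+)$', s).groups()

# Greedy first group: it takes as many characters as possible
# and leaves the digit group only the one digit it needs.
greedy = re.match(r'^(.+)(\d+)$', s).groups()

print(lazy)    # ('AB01CD', '03')
print(greedy)  # ('AB01CD0', '3')
```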
Closer to the syntax you were asking for:
a, b = re.match(r'^(.+?)(\d+)$', s).groups()
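Note that re.match returns None when the string does not end in digits, so calling .groups() directly would raise AttributeError. The question says the digits are always present, but if that cannot be guaranteed, a guard along these lines may help. The helper name split_trailing_digits is made up for this sketch.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
_TRAILING_DIGITS = re.compile(r'^(.+?)(\d+)$')


def split_trailing_digits(text):
    """Return (preamble, digits), or None if text does not end in digits.

    Hypothetical helper, not part of the original answer.
    """
    m = _TRAILING_DIGITS.match(text)
    return m.groups() if m else None


print(split_trailing_digits('AB01CD03'))  # ('AB01CD', '03')
print(split_trailing_digits('ABCD'))      # None
```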
You can use this:
import re
ls = ['AB01CD03', 'AB34565701CD04564563']
for s in ls:
    a, b = re.findall(r'(.*(?:\D|^))(\d+)', s)[0]
    print(a, b)
Output:
AB01CD 03
AB34565701CD 04564563
Explanation of (.*(?:\D|^))(\d+):
1st capturing group: (.*(?:\D|^))
- . matches any character (except for line terminators)
- * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- Non-capturing group (?:\D|^)
  - 1st alternative \D matches any character that's not a digit (equivalent to [^0-9] for ASCII input)
  - 2nd alternative ^ asserts position at the start of the string (or of a line in multiline mode)
2nd capturing group: (\d+)
- \d matches a digit (equivalent to [0-9] for ASCII input)
- + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
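The two patterns agree on the inputs from the question, but they differ on a string that consists only of digits, which is outside what the question describes. The following sketch compares them; the variable names and the all-digit sample '2024' are chosen only for illustration.

```python
import re

lazy_pattern = re.compile(r'^(.+?)(\d+)$')          # first answer
greedy_pattern = re.compile(r'(.*(?:\D|^))(\d+)')   # second answer

results = {}
for s in ['AB01CD03', '2024']:
    results[s] = (lazy_pattern.match(s).groups(),
                  greedy_pattern.findall(s)[0])
    print(s, *results[s])

# AB01CD03 ('AB01CD', '03') ('AB01CD', '03')
# 2024 ('2', '024') ('', '2024')
```

The first pattern requires at least one character in the preamble, so it takes the leading digit for it. The second allows an empty preamble through the ^ alternative.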