Python Regex to extract text between numbers
Question:
I’d like to extract the text between digits. For example, if I have text such as the following:
1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED
I want to produce a list of 3 elements, where each element is the text from one number up to (but not including) the next number. The final element runs to the end of the text, since no number follows it.
[
'1964 ORDINARY shares \nEXECUTORS OF JOANNA C RICHARDSON',
'100 ORDINARY shares \nTG MARTIN\nC MARTIN\n',
'7500 ORDINARY shares\nARCO LIMITED'
]
I tried doing this
regex = r'\d(.+?)\d'
re.findall(regex, a, re.DOTALL)
but it returned
['9',
' ORDINARY shares\nEXECUTORS OF JOANNA C RICHARDSON\n',
'0 ORDINARY shares\nTG MARTIN\nC MARTIN\n',
'0']
Answers:
You can use the below code to achieve this.
import re
text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""
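# Match a run of digits, then lazily consume any characters (including
# newlines, via re.DOTALL) up to the next digit or the end of the string
pattern = r'\d+.*?(?=\d|$)'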
matches = re.findall(pattern, text, flags=re.DOTALL)
print(matches)
The pattern \d(.+?)\d matches at least 3 characters: the two outer digits are matched, and the part between them is captured in group 1 (where (.+?) matches at least 1 character).
You get those results because you are using a capture group with re.findall, which returns the value of the capture group.
So for example, in 1964 you match 196, where 9 is captured in group 1, and that is the first value in your result.
There is a downvoted and removed answer by markalex and a comment by Michael Butscher that hold the key: use a pattern that needs neither re.DOTALL nor a non-greedy quantifier.
\b\d+\b\D*
Explanation
\b\d+\b
Match 1+ digits between word boundaries to prevent a partial word match
\D*
Match optional chars other than digits, including newlines
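As a rough illustration, the word-boundary pattern can be applied to the sample text from the question like this. The `.strip()` step is an addition of mine, assuming the trailing newline on each match is unwanted; the names `matches` and `stripped` are illustrative.

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# \D matches newlines too, so no re.DOTALL flag is needed.
matches = re.findall(r'\b\d+\b\D*', text)

# Optional: drop the trailing newline left at the end of each match.
stripped = [m.strip() for m in matches]

for item in stripped:
    print(repr(item))
```

Each match starts at a number and stops right before the next digit, which gives the three elements asked for in the question.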
If each match should start at the beginning of a line and the digits should be followed by a whitespace char, you might also consider using an anchor with re.M for multiline mode:
^\d+\s\D*
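To see where the anchored pattern differs from the word-boundary one, here is a small comparison. The line UNIT 5 HOLDINGS is hypothetical data invented for this sketch (it is not in the question), chosen because it has a number in the middle of a line.

```python
import re

# Hypothetical input: "UNIT 5 HOLDINGS" is made up to show a mid-line number.
text = "100 ORDINARY shares\nUNIT 5 HOLDINGS\n7500 ORDINARY shares"

# Word-boundary version: any standalone number starts a new match.
unanchored = re.findall(r'\b\d+\b\D*', text)

# Anchored version: a match may only start at the beginning of a line.
anchored = re.findall(r'^\d+\s\D*', text, re.M)

print(unanchored)  # ['100 ORDINARY shares\nUNIT ', '5 HOLDINGS\n', '7500 ORDINARY shares']
print(anchored)    # ['100 ORDINARY shares\nUNIT ', '7500 ORDINARY shares']
```

Note that in both versions \D* still stops at the first digit it meets, so the anchor only prevents a new match from starting mid-line; it does not keep the rest of that line in the preceding match.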