Python Regex to extract text between numbers

Question:

I'd like to extract the text between numbers. For example, if I have text such as the following:

1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares 
TG MARTIN
C MARTIN
7500 ORDINARY shares 
ARCO LIMITED

I want to produce a list of 3 elements, where each element is the text between two numbers, including the starting number but not the next one. The final element has no number after it, so it should run to the end of the text.

[
'1964 ORDINARY shares \nEXECUTORS OF JOANNA C RICHARDSON',
'100 ORDINARY shares \nTG MARTIN\nC MARTIN\n',
'7500 ORDINARY shares\nARCO LIMITED'
]

I tried doing this

regex = r'\d(.+?)\d'
re.findall(regex, a, re.DOTALL)

but it returned

['9',
 ' ORDINARY shares\nEXECUTORS OF JOANNA C RICHARDSON\n',
 '0 ORDINARY shares\nTG MARTIN\nC MARTIN\n',
 '0']
Asked By: user1753640


Answers:

You can use the code below to achieve this.

import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares 
TG MARTIN
C MARTIN
7500 ORDINARY shares 
ARCO LIMITED"""

# Match a run of digits, then lazily consume text (including newlines)
# up to the next digit or the end of the string
pattern = r'\d+.*?(?=\d|$)'
matches = re.findall(pattern, text, flags=re.DOTALL)

print(matches)
Answered By: warwick12

The pattern \d(.+?)\d matches at least 3 characters: a digit on each side, with the part in between captured in group 1 (where (.+?) matches at least 1 character).

You get those results because you are using a capture group with re.findall, which returns the value of the capture group.

So for example in 1964 you match 196, where 9 is captured in group 1 and that is the first value in your result.
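To make the capture group behaviour concrete, here is a small sketch comparing the original pattern with and without the group. The shortened sample string and the variable names are only for illustration.

```python
import re

# Shortened sample, for illustration only
text = "1964 ORDINARY shares\nEXECUTORS OF JOANNA C RICHARDSON\n100 ORDINARY shares"

# With a capture group, re.findall returns only group 1 of each match
captured = re.findall(r'\d(.+?)\d', text, re.DOTALL)

# Without a capture group, re.findall returns the whole match
whole = re.findall(r'\d.+?\d', text, re.DOTALL)

print(captured[0])  # 9
print(whole[0])     # 196
```

In both cases the first match consumes 196, so the leading digits are already split up before the text you care about starts.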

A downvoted and since-removed answer by markalex and a comment by Michael Butscher hold the key: a pattern that needs neither re.DOTALL nor a non-greedy quantifier.

\b\d+\b\D*

Explanation

  • \b\d+\b Match 1+ digits between word boundaries to prevent a partial word match
  • \D* Match optional chars other than digits, including newlines
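As a quick sketch, the pattern can be applied to the sample text from the question like this. The strip() step is an optional addition (not part of the pattern) that removes the trailing newline each match picks up.

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# No re.DOTALL needed: \D matches any non-digit character, newlines included
matches = re.findall(r'\b\d+\b\D*', text)

# Optional cleanup: drop the trailing newline left on each match
items = [m.strip() for m in matches]

for item in items:
    print(repr(item))
```

Each match starts at a number and runs until the next digit, so the sample yields three elements.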


If each match should start at the beginning of a line, and the digits should be followed by a whitespace char, you might also consider anchoring the pattern and using re.M for multiline mode

^\d+\s\D*
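A minimal sketch of the anchored variant, assuming re.M is passed as the flag. The second sample string (with a number in the middle of a line) is made up here to show where the two patterns differ.

```python
import re

text = """1964 ORDINARY shares
EXECUTORS OF JOANNA C RICHARDSON
100 ORDINARY shares
TG MARTIN
C MARTIN
7500 ORDINARY shares
ARCO LIMITED"""

# With re.M, ^ matches at the start of every line, not only at the start of the string
anchored = re.findall(r'^\d+\s\D*', text, re.M)
print(len(anchored))  # 3

# Hypothetical input with a number in the middle of a line
tricky = "100 ORDINARY shares\nFLAT 2 LTD"
print(re.findall(r'^\d+\s\D*', tricky, re.M))  # ['100 ORDINARY shares\nFLAT ']
print(re.findall(r'\b\d+\b\D*', tricky))       # ['100 ORDINARY shares\nFLAT ', '2 LTD']
```

The anchor stops a mid-line number from starting a new element, but \D* still ends the current match at any digit, so names containing digits would need a different pattern.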


Answered By: The fourth bird