Regex – grabbing sections with periods 1.1, 1.1.1, etc

Question:

I am trying to use regex to grab the text between sections of a document that have numbered headers. The document has a table of contents, and the section headers use period-separated numbers. Ex: 1. Introduction, 1.1 Something, 1.1.1 Something Else
I’m able to parse the TOC just fine and get just the section numbers (1.1, 1.1.1, etc.), but I am failing to parse the text of the document between two of those numbers.

Consider the following (given the document text is just one big string):

1.1 Introduction
There are some sentences in here that I want and I want to do other things with them. There could be hundreds of sentences, who cares.
1.1.1 Something Else
This is where we talk about something else in life.
...
5.1.1 Conclusion

I have tried the following, and a few variations of it, to get the text between two sections (1.1 and 1.1.1, for example), but I seem to be stuck.

(?s)1.0(.*)1.1

This works if the only thing in the document is sections 1.0 and 1.1 but since I don’t have that luxury…. any help is greatly appreciated.

Asked By: kpcrash

||

Answers:

I’m not completely sure of how it’s being used in your python code, but here’s a regex that might help:

/([\d.]+)/g

or in python:

import re

matches = re.findall(r"([\d.]+)", your_string)

As an explanation:

  • \d means any numeral char (0-9)
  • . means a literal . (inside square brackets it needs no escaping)
  • [<multiple_things>] means any one of <multiple_things>
  • + means 1 or more occurrences in a row.

So the regex matches a run of one or more digits or periods, as long as there is nothing else in between them.

# Some examples it would match:
1
.
1.1
1.1.1
11.1.111
1.11111111.111111
1.1.
.1
1....
....
1111
# Examples it would NOT match as a whole (the space or letter splits each into two separate matches):
1 .1
1a.2
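
A quick way to check the lists above is to run the pattern. The sketch below assumes the intended pattern is [\d.]+ and uses a made-up sample string. The stricter variant, \d+(?:\.\d+)*, is not part of this answer; it is shown only as one possible way to skip stray periods.

```python
import re

# Hypothetical sample text, invented for illustration.
sample = "See 1.1 and 1.1.1. Also 1a.2 ... done"

# Loose pattern: any run of digits and/or periods.
loose = re.findall(r"[\d.]+", sample)

# Stricter variant (an assumption, not from the answer):
# digits, optionally followed by ".digits" groups.
strict = re.findall(r"\d+(?:\.\d+)*", sample)

print(loose)   # ['1.1', '1.1.1.', '1', '.2', '...']
print(strict)  # ['1.1', '1.1.1', '1', '2']
```

Note how the loose pattern picks up the sentence-ending period after 1.1.1 and the bare ellipsis, which is usually not what you want for section numbers.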
Answered By: coderkearns

import re

text = '''1.1 Introduction
There are some sentences in here that I want and I want to do other things with them. There could be hundreds of sentences, who cares.
1.1.1 Something Else
This is where we talk about something else in life.
...
5.1.1 Conclusion'''

# Print every run of characters that contains no digit and no period.
for e in re.findall(r'[^\d.]+', text):
    print(e)


 Introduction
There are some sentences in here that I want and I want to do other things with them
 There could be hundreds of sentences, who cares


 Something Else
This is where we talk about something else in life


    
Answered By: LetzerWille

Use re.split to split on the numbers using a regex like the one below.

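^\d+(?:\.\d+)*

This matches one or more digits \d+ followed by zero or more occurrences of the subpattern, period followed by one or more digits (?:\.\d+)*. Because of the ^ anchor, pass re.MULTILINE so that the pattern can match at the start of every line rather than only at the start of the string.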

The items of the resulting list are then the text between the numbers including the text on the header line itself.

If you need the section numbers too, use a capturing pattern in the regex (add parentheses around the above). The list will then have both the section numbers and the text between them. Counting from index 0, even-indexed items are the text between (item 0 is whatever precedes the first number), and odd-indexed items are the section numbers.
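
As a minimal sketch of the re.split approach described above, assuming the whole document is one string and every header sits at the start of a line. The shortened sample text and the names parts, numbers, bodies and sections are illustrative, not from the original answer.

```python
import re

# Shortened sample document, invented for illustration.
text = (
    "1.1 Introduction\n"
    "There are some sentences in here.\n"
    "1.1.1 Something Else\n"
    "This is where we talk about something else.\n"
    "5.1.1 Conclusion"
)

# The capturing group keeps the section numbers in the result;
# re.MULTILINE lets ^ match at the start of every line.
parts = re.split(r"^(\d+(?:\.\d+)*)", text, flags=re.MULTILINE)

# parts[0] is whatever precedes the first number (empty here).
numbers = parts[1::2]                       # odd indices: section numbers
bodies = [p.strip() for p in parts[2::2]]   # even indices: text between
sections = dict(zip(numbers, bodies))

print(sections["1.1"])
```

With this in place, sections["1.1"] holds the header title plus the body up to the next numbered line. One caveat: any body line that happens to start with a digit would also be treated as a header, so the pattern may need tightening for real documents.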

Answered By: kindall

You might use 2 capture groups and a negative lookahead to match all lines not starting with the digits and dot:

^\d+(?:\.\d+)+\b(.*)((?:\n(?!\d+\.\d).*)*)

The pattern matches:

  • ^ Start of string (or of each line when re.M is used)
  • \d+(?:\.\d+)+ Match 1+ digits, and repeat 1+ times a . and 1+ digits
  • \b A word boundary
  • (.*) Capture group 1, match the rest of the line
  • ( Capture group 2
    • (?:\n(?!\d+\.\d).*)* Match a newline and the rest of the line if it does not start with digits and a dot
  • ) Close group 2

Example

import re

pattern = r"^\d+(?:\.\d+)+\b(.*)((?:\n(?!\d+\.\d).*)*)"

s = ("1.1 Introduction\n"
            "There are some sentences in here that I want and I want to do other things with them. There could be hundreds of sentences, who cares.\n"
            "1.1.1 Something Else\n"
            "This is where we talk about something else in life.\n"
            "...\n"
            "5.1.1 Conclusion")

print(re.findall(pattern, s, re.M))

Output

[(' Introduction', '\nThere are some sentences in here that I want and I want to do other things with them. There could be hundreds of sentences, who cares.'), (' Something Else', '\nThis is where we talk about something else in life.\n...'), (' Conclusion', '')]
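
If you already have the section numbers from the TOC, one possible extension is to wrap the number itself in a capture group and look a section up by it. This is only a sketch: the extra group, the shortened sample text and the helper name section_body are assumptions added for illustration, not part of the pattern above.

```python
import re

# Pattern from this answer with one extra group around the section number.
pattern = re.compile(
    r"^(\d+(?:\.\d+)+)\b(.*)((?:\n(?!\d+\.\d).*)*)", re.M
)

# Shortened sample document, invented for illustration.
s = ("1.1 Introduction\n"
     "First body line.\n"
     "1.1.1 Something Else\n"
     "Second body line.\n"
     "...\n"
     "5.1.1 Conclusion")

def section_body(text, number):
    """Return the body text under the given section number, or None."""
    for m in pattern.finditer(text):
        if m.group(1) == number:
            # Group 3 holds the lines that follow the header line.
            return m.group(3).strip()
    return None

print(section_body(s, "1.1"))
```

Comparing group 1 against the TOC number as a plain string avoids having to escape the periods and build a new regex for every section.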
Answered By: The fourth bird