Python regex: get each number and the paragraph between numbers
Question:
I have a string like below.
10. Title text
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
11. Title text
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
12. Title text
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
13. Title text
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2
What I want to do is to separate the title and content in chunks and put them in a list.
result = [10. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2, 11. Title text\ntext1text2text1text2text1text2text1text2text1text2text1text2text1 text2text1 text2text1 text2text1 text2text1text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2, ........]
I’ve tried this, but honestly I have no idea what to do. help
la_text = []
num = 1
for a in range(3):
    sepa = re.findall(r"\d*(.*)\d*", text)[num]
    la_text.append(sepa)
    num += 1
print(la_text)
Answers:
If s contains your string from the question, you can do:
import re
pat = re.compile(r"^(\d+\.\s+.*?)(?=\n^\d+\.|\Z)", flags=re.M | re.S)
print(pat.findall(s))
Prints:
[
"10. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n",
"11. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n",
"12. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n",
"13. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n",
]
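For anyone who wants to try this without the full text, here is a minimal sketch on a shortened sample. The string s below is made up for illustration and is not the asker's data. It relies on re.M letting ^ match at every line start and on re.S letting . cross newlines, so the lazy .*? grows only until the lookahead sees the next numbered heading or the end of the string.

```python
import re

# Hypothetical, shortened stand-in for the question's text.
s = (
    "10. Title text\nbody a\n\nbody b\n"
    "11. Title text\nbody c\n"
    "12. Title text\nbody d\n"
)

# re.M: ^ matches at the start of every line.
# re.S: . also matches newlines, so .*? can span several lines.
# The lookahead stops the lazy match right before "\n<digits>." or at the end.
pat = re.compile(r"^(\d+\.\s+.*?)(?=\n^\d+\.|\Z)", flags=re.M | re.S)

chunks = pat.findall(s)
for chunk in chunks:
    print(repr(chunk))
```

Note that the newline in front of the next heading is left out of each chunk, while the last chunk keeps whatever trails at the end of the string. A body line that itself starts with digits and a period (for example "3.14 is ...") would be treated as a new heading, so this assumes such lines do not occur.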
To get the title and paragraph separated, you can use a negative lookahead with only the multiline flag re.M.
re.findall then returns a list of tuples, each holding 2 values, one per capture group:
^(\d+\..*)((?:\n(?!\d+\.).*)*)
See a regex demo.
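As a rough illustration of the tuple form, the sketch below runs that pattern on a small made-up string (the sample text and the names pairs and sections are assumptions for the example). Group 1 is the heading line, group 2 is every following line that does not start with digits and a period.

```python
import re

# Hypothetical sample, not the asker's data.
s = "10. Title A\nline 1\n\nline 2\n11. Title B\nline 3\n"

pattern = r"^(\d+\..*)((?:\n(?!\d+\.).*)*)"

# Each match is a (title, body) tuple; the body starts with the newline
# that follows the title line.
pairs = re.findall(pattern, s, re.M)

# Optional cleanup: drop leading/trailing whitespace from the body.
sections = [(title, body.strip()) for title, body in pairs]
print(sections)
```

Because no re.S flag is used here, .* never crosses a newline, so the title group always stays on one line.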
To get them together as a single match:
^\d+\..*(?:\n(?!\d+\.).*)*
See another regex demo.
import re
pattern = r"^\d+\..*(?:\n(?!\d+\.).*)*"
s = "...."
print(re.findall(pattern, s, re.M))
Output
[
'10. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n',
'11. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n',
'12. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n',
'13. Title text\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2'
]
In case you need a dictionary with numbers as keys:
import re
m = re.findall(r'^(\d+)\.(.*)((?:\n(?!\d+\.).*)*)', s, re.M)
{element[0]: [element[1], element[2]] for element in m}
Output:
{'10': [' Title text',
'\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n'],
'11': [' Title text',
'\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n'],
'12': [' Title text',
'\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n'],
'13': [' Title text',
'\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n\n\ntext1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2text1 text2\n']}
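If the leading space in the title and the newlines around the body get in the way, one possible variation is to strip them and use integer keys. This is only a sketch on a made-up sample; the names by_number, "title" and "body" are chosen for the example and are not part of the answer above.

```python
import re

# Hypothetical sample, not the asker's data.
s = "10. Title A\nline 1\n11. Title B\nline 2\n\nline 3\n"

pattern = re.compile(r"^(\d+)\.(.*)((?:\n(?!\d+\.).*)*)", re.M)

# findall yields (number, title, body) tuples, one per numbered section.
by_number = {
    int(num): {"title": title.strip(), "body": body.strip()}
    for num, title, body in pattern.findall(s)
}
print(by_number[11]["body"])
```

Keep in mind that a dictionary holds one entry per key, so if the same number appears twice in the text, the later section replaces the earlier one.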
If you don’t need to separate titles from paragraphs, another idea is to use re.split
re.split(r"\n\s*(?=\d+\.)", test_str)
See this demo at regex101 or a Python demo at tio.run
\n\s* splits at a newline and any amount of whitespace
(?=\d+\.) only if followed by one or more digits and a period
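A quick way to check the split is to run it on a short sample. The string test_str below is invented for the example. Since the lookahead consumes nothing, each heading stays attached to the chunk it starts, and the separating newlines (including blank lines before a heading) are dropped.

```python
import re

# Hypothetical sample, not the asker's data.
test_str = (
    "10. Title A\nline 1\n\n"
    "11. Title B\nline 2\n"
    "12. Title C\nline 3"
)

# Split on a newline plus optional whitespace, but only where a
# "<digits>." heading follows; the heading itself is not consumed.
parts = re.split(r"\n\s*(?=\d+\.)", test_str)
for part in parts:
    print(repr(part))
```

Because \s* is part of the separator, any indentation in front of a heading is removed as well, which may or may not be what you want.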