Regex to split on new lines with a pattern
Question:
I am trying to split a string into multiple strings (like observations).
For example, a sample text with 3 "bidder id" observations is:
BID RANK  BID TOTAL     BIDDER ID  BIDDER INFORMATION (NAME/ADDRESS/LOCATION)
--------  -----------   ---------  -------------------------------------------------
    1     1,486,399.87      5      ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
                                   00814766
                                   P O BOX 883 FAX 909 386-1288
                                   COLTON CA 92324

    2     1,534,243.00      3      EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
                                   00688659
                                   2230 LEMON AVENUE FAX 562 591-7485
                                   LONG BEACH CA 90806

    3     1,593,549.40      2      SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
                                   00116307
                                   P O BOX 1489 FAX 818 767-3169
                                   SUN VALLEY CA 91353-1489
The ultimate goal is to create a dataset that mimics this text document. The first step is to split this big string into multiple small strings. For example, the three small strings would look as follows:
Split string 1
    1     1,486,399.87      5      ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
                                   00814766
                                   P O BOX 883 FAX 909 386-1288
                                   COLTON CA 92324
Split string 2
    2     1,534,243.00      3      EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
                                   00688659
                                   2230 LEMON AVENUE FAX 562 591-7485
                                   LONG BEACH CA 90806
Split string 3
    3     1,593,549.40      2      SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
                                   00116307
                                   P O BOX 1489 FAX 818 767-3169
                                   SUN VALLEY CA 91353-1489
I started with the split pattern [\r\n]+\s+, but unfortunately it splits on every newline, not just on lines with no other characters in them.
Code:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
txt = """    1     1,486,399.87      5      ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
                                   00814766
                                   P O BOX 883 FAX 909 386-1288
                                   COLTON CA 92324

    2     1,534,243.00      3      EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
                                   00688659
                                   2230 LEMON AVENUE FAX 562 591-7485
                                   LONG BEACH CA 90806

    3     1,593,549.40      2      SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
                                   00116307
                                   P O BOX 1489 FAX 818 767-3169
                                   SUN VALLEY CA 91353-1489"""
p = re.split("[\r\n]+", txt)
But this splits the text at every newline. Is there a way to split only on a newline that has no other characters on its line, i.e. a blank line? Thank you so much!!
P.S. if you think I’m doing something wildly wrong or if there’s a much simpler way to create a dataset – please let me know. Any help is appreciated. Thanks!!
Answers:
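You can try re.findall with this pattern (regex101):
(?ms)^\s{,20}\d.*?(?=^\s{,20}\d|\Z)
A record starts at a line that begins with at most 20 whitespace characters followed by a digit, and runs until the next such line or the end of the text.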
import re
text = """
BID RANK  BID TOTAL     BIDDER ID  BIDDER INFORMATION (NAME/ADDRESS/LOCATION)
--------  -----------   ---------  -------------------------------------------------
    1     1,486,399.87      5      ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
                                   00814766
                                   P O BOX 883 FAX 909 386-1288
                                   COLTON CA 92324

    2     1,534,243.00      3      EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
                                   00688659
                                   2230 LEMON AVENUE FAX 562 591-7485
                                   LONG BEACH CA 90806

    3     1,593,549.40      2      SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
                                   00116307
                                   P O BOX 1489 FAX 818 767-3169
                                   SUN VALLEY CA 91353-1489"""
# each match runs from one record's first line up to the next record's first line
groups = re.findall(r"(?ms)^\s{,20}\d.*?(?=^\s{,20}\d|\Z)", text)
for group in groups:
    print(group)
    print('-' * 80)
Prints:
    1     1,486,399.87      5      ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
                                   00814766
                                   P O BOX 883 FAX 909 386-1288
                                   COLTON CA 92324

--------------------------------------------------------------------------------

    2     1,534,243.00      3      EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
                                   00688659
                                   2230 LEMON AVENUE FAX 562 591-7485
                                   LONG BEACH CA 90806

--------------------------------------------------------------------------------

    3     1,593,549.40      2      SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
                                   00116307
                                   P O BOX 1489 FAX 818 767-3169
                                   SUN VALLEY CA 91353-1489
--------------------------------------------------------------------------------
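To make the pattern above easier to read, here is a sketch of the same expression written with re.VERBOSE so each part can carry a comment. The sample text, the PAD constant and the split_records helper are made up for illustration. It assumes, like the pattern itself, that a record's first line is indented by at most 20 whitespace characters and that continuation lines are indented by more.

```python
import re

# Same expression as the one-line pattern, with the (?ms) flags passed explicitly.
RECORD = re.compile(
    r"""
    ^\s{,20}\d          # line start, at most 20 whitespace chars, then a digit
    .*?                 # as little text as possible (DOTALL lets . cross newlines)
    (?=^\s{,20}\d|\Z)   # stop just before the next record start or at end of text
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Hypothetical miniature of the report: continuation lines are indented by 25.
PAD = " " * 25
sample = "\n".join([
    "  1   100.00   5   FIRST COMPANY",
    PAD + "00000001",
    PAD + "12 MAIN STREET",   # starts with digits, but is indented too far to count
    "",
    "  2   200.00   3   SECOND COMPANY",
    PAD + "00000002",
])


def split_records(text):
    """Return every record as one string, with outer whitespace removed."""
    return [m.group(0).strip() for m in RECORD.finditer(text)]


records = split_records(sample)
print(len(records))
for record in records:
    print(record)
    print("-" * 40)
```

The strip() call is only there because each match after the first begins with the blank separator line; drop it if the original indentation of the first line matters.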
You can capture those blocks with:
(?=^[ \t]+(?:\d+[ \t]+[\d,.]+[ \t]+\d))([\s\S]+?)(?=(?:^[ \t]+\d+[ \t]+[\d,.]+[ \t]+\d)|\Z)
Or split on blank lines like this, dealing with the header by popping 2 lines off before the split:
re.split(r'(?:\r?\n){2}', s)
Python demo:
import re

s = '''\
BID RANK  BID TOTAL     BIDDER ID  BIDDER INFORMATION (NAME/ADDRESS/LOCATION)
--------  -----------   ---------  -------------------------------------------------
    1     1,486,399.87      5      ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
                                   00814766
                                   P O BOX 883 FAX 909 386-1288
                                   COLTON CA 92324

    2     1,534,243.00      3      EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
                                   00688659
                                   2230 LEMON AVENUE FAX 562 591-7485
                                   LONG BEACH CA 90806

    3     1,593,549.40      2      SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
                                   00116307
                                   P O BOX 1489 FAX 818 767-3169
                                   SUN VALLEY CA 91353-1489'''
# with re.split: drop the two header lines, then split on blank lines
print(
    "\n-------\n".join(re.split(r'(?:\r?\n){2,}', "\n".join(s.splitlines()[2:])))
)
# with re.findall: capture from each record's first line up to the next one
print(
    "\n-------\n".join(re.findall(r'(?=^[ \t]+(?:\d+[ \t]+[\d,.]+[ \t]))([\s\S]+?)(?=(?:^[ \t]+\d+[ \t]+[\d,.]+[ \t])|\Z)', s, flags=re.M))
)
Both methods print the same three blocks (the re.findall version also keeps the blank lines that trail each block):
    1     1,486,399.87      5      ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
                                   00814766
                                   P O BOX 883 FAX 909 386-1288
                                   COLTON CA 92324
-------
    2     1,534,243.00      3      EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
                                   00688659
                                   2230 LEMON AVENUE FAX 562 591-7485
                                   LONG BEACH CA 90806
-------
    3     1,593,549.40      2      SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
                                   00116307
                                   P O BOX 1489 FAX 818 767-3169
                                   SUN VALLEY CA 91353-1489
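On the P.S. about building a dataset: once the text is split into blocks, each block can be turned into a row. The following is only a sketch. It assumes the first line of a block always holds the rank, the total and the bidder id, in that order, followed by free text, and that the remaining lines all belong to the bidder information. The parse_block function and the dictionary keys are names invented for this example.

```python
import re

# Assumed shape of a record's first line: rank, total, bidder id, then free text.
FIRST_LINE = re.compile(r"^\s*(\d+)\s+([\d,]+\.\d{2})\s+(\d+)\s+(.*)$")


def parse_block(block):
    """Turn one record block into a dict (the field names are illustrative)."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    match = FIRST_LINE.match(lines[0])
    if match is None:
        raise ValueError(f"unexpected first line: {lines[0]!r}")
    rank, total, bidder_id, first_info = match.groups()
    return {
        "bid_rank": int(rank),
        "bid_total": float(total.replace(",", "")),  # drop thousands separators
        "bidder_id": int(bidder_id),
        "info": [first_info] + lines[1:],            # name, licence no., address lines
    }


block = """    1     1,486,399.87      5      ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
                                   00814766
                                   P O BOX 883 FAX 909 386-1288
                                   COLTON CA 92324"""

row = parse_block(block)
print(row["bid_rank"], row["bid_total"], row["bidder_id"])
print(row["info"])
```

A list of such dicts, one per block, can be passed to pandas.DataFrame to get a table. Splitting the info lines further (phone, fax, preference flag) depends on how regular the real report is.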