Regex to split on new lines with a pattern

Question:

I am trying to split a string into multiple strings (like observations).

For example, a sample text with 3 "bidder id" observations is:

       BID RANK       BID TOTAL   BIDDER ID         BIDDER INFORMATION  (NAME/ADDRESS/LOCATION)
       --------      -----------  ---------         -------------------------------------------------
           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                          00814766
                                                    P O BOX 883                       FAX 909 386-1288
                                                    COLTON CA  92324

           2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                          00688659
                                                    2230 LEMON AVENUE                 FAX 562 591-7485
                                                    LONG BEACH CA  90806

           3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                          00116307
                                                    P O BOX 1489                      FAX 818 767-3169
                                                    SUN VALLEY CA  91353-1489

The ultimate goal is to create a dataset that mimics this text document. The first step is to split this big string into multiple small strings. For example, the three small strings would look as follows:

Split string 1

           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                          00814766
                                                    P O BOX 883                       FAX 909 386-1288
                                                    COLTON CA  92324

Split string 2

           2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                          00688659
                                                    2230 LEMON AVENUE                 FAX 562 591-7485
                                                    LONG BEACH CA  90806

Split string 3

           3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                          00116307
                                                    P O BOX 1489                      FAX 818 767-3169
                                                    SUN VALLEY CA  91353-1489

I started with the split pattern [\r\n]+\s+, but unfortunately it splits on every newline, not just on the lines that contain no other characters.

Code:

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword

txt = """                   1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                              00814766
                                                        P O BOX 883                       FAX 909 386-1288
                                                        COLTON CA  92324

               2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                              00688659
                                                        2230 LEMON AVENUE                 FAX 562 591-7485
                                                        LONG BEACH CA  90806

               3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                              00116307
                                                        P O BOX 1489                      FAX 818 767-3169
                                                        SUN VALLEY CA  91353-1489"""

# splits at every line break, not only at the blank lines
p = re.split(r"[\r\n]+", txt)

But this splits the text at every newline. Is there a way to split only on blank lines, that is, lines with no other characters in them? Thank you so much!!

P.S. if you think I’m doing something wildly wrong or if there’s a much simpler way to create a dataset – please let me know. Any help is appreciated. Thanks!!

Asked By: Pepa


Answers:

You can try re.findall with this pattern:

(?ms)^\s{,20}\d.*?(?=^\s{,20}\d|\Z)

import re

text = """
       BID RANK       BID TOTAL   BIDDER ID         BIDDER INFORMATION  (NAME/ADDRESS/LOCATION)
       --------      -----------  ---------         -------------------------------------------------
           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                          00814766
                                                    P O BOX 883                       FAX 909 386-1288
                                                    COLTON CA  92324

           2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                          00688659
                                                    2230 LEMON AVENUE                 FAX 562 591-7485
                                                    LONG BEACH CA  90806

           3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                          00116307
                                                    P O BOX 1489                      FAX 818 767-3169
                                                    SUN VALLEY CA  91353-1489"""

groups = re.findall(r"(?ms)^\s{,20}\d.*?(?=^\s{,20}\d|\Z)", text)

for group in groups:
    print(group)
    print('-' * 80)

Prints:

           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                          00814766
                                                    P O BOX 883                       FAX 909 386-1288
                                                    COLTON CA  92324

--------------------------------------------------------------------------------

           2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                          00688659
                                                    2230 LEMON AVENUE                 FAX 562 591-7485
                                                    LONG BEACH CA  90806

--------------------------------------------------------------------------------

           3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                          00116307
                                                    P O BOX 1489                      FAX 818 767-3169
                                                    SUN VALLEY CA  91353-1489
--------------------------------------------------------------------------------
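For readers less familiar with the inline flags, here is a sketch of the same pattern written out with re.VERBOSE and commented piece by piece. The sample text, the PAD constant and the BLOCK name are made up for illustration; only the pattern itself comes from the answer.

```python
import re

PAD = " " * 35  # continuation lines are indented by more than 20 characters

# Shortened, made-up sample in the same layout as the bid listing above.
sample = "\n".join([
    "  RANK   TOTAL",
    "  ----   -----",
    "     1   100.00   FIRST BIDDER",
    PAD + "00000001",
    "",
    "     2   200.00   SECOND BIDDER",
    PAD + "00000002",
])

BLOCK = re.compile(
    r"""
    ^\s{,20}\d        # line start, at most 20 whitespace characters, then a digit
    .*?               # as little as possible (DOTALL lets the dot cross newlines)
    (?=               # stop just before ...
        ^\s{,20}\d    #   ... the next line that starts a block
      | \Z            #   ... or the end of the text
    )
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

blocks = BLOCK.findall(sample)
for block in blocks:
    print(repr(block))
```

The 20-character limit is what keeps the deeply indented continuation lines (such as the licence number) from being mistaken for the start of a new block. Because \s also matches newlines, every block after the first begins with the blank line that precedes it, so calling strip() on each block may be useful.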
Answered By: Andrej Kesely

You can capture those blocks with:

(?=^[ \t]+(?:\d+[ \t]+[\d,.]+[ \t]+\d))([\s\S]+?)(?=(?:^[ \t]+\d+[ \t]+[\d,.]+[ \t]+\d)|\Z)

Or split on the blank lines like this, and deal with the header by popping its 2 lines off before the split:

re.split(r'(?:\r?\n){2,}', s)
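If the "blank" lines in the real file might contain stray spaces or tabs, the split above would miss them. A minimal sketch of a more tolerant variant follows; the helper name split_on_blank_lines and the sample data are assumptions made for illustration.

```python
import re


def split_on_blank_lines(text):
    """Split text into blocks separated by one or more blank lines."""
    # A separator is a line break, then at least one line that is empty or
    # holds only spaces/tabs. The indentation of the next block is kept.
    parts = re.split(r"(?:\r?\n[ \t]*)+\r?\n", text.strip("\r\n"))
    return [part for part in parts if part.strip()]


# Made-up sample mixing \n and \r\n line endings.
sample = (
    "     1   100.00   FIRST BIDDER\n"
    "                  P O BOX 1\n"
    "   \n"  # whitespace-only line
    "     2   200.00   SECOND BIDDER\r\n"
    "                  P O BOX 2\r\n"
    "\r\n"
    "\r\n"
    "     3   300.00   THIRD BIDDER\n"
)

blocks = split_on_blank_lines(sample)
for block in blocks:
    print(block)
    print("-------")
```

A single line break followed by an indented line does not count as a separator, so the continuation lines stay attached to their block. The header lines still have to be removed before splitting.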

Python demo:

s='''\
BID RANK       BID TOTAL   BIDDER ID         BIDDER INFORMATION  (NAME/ADDRESS/LOCATION)
--------      -----------  ---------         -------------------------------------------------
        1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                                                                                                    00814766
                                                                                        P O BOX 883                       FAX 909 386-1288
                                                                                        COLTON CA  92324

        2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                                                                                                    00688659
                                                                                        2230 LEMON AVENUE                 FAX 562 591-7485
                                                                                        LONG BEACH CA  90806

        3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                                                                                                    00116307
                                                                                        P O BOX 1489                      FAX 818 767-3169
                                                                                        SUN VALLEY CA  91353-1489'''

import re

# with re.split (the two header lines are dropped first):
print(
    "\n-------\n".join(re.split(r'(?:\r?\n){2,}', "\n".join(s.splitlines()[2:])))
)

# with re.findall:

print(
    "\n-------\n".join(re.findall(r'(?=^[ \t]+(?:\d+[ \t]+[\d,.]+[ \t]))([\s\S]+?)(?=(?:^[ \t]+\d+[ \t]+[\d,.]+[ \t])|\Z)', s, flags=re.M))
)

Both methods print the same blocks (the re.findall version additionally keeps the blank line that ends each block):

        1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                                                                                                    00814766
                                                                                        P O BOX 883                       FAX 909 386-1288
                                                                                        COLTON CA  92324
-------
        2         1,534,243.00    3              EXCEL PAVING COMPANY                  562 599-5841  SB PREF CLAIMED
                                                                                                                                                                    00688659
                                                                                        2230 LEMON AVENUE                 FAX 562 591-7485
                                                                                        LONG BEACH CA  90806
-------
        3         1,593,549.40    2              SECURITY PAVING COMPANY INC           818 767-8418  CC PREF CLAIMED
                                                                                                                                                                    00116307
                                                                                        P O BOX 1489                      FAX 818 767-3169
                                                                                        SUN VALLEY CA  91353-1489
Answered By: dawg
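
Once the text is split into blocks, the next step toward the dataset mentioned in the question could look roughly like the sketch below. It only parses the first line of a block; the field names, the FIRST_LINE pattern and the parse_block helper are assumptions for illustration, and the rule that the bidder name ends at the first run of two or more spaces is a guess based on the sample layout.

```python
import re

# Hypothetical pattern for the first line of a block:
# rank, bid total, bidder id, then the name up to a run of 2+ spaces.
FIRST_LINE = re.compile(
    r"^\s*(?P<rank>\d+)\s+(?P<total>[\d,]+\.\d+)\s+(?P<bidder_id>\d+)\s+"
    r"(?P<name>.+?)(?:\s{2,}|$)"
)


def parse_block(block):
    """Return a dict for one bidder block, or None if it does not match."""
    lines = [line for line in block.splitlines() if line.strip()]
    if not lines:
        return None
    match = FIRST_LINE.match(lines[0])
    if match is None:
        return None
    return {
        "rank": int(match["rank"]),
        "total": float(match["total"].replace(",", "")),
        "bidder_id": int(match["bidder_id"]),
        "name": match["name"],
    }


block = (
    "           1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED\n"
    "                                                    P O BOX 883                       FAX 909 386-1288\n"
)
record = parse_block(block)
print(record)
```

A list of such dicts, one per block, can be passed to pandas.DataFrame to build the table. The address, phone and licence fields would need their own patterns or fixed column offsets.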