Regex python: Extracting text from a table

Question:

We have a table (in a .doc document) that looks as follows:

item_number item_code description unit QUANTITY BID AMOUNT
1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
2 074017 PREPARE WATER POLLUTION CONTROL PROGRAM LS 30 125.38 3,761.40
3 840521 4" THERMOPLASTIC TRAFFIC STRIPE (BROKEN 6-1) SQFT LUMP SUM .19 32.30

I imported the text in the .doc file and am now using regex to extract the table. When imported, the table text looks as follows:

   1     074016      CONSTRUCTION SITE MANAGEMENT                   LS     LUMP SUM           24,826.49        24,826.49
   2     074017      PREPARE WATER POLLUTION CONTROL                LS     LUMP SUM              708.63           708.63
                     PROGRAM
   3     074038      TEMPORARY DRAINAGE INLET PROTECTION            EA           30              125.38         3,761.40
   4     074041      STREET SWEEPING                                LS     LUMP SUM           10,379.25        10,379.25
   5     120090      CONSTRUCTION AREA SIGNS                        LS     LUMP SUM            9,880.75         9,880.75
   6     120100      TRAFFIC CONTROL SYSTEM                         LS     LUMP SUM           10,932.61        10,932.61
   7     152440      ADJUST MANHOLE TO GRADE                        EA          110              453.42        49,876.20
   8     153103      COLD PLANE ASPHALT CONCRETE PAVEMENT           SQYD    143,000                1.37       195,910.00

I am trying to create a pattern that can capture different variable values in different regex groups. Right now, the pattern I have is ^(s{6}|s{7})(d+)s+(d+)s+(([A-Z.]{2}[^nd]*[A-Z)]s{2})). But, it also captures LS and LUMP SUM in the third group (description).

Code:

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools

text = '... 
074016      CONSTRUCTION SITE MANAGEMENT                   LS     LUMP SUM           24,826.49        24,826.49
074017      PREPARE WATER POLLUTION CONTROL                LS     LUMP SUM              708.63           708.63
                         PROGRAM
074038      TEMPORARY DRAINAGE INLET PROTECTION            EA           30              125.38         3,761.40
074041      STREET SWEEPING                                LS     LUMP SUM           10,379.25        10,379.25
120090      CONSTRUCTION AREA SIGNS                        LS     LUMP SUM            9,880.75         9,880.75
120100      TRAFFIC CONTROL SYSTEM                         LS     LUMP SUM           10,932.61        10,932.61
152440      ADJUST MANHOLE TO GRADE                        EA          110              453.42        49,876.20
153103      COLD PLANE ASPHALT CONCRETE PAVEMENT           SQYD    143,000                1.37       195,910.00
015299      LEAD COMPLIANCE PLAN (STRIPE REMOVAL)          LS     LUMP SUM              828.25           828.25
374002      ASPHALTIC EMULSION (FOG SEAL COAT)             TON          18            1,013.60        18,244.80
390095      REPLACE ASPHALT CONCRETE SURFACING             CY          160              277.89        44,462.40
390137      RUBBERIZED HOT MIX ASPHALT (GAP GRADED)        TON       9,650              101.05       975,132.50
394050      RUMBLE STRIP                                   STA         180               26.38         4,748.40
015300      REPLACE AIR MARKER                             EA           50              139.27         6,963.50
840504      4" THERMOPLASTIC TRAFFIC STRIPE                LF      146,000                 .36        52,560.00'
# creating a list: each item is a row from the dataset
text = re.split(r'(?ms)ns+d+', text)

Is there a way to capture different variables in different groups? Any help would be appreciated. Thank you!

Edit 1:
The code misses datasets that look as follows:

   1     074016      CONSTRUCTION SITE MANAGEMENT                   LS     LUMP SUM              240.00           240.00
   2     074019      PREPARE STORM WATER POLLUTION                  LS     LUMP SUM            2,300.00         2,300.00
                     PREVENTION PLAN
   3     074038      TEMPORARY DRAINAGE INLET PROTECTION            EA           12              240.00         2,880.00
   4     074041      STREET SWEEPING                                LS     LUMP SUM            1,700.00         1,700.00
   5     074042      TEMPORARY CONCRETE WASHOUT (PORTABLE)          LS     LUMP SUM              370.00           370.00
   6     120090      CONSTRUCTION AREA SIGNS                        LS     LUMP SUM            7,100.00         7,100.00
   7     120100      TRAFFIC CONTROL SYSTEM                         LS     LUMP SUM           35,900.00        35,900.00
   8     120165      CHANNELIZER (SURFACE MOUNTED)                  EA           40               20.00           800.00
   9     128650      PORTABLE CHANGEABLE MESSAGE SIGN               EA            4            2,200.00         8,800.00
  10     129000      TEMPORARY RAILING (TYPE K)                     LF          960               27.50        26,400.00
  11     129100      TEMPORARY CRASH CUSHION MODULE                 EA           56              127.00         7,112.00
  12     150662      REMOVE METAL BEAM GUARD RAILING                LF        1,390                3.00         4,170.00
  13     153210      REMOVE CONCRETE                                CY            2              660.00         1,320.00
  14     015310      REMOVE BRIDGE APPROACH GUARD RAILING           LF          200                6.30         1,260.00
  15     156585      REMOVE CRASH CUSHION                           EA            1              300.00           300.00
  16     160101      CLEARING AND GRUBBING                          LS     LUMP SUM            2,500.00         2,500.00
  17     190110      LEAD COMPLIANCE PLAN                           LS     LUMP SUM              850.00           850.00
  18 (F) 510502      MINOR CONCRETE (MINOR STRUCTURE)               CY            4            2,740.00        10,960.00
  19     820118      GUARD RAILING DELINEATOR                       EA           12               15.00           180.00
  20     839303      SINGLE THRIE BEAM BARRIER (STEEL POST)         LF        3,630               22.00        79,860.00

I suspect this could be due to the (F) there. Is there a way to tackle this? Thank you so much!

Edit 2:

#import sys
#sys.modules[__name__].__dict__.clear()

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools
from io import StringIO

# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc-test')

text = textract.process('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc/081204R0.doc_133.doc')
text = text.decode("utf-8")

# splitting by contract number
nob = text.split('BID OPENING DATE')
del nob[0]

# create a dataframe to store the data
# dff = pd.DataFrame(columns = ['contract_number', 'item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT'])
dff = pd.DataFrame(columns = ['0', '1', '2', '3', '4', '5', '6', '7'])

# file level loop starts here
dataframes = dict() 
for i in range(len(nob)):

    try:

        txt = nob[i]

        # contract number
        cn1 = re.search('CONTRACT NUMBERs+(.........)', txt)
        cn2 = re.search('CONTRACT NUMBERn+(.*)', txt)

        if not (cn1 is None):
            cn = cn1.group(1)
        elif not (cn2 is None):
            cn = cn2.group(1)
        else:
            cn = "Not captured"

        # getting the contract proposal of low bidder table
        hi = re.split('(?ms)C O N T R A C Ts+P R O P O S A Ls+O Fs+L O Ws+B I D D E R(.*?)S U M M A R Y', txt)
        hi = hi[1]

        # splitting again
        hi = re.split('(?ms)---(s+n.*?s*)TOTAL', hi)
        hi = hi[1]
        hi.replace('(F)', ' ')
        hi.replace('(S)', ' ')

        df = pd.read_fwf(StringIO(hi), header=None, dtype=str)

        # df.columns = ['item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT']

        df['item_number'] = df['item_number'].ffill()
        df = df.fillna('').groupby('item_number').agg(lambda x: ' '.join(x).strip()).reset_index()

        df['contract_number'] = cn

        i = df.columns.get_loc('item_number')

        # dff = dff.append(df, ignore_index = True)
        dataset = pd.read_csv('dataset.csv')
        dataframes['ok'] = dataset
        dff = dff.append(df, ignore_index = True)

    except Exception as e:
        print(e)
        print('Error in contract number: ' + cn)
# print(dff)
Asked By: Pepa

||

Answers:

To join correct the table you can try next example:

from io import StringIO

txt = '''
   1     074016      CONSTRUCTION SITE MANAGEMENT                   LS     LUMP SUM           24,826.49        24,826.49
   2     074017      PREPARE WATER POLLUTION CONTROL                LS     LUMP SUM              708.63           708.63
                     PROGRAM
   3     074038      TEMPORARY DRAINAGE INLET PROTECTION            EA           30              125.38         3,761.40
   4     074041      STREET SWEEPING                                LS     LUMP SUM           10,379.25        10,379.25
   5     120090      CONSTRUCTION AREA SIGNS                        LS     LUMP SUM            9,880.75         9,880.75
   6     120100      TRAFFIC CONTROL SYSTEM                         LS     LUMP SUM           10,932.61        10,932.61
   7     152440      ADJUST MANHOLE TO GRADE                        EA          110              453.42        49,876.20
   8     153103      COLD PLANE ASPHALT CONCRETE PAVEMENT           SQYD    143,000                1.37       195,910.00'''

df = pd.read_fwf(StringIO(txt), header=None, dtype=str)
df.columns = ['item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT']

df['item_number'] = df['item_number'].ffill()
df = df.fillna('').groupby('item_number').agg(lambda x: ' '.join(x).strip()).reset_index()

print(df)

Prints:

  item_number item_code                              description  unit  QUANTITY        BID      AMOUNT
0           1    074016             CONSTRUCTION SITE MANAGEMENT    LS  LUMP SUM  24,826.49   24,826.49
1           2    074017  PREPARE WATER POLLUTION CONTROL PROGRAM    LS  LUMP SUM     708.63      708.63
2           3    074038      TEMPORARY DRAINAGE INLET PROTECTION    EA        30     125.38    3,761.40
3           4    074041                          STREET SWEEPING    LS  LUMP SUM  10,379.25   10,379.25
4           5    120090                  CONSTRUCTION AREA SIGNS    LS  LUMP SUM   9,880.75    9,880.75
5           6    120100                   TRAFFIC CONTROL SYSTEM    LS  LUMP SUM  10,932.61   10,932.61
6           7    152440                  ADJUST MANHOLE TO GRADE    EA       110     453.42   49,876.20
7           8    153103     COLD PLANE ASPHALT CONCRETE PAVEMENT  SQYD   143,000       1.37  195,910.00
Answered By: Andrej Kesely
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.