python regex not capturing variables, but regex working
Question:
I am trying to create a data frame with the variables bidder_rank, bidder_id, bid_total, and bidder_info. I have created a regex pattern, which seems to work on regex101. However, the Python script has been breaking for a reason I cannot figure out.
# imports
import os
import pandas as pd
import re
# text
texty = '''
1 A) $11,644,939.00 VC0000007181 S.T. RHOADES CONSTRUCTION, INC. Phone (530)223-9322
B) 210 Days * 10000 8585 COMMERCIAL WAY CSLB# 00930684
A+B) $13,744,939.00 REDDING CA 96002
2 A) $12,561,053.00 VC0000007021 GR SUNDBERG, INC. Phone (707)825-6565
B) 210 Days * 10000 5211 BOYD ROAD CSLB# 00732695
A+B) $14,661,053.00 ARCATA CA 95521 Fax (707)825-6563
3 A) $13,098,288.00 VC1800001127 CALIFORNIA HIGHWAY CONSTRUCTION GROUP, Phone (925)766-7014
INC.
B) 210 Days * 10000 1647 WILLOW PASS ROAD CSLB# 01027700
A+B) $15,198,288.00 CONCORD CA 94520 Fax (925)265-9101
4 A) $13,661,954.26 VC0000003985 MERCER FRASER COMPANY Phone (707)443-6371
B) 210 Days * 10000 200 DINSMORE DR CSLB# 00105709
A+B) $15,761,954.26 FORTUNA CA 95540 Fax (707)443-0277
Bid Opening Date: 11/15/2022 Page 2
Contract Number: 01-0H20U4 11/15/2022
5 A) $15,396,278.00 VC0000000213 GRANITE CONSTRUCTION COMPANY Phone (831)728-7561
B) 210 Days * 10000 585 W BEACH STREET CSLB# 00000089
A+B) $17,496,278.00 WATSONVILLE CA 95076
Bid Opening Date: 11/15/2022 Page 3
Contract Number: 01-0H20U4 11/15/2022
'''
lines = re.split(r'(?=^\d)', texty, flags=re.MULTILINE)
# list of bids
bids = []
# loop through each line in the bid rank bid ID data table
for i in (0, len(lines)-1):
    l = lines[i]
    ok = re.findall(r"(?ms)(^\d+)\s*(.*)(VC\d+)\s+(.*)([\s\S]*?)(A\+B\)\s+(\$\d{1,3}(,\d{3})*(\.\d+)?))", str(l))
    # skip this chunk if nothing matched
    if len(ok) == 0:
        continue
    else:
        ok = ok[0]
    # first group is bid_rank, third group is bid_id, fourth group is bidder_info, seventh group is bid_total
    bidder_rank = ok[0]
    bidder_id = ok[2]
    bidder_info = ok[3]
    bid_total = ok[6]
    # create a tuple of the bid rank, bid ID, bidder info, and bid total
    bid_tuple = (bidder_rank, bidder_id, bidder_info, bid_total)
    # append the tuple to the list of bids
    bids.append(bid_tuple)
    print(bid_tuple)
# create a dataframe of the bids
biddf = pd.DataFrame(bids, columns=['bidder_rank', 'bidder_id', 'bidder_info', 'bid_total'])
print(biddf)
After digging, it seems that it only works for the bidder with rank 5.
>>> print(biddf)
bidder_rank bidder_id bidder_info bid_total
0 5 VC0000000213 GRANITE CONSTRUCTION COMPANY Phone (831)728-... $17,496,278.00
But, according to regex101, it should work for all the bidder IDs. Am I missing something?
Answers:
There are a few things to change in your code. First, your for loop iterates over the tuple (0, len(lines)-1), which means it only checks the first and last items in lines; use range(len(lines)) to visit every item. Second, your regex pattern is more complicated than it needs to be. Third, the version below splits the input into individual lines with splitlines() and matches only the first line of each bid, instead of splitting with a lookahead.
import pandas as pd
import re
text = '''...''' # Your input text here
lines = text.splitlines()
bids = []
pattern = r"(?ms)^\s*(\d+)\s+A\)\s+(\$\d{1,3}(?:,\d{3})*(\.\d+)?)\s+(VC\d+)\s+([^\n]+)"
for i in range(len(lines)):
    l = lines[i]
    ok = re.findall(pattern, str(l))
    if len(ok) == 0:
        continue
    else:
        ok = ok[0]
    bidder_rank = ok[0]
    bidder_id = ok[3]
    bidder_info = ok[4]
    # note: this is the amount on the A) line, not the A+B) total
    bid_total = ok[1]
    bid_tuple = (bidder_rank, bidder_id, bidder_info, bid_total)
    bids.append(bid_tuple)
    print(bid_tuple)
biddf = pd.DataFrame(bids, columns=['bidder_rank', 'bidder_id', 'bidder_info', 'bid_total'])
print(biddf)
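To make the loop issue concrete, here is a minimal sketch. The chunks list is a made-up stand-in for the result of the split, not the real data; it only shows which indices each loop form visits.

```python
# Hypothetical stand-in for the list produced by the split.
chunks = ["header", "bid 1", "bid 2", "bid 3"]

# A tuple literal is iterated element by element, so only
# index 0 and index len(chunks) - 1 are ever visited.
visited_by_tuple = [chunks[i] for i in (0, len(chunks) - 1)]

# range() yields every index from 0 up to len(chunks) - 1.
visited_by_range = [chunks[i] for i in range(len(chunks))]

print(visited_by_tuple)  # ['header', 'bid 3']
print(visited_by_range)  # ['header', 'bid 1', 'bid 2', 'bid 3']
```

This matches the symptom in the question: the first chunk holds no bid and the last chunk is bid 5, so only bid 5 ends up in the data frame.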
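Your regex works fine for me when applied to the whole text, as long as you don't use the s flag (with it, . also matches newlines, so the greedy .* can run across several bids). I have modified it slightly to remove unnecessary capture groups and to turn the groups that are only needed for grouping into non-capturing groups, which removes them from the output.
(^\d+)\s*.*?(VC\d+)\s+(.*)(?:[\s\S]*?)A\+B\)\s+(\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)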
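As a rough illustration of why the s flag matters when the pattern runs over the whole text, the sketch below applies the same pattern with and without re.DOTALL. The two-bid sample is a shortened copy of the question's data (phone and CSLB fields dropped), and the variable names are only for this example.

```python
import re

# Shortened two-bid sample based on the text in the question.
sample = (
    "1 A) $11,644,939.00 VC0000007181 S.T. RHOADES CONSTRUCTION, INC.\n"
    "B) 210 Days * 10000 8585 COMMERCIAL WAY\n"
    "A+B) $13,744,939.00 REDDING CA 96002\n"
    "2 A) $12,561,053.00 VC0000007021 GR SUNDBERG, INC.\n"
    "B) 210 Days * 10000 5211 BOYD ROAD\n"
    "A+B) $14,661,053.00 ARCATA CA 95521\n"
)

body = r"(^\d+)\s*.*?(VC\d+)\s+(.*)(?:[\s\S]*?)A\+B\)\s+(\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)"

# Without DOTALL, (.*) stops at the end of the line: one match per bid.
multiline_only = re.findall(body, sample, flags=re.MULTILINE)

# With DOTALL, the greedy (.*) runs to the end of the text and backtracks
# to the last A+B) line, so rank 1 is paired with the last bid's total.
with_dotall = re.findall(body, sample, flags=re.MULTILINE | re.DOTALL)

print(len(multiline_only), [m[3] for m in multiline_only])
print(len(with_dotall), with_dotall[0][0], with_dotall[0][3])
```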
You can then apply re.findall to the entire text and use that output directly in a call to pd.DataFrame:
biddf = pd.DataFrame(
    re.findall(r'(?m)(^\d+)\s*.*?(VC\d+)\s+(.*)(?:[\s\S]*?)A\+B\)\s+(\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)', text),
    columns=['bidder_rank', 'bidder_id', 'bidder_info', 'bid_total']
)
Output:
bidder_rank bidder_id bidder_info bid_total
0 1 VC0000007181 S.T. RHOADES CONSTRUCTION, INC. ... $13,744,939.00
1 2 VC0000007021 GR SUNDBERG, INC. ... $14,661,053.00
2 3 VC1800001127 CALIFORNIA HIGHWAY CONSTRUCTION GROUP, ... $15,198,288.00
3 4 VC0000003985 MERCER FRASER COMPANY ... $15,761,954.26
4 5 VC0000000213 GRANITE CONSTRUCTION COMPANY Phone (831)728-... $17,496,278.00
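If the index-based tuples become hard to follow, one possible variation is to use named groups with re.VERBOSE and convert the total to a number. This is only a sketch under assumptions: BID_PATTERN, rows, and the two-bid sample are names invented for this example, and the pattern is a rewrite in the same spirit, not the exact regex from the answers.

```python
import re
from decimal import Decimal

# Two-bid sample copied from the question's text.
sample = (
    "1 A) $11,644,939.00 VC0000007181 S.T. RHOADES CONSTRUCTION, INC. Phone (530)223-9322\n"
    "B) 210 Days * 10000 8585 COMMERCIAL WAY CSLB# 00930684\n"
    "A+B) $13,744,939.00 REDDING CA 96002\n"
    "2 A) $12,561,053.00 VC0000007021 GR SUNDBERG, INC. Phone (707)825-6565\n"
    "B) 210 Days * 10000 5211 BOYD ROAD CSLB# 00732695\n"
    "A+B) $14,661,053.00 ARCATA CA 95521 Fax (707)825-6563\n"
)

# Hypothetical pattern: named groups replace positional indices.
BID_PATTERN = re.compile(
    r"""
    ^(?P<bidder_rank>\d+)\s+A\)\s+\S+   # rank, then the first amount (skipped)
    \s+(?P<bidder_id>VC\d+)             # vendor ID
    \s+(?P<bidder_info>.*)              # rest of the first line
    [\s\S]*?                            # lazily skip ahead to the total line
    A\+B\)\s+(?P<bid_total>\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)
    """,
    re.MULTILINE | re.VERBOSE,
)

rows = []
for match in BID_PATTERN.finditer(sample):
    row = match.groupdict()
    row["bidder_rank"] = int(row["bidder_rank"])
    # Strip the currency formatting so the total can be used in arithmetic.
    row["bid_total"] = Decimal(row["bid_total"].replace("$", "").replace(",", ""))
    rows.append(row)

for row in rows:
    print(row)
```

Because rows is a list of dicts, it can be passed straight to pd.DataFrame(rows), which uses the dict keys as column names.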