Python: Extracting multiple lines between RegEx Matches
Question:
Good evening,
I am converting a PDF into CSV using Python, and I am using RegEx to extract the information.
The raw text, after extracting text from PDF, could look like this:
Account Transaction Details
Twin Account 123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer
I used a RegEx [0-3]{1}[0-9]{1}\s[A-Z]{1}[a-z]{2}\s[?A-Za-z]{1,155}
and managed to get the needed transactions:
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
However, the additional information between the matches was dropped, because I split the text on \n
and then ran the RegEx on each line.
How do I write the code so that I also get the additional information between the RegEx matches, attached to the previous match? This is my envisaged output:
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78 OTHR ANTON HARLEY Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer
Edit:
I have adapted @dcsuka's solution and got the following:
06 Jan Debit-Consumer 12.60 123,456.78 SNIP AVENU13568100 4265884035605848
06 Jan Inward DR - 828.24 123,456.78 SHIP G12345HUJ ITX
07 Jan Funds Transfer 50.00 123,456.78 Pleasenotethatyouareboundbyadutyundertherulesgoverningtheoperationofthisaccount,tochecktheentriesintheabovestatement. Ifyoudonotnotifyusinwritingofanyerrors, omissionsorunauthoriseddebitswithinfourteen(14)daysofthisstatement,theentriesaboveshallbedeemedvalid,correct,accurateandconclusivelybindinguponyou,andyoushallhaveno claim against the bank in relation thereto. XYZ Ltd • 80 QuincyPlace ABC Plaza XXX 12345 • Co. Reg. No. 1234567890Z • GST Reg. No. YY-8121234-2 • www.xyzabc.com
07 Jan Inward CR - SPEED 9,092.06 123,456.78 SALAD SALAS Payment CARL QWE 817264950
How do I remove the excess words "Pleasenotethatyouareboundbyadut..."? The only pattern I can see is that it starts with a very long word (probably more than 20 characters). Is that the way to go?
Edit2:
@dcsuka has adjusted the code to remove the ‘noise’ based on words of 20 or more characters. Thank you dcsuka!
Answers:
You can try using a positive lookahead for a number after newline when you split the string, to get bigger chunks more reflective of your expected output:
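import re
# Split only at newlines followed by a day number, so continuation lines stay with their transaction.
split_text = re.split(r"\n(?=\d{1,3}\s)", text1)
# Keep chunks that start with a two-digit day, collapse whitespace, and cut everything from the first 20+ character word onward.
[re.sub(r"\s?\w{20,}.*$", "", " ".join(i.split())) for i in split_text if re.search(r"^\d\d\s", i)]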
# ['01 Jan BALANCE B/F 123,456.78',
# '03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690',
# '03 Jan Inward Credit-QUICK 3,000.84 123,456.78 WIRE OTHR ANTON HARLEY Other',
# '03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer']
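For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same split-then-clean idea. The names SAMPLE_TEXT and extract_transactions are only illustrative, and the sample text is copied from the question; real statements may need a stricter pattern.

```python
import re

# Sample text copied from the question (illustrative input only).
SAMPLE_TEXT = """Account Transaction Details
Twin Account 123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer"""


def extract_transactions(text):
    """Return one string per transaction, with continuation lines appended."""
    # Split only at newlines that are followed by 1-3 digits and whitespace.
    # The lookahead leaves the digits in the next chunk.
    chunks = re.split(r"\n(?=\d{1,3}\s)", text)
    rows = []
    for chunk in chunks:
        # Skip chunks that do not start with a two-digit day (e.g. the header).
        if not re.search(r"^\d\d\s", chunk):
            continue
        # Collapse newlines and repeated spaces into single spaces.
        row = " ".join(chunk.split())
        # Drop everything from the first word of 20+ word characters onward.
        row = re.sub(r"\s?\w{20,}.*$", "", row)
        rows.append(row)
    return rows


rows = extract_transactions(SAMPLE_TEXT)
for row in rows:
    print(row)
```

Note that the 20-character cut-off is a heuristic: a genuine reference number of 20 or more characters would also be cut, so the threshold may need adjusting for other statements.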
I looked at this again after learning more about regex.
As @dcsuka suggested, I needed a positive lookahead, so that my regex does not consume the delimiter pattern that I set at the end.
This was the code I used:
line_re = re.compile(r'(^[0-9]{2}) ([A-Z]{1}[a-z]{2}) (.*?)(?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,})', flags=re.M | re.S)
First, I grouped them into:
- Date, using (^[0-9]{2}), with the ‘^’ to indicate the start of a line, since the date is always 2 digits (01 or 11)
- Month, using ([A-Z]{1}[a-z]{2}), since the month is Dec / Jan / Feb …
- My main capture, using (.*?), which is the description in this case
- A lookahead for the next date and month, or for a long run of letters, using (?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,}), which ends the capture without consuming what follows
- Lastly, the flags re.M | re.S: re.M lets ‘^’ match at the start of every line, and re.S lets ‘.’ match newlines, so the description can span several lines.
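To make the effect of the two flags concrete, here is a small sketch on a made-up three-line string (the variable names are only for illustration):

```python
import re

sample = "01 Jan first\nextra\n03 Jan second"

# Without re.M, ^ anchors only at the very start of the string.
starts_plain = re.findall(r"^[0-9]{2}", sample)
# With re.M, ^ also anchors right after every newline.
starts_multiline = re.findall(r"^[0-9]{2}", sample, flags=re.M)

# Without re.S, . does not match a newline, so .* stops at the end of the line.
tail_plain = re.search(r"first.*", sample).group()
# With re.S, . matches newlines too, so .* runs to the end of the string.
tail_dotall = re.search(r"first.*", sample, flags=re.S).group()

print(starts_plain)       # ['01']
print(starts_multiline)   # ['01', '03']
print(repr(tail_plain))   # 'first'
print(repr(tail_dotall))  # 'first\nextra\n03 Jan second'
```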
Once done, I used re.findall() with line_re
to search for all matches.
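Put together, the approach could look like the sketch below, run on the sample text from the question. One thing here is my own assumption and not part of the pattern above: I added \Z as a third lookahead alternative, because on this sample nothing follows the last transaction (no next date, no long word), so without it the last entry would not be matched. The names text and rows are illustrative.

```python
import re

# Sample text copied from the question (illustrative input only).
text = """Account Transaction Details
Twin Account 123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer"""

line_re = re.compile(
    # Day, month, then a lazy capture of everything up to the lookahead.
    r"(^[0-9]{2}) ([A-Z][a-z]{2}) (.*?)"
    # Stop before the next "DD Mon" line, before a 15+ letter word,
    # or at the end of the text (\Z is an addition, see the note above).
    r"(?=\n[0-9]{2} [A-Z][a-z]{2}|[A-Za-z]{15,}|\Z)",
    flags=re.M | re.S,
)

# findall returns one (day, month, rest) tuple per transaction;
# collapse the newlines inside each capture into single spaces.
rows = [
    (day, month, " ".join(rest.split()))
    for day, month, rest in line_re.findall(text)
]

for row in rows:
    print(row)
```

Each tuple can then be written out as one CSV row, with the day and month already separated from the description.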
Hope this helps.