Python: Extracting multiple lines between RegEx Matches

Question:

Good evening,

I am converting a PDF into CSV using Python and am using RegEx to extract the information.

The raw text, after extracting text from PDF, could look like this:

Account Transaction Details
Twin Account   123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78  
03 Jan Funds Transfer 195.04 123,456.78  
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78  
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78  
PIB8452145632845963
Abricot 480
OTHR Transfer

I used the RegEx [0-3]{1}[0-9]{1}\s[A-Z]{1}[a-z]{2}\s[?A-Za-z]{1,155} and managed to get the needed transactions:

01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
03 Jan Funds Trf - SPEED 3,500.00 123,345.78

However, the additional information between the matches was dropped because I split the text on \n and then ran the RegEx on each line separately.

How do I write the code so that I also get the additional information between the RegEx matches, with that information attached to the previous match? This is my envisaged output:

01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78 OTHR ANTON HARLEY Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer

Edit:

I have adapted @dcsuka's solution and got the following:

06 Jan Debit-Consumer 12.60 123,456.78   SNIP AVENU13568100 4265884035605848

06 Jan Inward DR - 828.24 123,456.78   SHIP G12345HUJ ITX

07 Jan Funds Transfer 50.00 123,456.78   Pleasenotethatyouareboundbyadutyundertherulesgoverningtheoperationofthisaccount,tochecktheentriesintheabovestatement. Ifyoudonotnotifyusinwritingofanyerrors, omissionsorunauthoriseddebitswithinfourteen(14)daysofthisstatement,theentriesaboveshallbedeemedvalid,correct,accurateandconclusivelybindinguponyou,andyoushallhaveno claim against the bank in relation thereto. XYZ Ltd  •  80 QuincyPlace ABC Plaza XXX 12345  •  Co. Reg. No. 1234567890Z  •  GST Reg. No. YY-8121234-2  •   www.xyzabc.com

07 Jan Inward CR - SPEED 9,092.06 123,456.78   SALAD SALAS Payment CARL QWE 817264950

How do I remove the excess words "Pleasenotethatyouareboundbyadut..."? The only pattern I can see is that the noise starts with a very long word (probably more than 20 characters). Is that the way to go?

Edit2:

@dcsuka has adjusted the code to remove the ‘noise’ by cutting the text at the first word of 20 or more characters. Thank you dcsuka!

Asked By: Madwolf


Answers:

You can try splitting the string on newlines with a positive lookahead for a number, so that the split happens only before lines that start with a date. This gives bigger chunks that better reflect your expected output:

import re

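# text1 is the raw text extracted from the PDF.
# Split only at newlines that are followed by 1-3 digits and whitespace.
split_text = re.split(r"\n(?=\d{1,3}\s)", text1)

# Keep chunks that start with a two-digit day, collapse whitespace,
# and cut everything from the first word of 20+ characters (the footer noise).
[re.sub(r"\s?\w{20,}.*$", "", " ".join(i.split())) for i in split_text if re.search(r"^\d\d\s", i)]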

# ['01 Jan BALANCE B/F 123,456.78',
#  '03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690',
#  '03 Jan Inward Credit-QUICK 3,000.84 123,456.78 WIRE OTHR ANTON HARLEY Other',
#  '03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer']
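
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch using the sample text from the question. The names raw_text, group_transactions and noise_len are hypothetical and only for illustration, and the split pattern assumes every transaction line starts with a two-digit day.

```python
import re

# Sample text copied from the question. `raw_text`, `group_transactions`
# and `noise_len` are illustrative names, not part of the original answer.
raw_text = """Account Transaction Details
Twin Account   123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer"""


def group_transactions(text, noise_len=20):
    """Return one flattened string per transaction, continuation lines included."""
    # Split only at newlines followed by a two-digit day and whitespace, so
    # continuation lines stay attached to the transaction above them.
    chunks = re.split(r"\n(?=\d{2}\s)", text)
    rows = []
    for chunk in chunks:
        if not re.match(r"\d{2}\s", chunk):
            continue  # skip the header block before the first transaction
        flat = " ".join(chunk.split())  # collapse newlines and repeated spaces
        # Cut everything from the first run of `noise_len` or more word characters.
        flat = re.sub(r"\s?\w{%d,}.*$" % noise_len, "", flat)
        rows.append(flat)
    return rows


for row in group_transactions(raw_text):
    print(row)
```

Note that the 20-character threshold is a heuristic: a genuine reference number of 20 or more letters and digits would be cut off too, so the limit may need tuning for other statements.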
Answered By: dcsuka

I looked at this again after learning more about regex.

As @dcsuka suggested, I needed a positive lookahead, so that my regex does not consume the text that marks where a match should end (the start of the next transaction).

This was the code I used:

line_re = re.compile(r'(^[0-9]{2}) ([A-Z]{1}[a-z]{2}) (.*?)(?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,})', flags=re.M | re.S)

First, I grouped them into:

  1. Day using (^[0-9]{2}), with the ‘^’ to anchor it to the start of a line, since the day is always 2 digits (01 or 11)
  2. Month using ([A-Z]{1}[a-z]{2}), since the month would be Dec/ Jan/ Feb …
  3. My main capture using (.*?), which is the description and everything after it in this case, matched lazily
  4. A positive lookahead, (?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,}), which ends the capture just before the next line that starts with a day and month, or before a run of 15 or more letters (the footer noise), without consuming either
  5. Lastly, I set flags=re.M | re.S: re.M (multi-line) lets ‘^’ match at the start of every line, and re.S (single-line) lets ‘.’ match newlines, so that a single match can span several lines.

Once done, I used re.findall(line_re, text) to search for all matches, where text is the text extracted from the PDF.
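
The steps above can be sketched as a small runnable example. This is an illustration under assumptions rather than the exact code from this answer: sample_text, extract_rows and the extra \Z alternative in the lookahead are additions. Without \Z, the last transaction in the text is only matched if another dated line or a long word follows it.

```python
import re

# Pattern from this answer, with one labeled addition: `\Z` in the lookahead
# (an assumption) so the final transaction in the text is captured as well.
line_re = re.compile(
    r"^([0-9]{2}) ([A-Z][a-z]{2}) (.*?)"
    r"(?=\n[0-9]{2} [A-Z][a-z]{2}|[A-Za-z]{15,}|\Z)",
    flags=re.M | re.S,
)

# Hypothetical sample built from lines in the question.
sample_text = (
    "Date Description Withdrawals Deposits Balance\n"
    "01 Jan BALANCE B/F 123,456.78\n"
    "03 Jan Funds Transfer 195.04 123,456.78\n"
    "mBK-4653112690\n"
    "07 Jan Funds Transfer 50.00 123,456.78\n"
    "Pleasenotethatyouareboundbyaduty to check"
)


def extract_rows(text):
    """Return one 'day month description' string per transaction."""
    rows = []
    # With three capture groups, findall yields (day, month, description) tuples.
    for day, month, desc in line_re.findall(text):
        # The description may contain newlines; collapse all whitespace.
        rows.append(f"{day} {month} {' '.join(desc.split())}")
    return rows


for row in extract_rows(sample_text):
    print(row)
```

The 15-letter limit in the lookahead plays the same role as the 20-character limit in the other answer: it is a heuristic for the footer text, so a legitimate description containing a word of 15 or more letters would be truncated at that word.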

Hope this helps.

Answered By: Madwolf