Python regular expression to extract string from python dataframe

Question:

I wrote a PDF extraction script in Python that reads each document into a Python string. I am trying to extract data from different PDFs, and the structure of the addresses in each document is slightly different. Here are some examples:

Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean
Kav. 12-14A

Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN,
KEC. PASAR REBO, JAKARTA TIMUR

Alamat :JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA
LANTAI 6 KUNINGAN TIMUR- SETIABUDI JAKARTA
SELATAN

Alamat :GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO
KAV.15, TEBET

AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT
SUBROTO KAV. 35-36

I want to extract everything after the ":". Is there a regular expression that matches all of the addresses above?

Asked By: htm_01


Answers:

Using re.search() is one possible approach:

(?:Alamat|AHUAlamat): a non-capturing group which matches either "Alamat" or "AHUAlamat".
\s*: matches any number of whitespace characters (including none).
:: matches the colon character.
(.*): a capturing group which matches any series of characters except newlines.

import re

data_str = """Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean Kav. 12-14A
Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN, KEC. PASAR REBO, JAKARTA TIMUR
Alamat :JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA LANTAI 6 KUNINGAN TIMUR- SETIABUDI Jakarta SELATAN
Alamat :GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO KAV.15, TEBET
AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT SUBROTO KAV. 35-36
"""

pattern = r'(?:Alamat|AHUAlamat)\s*:(.*)'
addresses = data_str.splitlines()

for address in addresses:
    match = re.search(pattern, address)
    if match:
        print(match.group(1).strip())

Note: If every line of the string has the same structure, with a ":" before the address, then split() alone can do the job:

lst_data = data_str.splitlines()
addresses = [address.split(':')[-1] for address in lst_data]
print(*addresses, sep='\n')

Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean Kav. 12-14A
JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN, KEC. PASAR REBO, JAKARTA TIMUR
JL. HR. RASUNA SAID KAV 1-2, GRAHA IRAMA LANTAI 6 KUNINGAN TIMUR- SETIABUDI Jakarta SELATAN
GD. GRAHA PRATAMA LT.10, JL. MT. HARYONO KAV.15, TEBET
GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT SUBROTO KAV. 35-36
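
One caveat with split(':')[-1] is that it keeps only the text after the last colon, so an address that itself contains a colon would be cut short. A minimal sketch using str.partition, which splits on the first colon only, is shown here; the second sample line is a made-up address used purely for illustration:

```python
# Sample lines; the second one is a hypothetical address that contains
# a colon, invented here only to show the difference.
lines = [
    "Alamat :JL USMAN NO. 42, RT 8/4",
    "Alamat :Ruko Mega Blok A: No. 5",
]

# partition(':') returns (before, separator, after) for the FIRST colon,
# so any later colons stay inside the address.
addresses = [line.partition(":")[2].strip() for line in lines]

# For comparison: splitting on every colon and taking the last piece
# loses the start of the second address.
last_piece = [line.split(":")[-1].strip() for line in lines]

print(*addresses, sep="\n")
```
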
Answered By: Jamiu S.

Your data comes after the ":" and ends at a blank line, so you can use this regex:

/Alamat *:(((?!(\n\n|\r\n\r\n)).)*)/gms
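
The pattern above is written in /.../gms notation, which is not Python syntax; in Python the flags are passed to the re module instead. A rough Python sketch of the same idea follows, assuming the text keeps the question's layout (wrapped address lines, records separated by a blank line). Note that it swaps in a lazy quantifier with a lookahead rather than the character-by-character negative lookahead, and the names text, pattern, and addresses are only illustrative:

```python
import re

# Sample text in the question's original layout: each address may wrap
# across several lines, and records are separated by a blank line.
text = """Alamat :Menara Bank Mega, Lantai 24, Jl. Kapten P Tendean
Kav. 12-14A

Alamat :JL USMAN NO. 42, RT 8/4, KEL. KELAPA DUA WETAN,
KEC. PASAR REBO, JAKARTA TIMUR

AHUAlamat :GEDUNG BERITASATU PLAZA LT. 8, JL. JEND. GATOT
SUBROTO KAV. 35-36
"""

# Lazily capture everything after "Alamat :" up to the next blank line
# (LF or CRLF) or the end of the text. re.DOTALL lets "." cross line breaks.
pattern = re.compile(r"Alamat\s*:(.*?)(?=\r?\n\r?\n|\Z)", re.DOTALL)

# Collapse the wrapped lines of each address into single spaces.
addresses = [" ".join(m.split()) for m in pattern.findall(text)]

for address in addresses:
    print(address)
```

Because "AHUAlamat" ends in "Alamat", the same pattern picks up both prefixes without an alternation.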


Answered By: android dev