How to make a regex run faster on long paragraph data

Question

I want to use a regex to extract the heading from each row of a paragraph column in a pandas DataFrame. The headings are written in uppercase and include titles, headings followed by letters of the alphabet, and chapter headings (BAB) followed by numbers. I asked ChatGPT first, but the code it suggested always takes a long time to run, sometimes 30 minutes or more.

Here is the code:

import re
import pandas as pd

# Create the dataframe
df = pd.DataFrame({'Paragraph': [
'PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA bahwa untuk melaksanakan ketentuan Pasal 97, Pasal 101, pasal 104 dan Pasal 106',
'BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan',
'BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN Pasal 2 (1) Pengaturan tata cara',
'BAB III RENCANA UMUM PEMELIHARAAN JALAN Pasal 3 (1) Penyelenggara jalan wajib menyusun'
]})

# Define the regular expression pattern to match the headings
pattern = r'^([A-Z]+\s*)+(BAB\s+[IVX]+|[A-Z]+.\s*[a-z]+.)'

# Extract the headings using the regular expression pattern
df['Heading'] = df['Paragraph'].apply(lambda x: re.match(pattern, x).group(0))

# Print the resulting dataframe
print(df)

I want output like this:

Paragraph	Heading
PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA bahwa untuk melaksanakan ketentuan Pasal 97, Pasal 101, pasal 104 dan Pasal 106	PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA
BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan	BAB I KETENTUAN UMUM
BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN Pasal 2 (1) Pengaturan tata cara	BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN
BAB III RENCANA UMUM PEMELIHARAAN JALAN Pasal 3 (1) Penyelenggara jalan wajib menyusun	BAB III RENCANA UMUM PEMELIHARAAN JALAN

Asked By: Annisa Lianda

Answer 1

For your sample data, you can use str.extract, using this regex:

^((?:.(?![A-Z]?[a-z]))+)

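which matches:

^ : start of the string
(?:.(?![A-Z]?[a-z]))+ : one or more characters, each of which is not followed by a lowercase letter (optionally preceded by an uppercase letter)

The outer parentheses capture the result, so in the sample data the match stops just before the space that precedes the first lowercase or capitalised word (for example bahwa or Pasal).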
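
To make the pieces easier to read, the same pattern can be written with re.VERBOSE so that each part carries its own comment. This is only an annotated sketch of the regex above; the name HEADING is an assumption for illustration, and the sample string is one of the rows from the question.

```python
import re

# Annotated form of the regex from the answer (HEADING is a hypothetical name).
HEADING = re.compile(r"""
    ^                      # anchor at the start of the string
    (                      # group 1: the heading text
      (?:
        .                  # consume one character ...
        (?![A-Z]?[a-z])    # ... but only if what follows is neither a lowercase
                           #     letter nor an uppercase letter plus a lowercase one
      )+
    )
""", re.VERBOSE)

sample = "BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan"
heading = HEADING.match(sample).group(1)
print(heading)  # BAB I KETENTUAN UMUM
```

The space before "Pasal" cannot be consumed because it is followed by "Pa", so the match ends right after "UMUM".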

For your sample data:

df['Heading'] = df['Paragraph'].str.extract(r'^((?:.(?![A-Z]?[a-z]))+)')

Output:

0    PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA
1                                   BAB I KETENTUAN UMUM
2          BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN
3                BAB III RENCANA UMUM PEMELIHARAAN JALAN
Name: Heading, dtype: object
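
If you prefer to stay with the re module, a minimal sketch of the same idea is below. It assumes the regex from this answer, compiles it once, and returns None for rows without a heading (the question's re.match(...).group(0) raises AttributeError on such rows, whereas str.extract gives NaN). The names HEADING_RE and extract_heading are hypothetical.

```python
import re

# Compile once and reuse for every row (pattern taken from the answer above).
HEADING_RE = re.compile(r'^((?:.(?![A-Z]?[a-z]))+)')

def extract_heading(paragraph):
    """Return the leading uppercase heading, or None if the row has none."""
    m = HEADING_RE.match(paragraph)
    return m.group(1) if m else None

paragraphs = [
    'PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA bahwa untuk melaksanakan ketentuan Pasal 97, Pasal 101, pasal 104 dan Pasal 106',
    'BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan',
    'BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN Pasal 2 (1) Pengaturan tata cara',
    'BAB III RENCANA UMUM PEMELIHARAAN JALAN Pasal 3 (1) Penyelenggara jalan wajib menyusun',
]

headings = [extract_heading(p) for p in paragraphs]
for h in headings:
    print(h)
```

With a DataFrame, the same function could be applied as df['Paragraph'].map(extract_heading).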

Answered By: Nick
