How to make a regex run faster on long paragraph data

Question:

I want to use regex to extract text from the Paragraph column of a pandas DataFrame. The text to extract includes uppercase text, headings followed by letters of the alphabet, and chapter headings followed by Roman numerals. I previously asked ChatGPT, but the code it suggested always takes a long time to execute, sometimes more than 30 minutes.

Here is the code:

import re
import pandas as pd

# Create the dataframe
df = pd.DataFrame({'Paragraph': [
'PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA bahwa untuk melaksanakan ketentuan Pasal 97, Pasal 101, pasal 104 dan Pasal 106',
'BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan',
'BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN Pasal 2 (1) Pengaturan tata cara',
'BAB III RENCANA UMUM PEMELIHARAAN JALAN Pasal 3 (1) Penyelenggara jalan wajib menyusun'
]})

# Define the regular expression pattern to match the headings
pattern = r'^([A-Z]+\s*)+(BAB\s+[IVX]+|[A-Z]+.\s*[a-z]+.)'

# Extract the headings using the regular expression pattern
df['Heading'] = df['Paragraph'].apply(lambda x: re.match(pattern, x).group(0))

# Print the resulting dataframe
print(df)

I want output like this:

Paragraph | Heading
PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA bahwa untuk melaksanakan ketentuan Pasal 97, Pasal 101, pasal 104 dan Pasal 106 | PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA
BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini yang dimaksud dengan | BAB I KETENTUAN UMUM
BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN Pasal 2 (1) Pengaturan tata cara | BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN
BAB III RENCANA UMUM PEMELIHARAAN JALAN Pasal 3 (1) Penyelenggara jalan wajib menyusun | BAB III RENCANA UMUM PEMELIHARAAN JALAN
Asked By: Annisa Lianda

Answers:

For your sample data, you can use str.extract with this regex:

^((?:.(?![A-Z]?[a-z]))+)

which matches:

  • ^ : start of line
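  • (?:.(?![A-Z]?[a-z]))+ : one or more characters, each of which is not followed by a lowercase letter or by an uppercase letter and then a lowercase letter, so the match stops just before the first word that is not fully uppercase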
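
To see how the pieces above behave outside pandas, here is a minimal sketch using only the standard `re` module. The helper name `extract_heading` and the shortened sample strings are mine, not part of the answer:

```python
import re

# Pattern from the answer, compiled once so it can be reused for every string.
HEADING = re.compile(r'^((?:.(?![A-Z]?[a-z]))+)')

# Shortened versions of two sample paragraphs from the question.
samples = [
    'PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA bahwa untuk melaksanakan',
    'BAB I KETENTUAN UMUM Pasal 1 Dalam Peraturan Menteri ini',
]

def extract_heading(text):
    """Return the leading uppercase heading, or None when there is no match."""
    m = HEADING.match(text)
    return m.group(1) if m else None

headings = [extract_heading(s) for s in samples]
for h in headings:
    print(repr(h))
# 'PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA'
# 'BAB I KETENTUAN UMUM'
```

Returning None instead of calling .group() directly avoids an AttributeError on paragraphs that do not start with a heading.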

Applied to the Paragraph column:

df['Heading'] = df['Paragraph'].str.extract(r'^((?:.(?![A-Z]?[a-z]))+)')

Output:

0    PERATURAN MENTERI PEKERJAAN UMUM REPUBLIK INDONESIA
1                                   BAB I KETENTUAN UMUM
2          BAB II MAKSUD, TUJUAN, DAN LINGKUP PENGATURAN
3                BAB III RENCANA UMUM PEMELIHARAAN JALAN
Name: Heading, dtype: object
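
As for why the original code may have been so slow: a likely cause (my assumption, it has not been profiled on the real data) is the quantified group that itself contains a quantifier, as in `([A-Z]+\s*)+`. When the rest of the pattern fails to match, a backtracking engine tries every way of splitting the run of uppercase letters, which can grow exponentially with its length. The sketch below uses a simplified stand-in pattern, `NESTED`, rather than the exact pattern from the question, and compares it with the pattern from this answer on short inputs:

```python
import re
import time

# Simplified stand-in with a nested quantifier. This is an assumption made for
# illustration and is not the exact pattern from the question.
NESTED = re.compile(r'^([A-Z]+\s*)+$')

# Pattern from the answer: a single pass with a short lookahead per character.
LINEAR = re.compile(r'^((?:.(?![A-Z]?[a-z]))+)')

def time_match(pattern, text):
    """Run pattern.match(text) and return (match object or None, seconds)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    # A run of uppercase letters ending in a character that makes NESTED fail.
    text = 'A' * n + 'a'
    nested_result, nested_secs = time_match(NESTED, text)
    linear_result, linear_secs = time_match(LINEAR, text)
    print(n, f'nested={nested_secs:.6f}s', f'linear={linear_secs:.6f}s')
```

The timings depend on the machine, but the nested pattern should get noticeably slower with each step in n, while the lookahead pattern stays roughly flat. Keep n small when trying this, since each extra character can about double the work for the nested pattern.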
Answered By: Nick