regex: cleaning text: remove everything up to a certain line

Question:

I have a text file containing The Tragedie of Macbeth. I want to clean it, and the first step is to remove everything up to the line The Tragedie of Macbeth and store the remaining part in removed_intro_file.

I tried:

import re
filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    removed_intro = file.read()
    with open('removed_intro_file', 'w') as output:
        removed = re.sub(title, '', removed_intro)
        print(removed)
        output.write(removed)

The print statement doesn’t print anything, so it seems nothing matches. How can I use a regex across several lines? Should one instead use pointers to the start and end of the lines to be removed? I’d also be glad to know if there is a nicer way to solve this, maybe without regex.

Asked By: Wilma

||

Answers:

Your regex only replaces the title with ''. You want to remove the title and all the text before it, so match every character (including newlines) from the beginning of the string through the title. This should work (I only tested it on a sample file I wrote):

removed = re.sub(r'(?s)^.*'+re.escape(title), '', removed_intro)
Answered By: Swifty
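
One thing to check with the pattern above: `.*` is greedy, so if the title occurs more than once in the file, the match runs up to the last occurrence. The sketch below compares the greedy and lazy forms on a small made-up sample (the sample text is an assumption for illustration, not the real file):

```python
import re

title = 'The Tragedie of Macbeth'

# Hypothetical sample text; the title appears twice on purpose.
sample = (
    "Some licence header\n"
    "More front matter\n"
    "The Tragedie of Macbeth\n"
    "Actus Primus. Scoena Prima.\n"
    "FINIS. The Tragedie of Macbeth\n"
    "Trailing note\n"
)

# (?s) lets '.' match newlines; re.escape protects any special characters in the title.
# Greedy '.*' consumes as much as possible, so it stops at the LAST title.
greedy = re.sub(r'(?s)^.*' + re.escape(title), '', sample)

# Lazy '.*?' consumes as little as possible, so it stops at the FIRST title.
lazy = re.sub(r'(?s)^.*?' + re.escape(title), '', sample, count=1)

print(repr(greedy))
print(repr(lazy))
```

If the title is known to occur only once, both forms give the same result.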

We can try reading your file line by line until hitting the target line. After that, read all subsequent lines into the output file.

filename, title = 'MacBeth.txt', 'The Tragedie of Macbeth'
with open(filename, 'r') as file:
    for line in file:                    # discard all lines before the Macbeth title
        if line.strip() == title:        # each line keeps its trailing newline, so strip it
            break
    else:                                # reached end of file without finding the title
        raise ValueError('title line not found in ' + filename)
    rest = file.read()                   # read everything after the title line
    with open('removed_intro_file', 'w') as output:
        output.write(title + "\n" + rest)

This approach is probably faster than the regex approach, since it stops comparing lines as soon as it reaches the title line.

Answered By: Tim Biegeleisen
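
Since the question also asks for a way without regex, another option is `str.partition`, which splits a string at the first occurrence of a separator. This is a minimal sketch assuming the whole file fits in memory; the function name `drop_intro` and the `keep_title` parameter are hypothetical, chosen for illustration:

```python
def drop_intro(text, title, keep_title=True):
    """Return text with everything before the first occurrence of title removed.

    If title does not occur, the text is returned unchanged.
    """
    before, found, after = text.partition(title)
    if not found:               # partition returns an empty separator when there is no match
        return text
    return found + after if keep_title else after


title = 'The Tragedie of Macbeth'
sample = "intro line\nThe Tragedie of Macbeth\nActus Primus.\n"   # made-up sample text
print(drop_intro(sample, title))
```

To apply it to the file, read the contents with `file.read()`, pass them through `drop_intro`, and write the result to removed_intro_file.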