How to start a new line after each sentence that ends with a period?

Question:

I have a large text file that I am processing in Python. I want to put each sentence on its own line, so that every line contains exactly one sentence.

For example:

Input:

The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world". Numerous attempts in the 21. century to settle the debate.


Output:

The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. 
Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world".
Numerous attempts in the 21. century to settle the debate.

I tried:

with open("new_all_data.txt", 'r') as text, open("new_all_data2.txt", "w") as new_text2:
    text_lines = text.readlines()

    for line in text_lines:

        if "." in line:

           new_lines = line.replace(".", ".\n")
           new_text2.write(new_lines)

It puts each sentence on a new line; however, it also breaks the line after every other ".", not only at the end of a sentence.

For example:

The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. 
Considered an archetypal masterpiece of the Italian Renaissance, it has been described as "the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world".
Numerous attempts in the 21.
century to settle the debate.

I want to keep "Numerous attempts in the 21. century to settle the debate" in one line.

Asked By: rocinantes

Answers:

You only need to replace periods followed by a space and a capital letter:

import re

with open("new_all_data.txt", 'r') as text, open("new_all_data2.txt", "w") as new_text2:
    text_lines = text.readlines()
    for line in text_lines:
        if "." in line:
            new_lines = re.sub(
                r"(?<=\.) (?=[A-Z])",  # a space between a period and a capital letter
                "\n",
                line
            )
            new_text2.write(new_lines)

I use the re module, whose re.sub function performs regex-based replacements. In each line, I search for spaces that match the following regex: (?<=\.) (?=[A-Z])

  • The space must have a period right before it. I use (?<=xxx), a positive lookbehind, which makes sure that the match has xxx just before it. \. matches a literal period (an unescaped . would match any character), so "(?<=\.) " (note the space at the end) matches only spaces that have a period right before them.
  • The space must have a capital letter right after it. I use (?=xxx), a positive lookahead, which makes sure that the match has xxx just after it. [A-Z] matches any ASCII capital letter, so " (?=[A-Z])" (note the space at the beginning) matches only spaces that have a capital letter right after them.

Combining these two conditions should be enough to replace only the spaces between two sentences with a newline.
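
To check the pattern without touching any files, here is a minimal sketch that applies it to a shortened version of the sample text from the question. The function name split_sentences and the constant SENTENCE_GAP are only illustrative names, not part of the code above.

```python
import re

# A single space that has a literal period right before it
# and an ASCII capital letter right after it.
SENTENCE_GAP = re.compile(r"(?<=\.) (?=[A-Z])")


def split_sentences(text: str) -> str:
    """Replace each space between two sentences with a newline."""
    return SENTENCE_GAP.sub("\n", text)


sample = (
    "The Mona Lisa is a half-length portrait painting by Italian artist "
    "Leonardo da Vinci. Numerous attempts in the 21. century to settle "
    "the debate."
)

# "21. century" stays on one line because "c" is lowercase.
print(split_sentences(sample))
```

Note that this is a heuristic: a period followed by a space and a capital letter inside a sentence (for example an abbreviation such as "Dr. Smith") would also be split.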

Answered By: leleogere