Python to combine lines in a text file

Question:

I have a question about combining lines in a text file.

The file contents are shown below (movie subtitles). I want to combine the English words and sentences in each subtitle block into one line, instead of having them spread across 1, 2 or 3 separate lines.

Which method is feasible in Python?

1
00:00:23,343 --> 00:00:25,678
Been a while since I was up here
in front of you.

2
00:00:25,762 --> 00:00:28,847
Maybe I'll do us all a favour
and just stick to the cards.

3
00:00:31,935 --> 00:00:34,603
There's been speculation that I was
involved in the events that occurred
on the freeway and the rooftop...

4
00:00:36,189 --> 00:00:39,233
Sorry, Mr Stark, do you
honestly expect us to believe that

5
00:00:39,317 --> 00:00:42,903
that was a bodyguard
in a suit that conveniently appeared,

6
00:00:42,987 --> 00:00:45,698
despite the fact
that you sorely despise bodyguards?

7
00:00:45,782 --> 00:00:46,907
Yes.

8
00:00:46,991 --> 00:00:51,662
And this mysterious bodyguard
was somehow equipped
Asked By: Mark K

Answers:

The pattern seems to be:

  1. a line with just a number,
  2. the next line with timing information, and
  3. one or more lines of text, followed by a blank line.

I would write a loop that reads lines 1) and 2), and then a nested loop that reads lines 3) until it finds a blank line. This nested loop could join those lines into a single line.
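
A minimal sketch of this nested-loop approach, assuming the file follows the number / timing / text / blank-line pattern exactly. The function name `combine_subtitle_lines` and the `sample` string are made up for illustration:

```python
import io


def combine_subtitle_lines(lines):
    """Yield output lines with each block's text joined into one line."""
    it = iter(lines)
    for number in it:                      # 1) line with just a number
        number = number.strip()
        if not number:                     # tolerate extra blank lines
            continue
        timing = next(it, "").strip()      # 2) timing line
        text = []
        for line in it:                    # 3) text lines up to a blank line
            line = line.strip()
            if not line:
                break
            text.append(line)
        yield number
        yield timing
        yield " ".join(text)
        yield ""                           # blank line between blocks


sample = """1
00:00:23,343 --> 00:00:25,678
Been a while since I was up here
in front of you.

2
00:00:45,782 --> 00:00:46,907
Yes.
"""

for out in combine_subtitle_lines(io.StringIO(sample)):
    print(out)
```

Because the outer and inner loops share one iterator, the inner loop picks up exactly where the outer one stopped. An open file object can be passed in place of the `io.StringIO` wrapper.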

Answered By: Brent Washburne

The first block needs special handling because no blank line precedes it; the rest is what you expected.

with open('/home/cam/Documents/1.txt') as f, open('mytxt', 'w') as f_out:
    # Strip every line; blank lines become '' and mark the block boundaries.
    new_lines = [line.strip() for line in f]

    # Indices of the blank lines, with -1 standing in for the missing
    # blank line before the first block and len(new_lines) closing the last.
    space_index = [i for i, x in enumerate(new_lines) if x == ""]
    new_list = [-1] + space_index + [len(new_lines)]

    for start, end in zip(new_list, new_list[1:]):
        mylist = new_lines[start + 1:end]  # block without its blank line
        if not mylist:
            continue  # consecutive or trailing blank lines

        # Keep the number and timing lines, join the text lines into one.
        final = mylist[:2] + [" ".join(mylist[2:])]

        for line in final:
            f_out.write(line + "\n")
Answered By: Ajay

Intuitive solution

A simple solution based on the 4 types of lines you can have:

  • an empty line
  • a number indicating the position (no letters)
  • a timing for the subtitle (with a specific pattern; no letters)
  • text

You can just loop over each line, classifying them, and then act accordingly.

In fact, the “action” for a line that is neither text nor empty (a timing or number line) is the same. Thus:

import re

with open('yourfile.txt') as f:
    exampleText = f.read()

new = ''

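for line in exampleText.split('\n'):
    if line == '':
        new += '\n\n'  # end the joined text line and keep the blank separator
    elif re.search('[a-zA-Z]', line):  # check if there is text
        new += line + ' '  # join text lines with a space
    else:
        new += line + '\n'  # number or timing line stays on its own line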

Result:

>>> print(new)
1
00:00:23,343 --> 00:00:25,678
Been a while since I was up here in front of you. 

2
00:00:25,762 --> 00:00:28,847
Maybe I'll do us all a favour and just stick to the cards. 
...

Regex explained:

  • [] indicates any of the characters inside
  • a-z indicates the range of characters a-z
  • A-Z indicates the range of characters A-Z
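
To see how this check separates the four line types, here is a small sketch. The helper name `has_letters` is hypothetical and not part of the code above:

```python
import re


def has_letters(line):
    """Return True if the line contains at least one ASCII letter."""
    return re.search('[a-zA-Z]', line) is not None


# One example of each line type: empty, number, timing, text.
for line in ['', '7', '00:00:45,782 --> 00:00:46,907', 'Yes.']:
    print(repr(line), has_letters(line))
```

Note that a subtitle containing no ASCII letters (for example "..." or "1984") would be classified as a number or timing line by this check.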
Answered By: PascalVKooten

Loading requirements:

import re

with open('yourfile.txt') as f:
    exampleText = f.read()

Concise one-liner

re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', re.sub(r'([^0-9])\n', r'\g<1> ', exampleText))

The first replacement turns every newline that follows a non-digit character into a space. This joins the text lines, while the number and timing lines, which end in a digit, keep their newline:

tmp = re.sub(r'([^0-9])\n', r'\g<1> ', exampleText)

The previous replacement means we lose the blank line after each block of text. The second replacement restores it by adding a newline in front of each number-only line:

re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', tmp)
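
As a quick check of the two steps, here is a sketch run on a short sample (the `sample` string is made up from the question's data). It assumes plain newline line endings and text lines that never end in a digit; a subtitle such as "Back in 1999" would not be joined by the first pattern.

```python
import re

sample = (
    "1\n"
    "00:00:23,343 --> 00:00:25,678\n"
    "Been a while since I was up here\n"
    "in front of you.\n"
    "\n"
    "2\n"
    "00:00:45,782 --> 00:00:46,907\n"
    "Yes.\n"
)

# Step 1: a newline after a non-digit character becomes a space.
tmp = re.sub(r'([^0-9])\n', r'\g<1> ', sample)
# Step 2: put the blank line back in front of each number-only line.
result = re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', tmp)
print(result)
```

Each joined text line ends with a trailing space, and the first number line is untouched because no newline precedes it.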
Answered By: PascalVKooten