Split long sentences of a text file around the middle on comma (multiple commas)

Question

I have a .srt file that I’d like to split so I can watch it with mpv. It’s a whole book turned into .srt for language learning, with an audiobook to go along.
My problem is that it’s in Japanese, which doesn’t put spaces between words, so mpv doesn’t break long sentences. Instead, it shrinks them until they fit on one line.

I tried Subtitle Edit, but it’s not working for Japanese.

So I’m trying to write my own script, although I don’t know much about scripting.
I’m stuck on how to break a sentence that has multiple commas: how would I choose the one closest to the middle?

Here’s what I got so far:


with open("test.txt", encoding="utf8") as file:
    for line in file:
        #print(line)
        size = len(line)
        if size > 45:
            #break sentence in half, using Japanese comma 、
            pass  # not implemented yet

Here’s the text file I’m using for testing:

10
00:00:55,640 --> 00:01:09,580
クラスで一番、明るくて、優しくて、運動神経がよくて、しかも、頭もよくて、みんなその子と友達になりたがる。

11
00:01:11,090 --> 00:01:24,500
だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。

12
00:01:24,730 --> 00:01:32,250
私に近づき、「こころちゃん、ひさしぶり！」

13
00:01:32,910 --> 00:01:35,180
と挨拶をする。

14
00:01:37,450 --> 00:01:41,730
周りの子がみんな息を吞む中、「前から知ってるの。

15
00:01:42,000 --> 00:01:42,820
ね？」

16
00:01:43,820 --> 00:01:46,550
と私に目配せをする。

Asked By: Mrsha Solstice

Answer 1

Can’t you just loop through the sentence and split it by character count? This might split in the middle of a word, so if that matters you will have to add further conditions (for example, split at the nearest ‘、’, ‘を’, ‘は’, ‘。’, etc.).

with open("test.txt", encoding="utf8") as file:
    for line in file:
        line = line.rstrip("\n")  # don't count the trailing newline
        size = len(line)
        if size > 45:
            # print the line in consecutive 45-character pieces
            for i in range(0, len(line), 45):
                print(line[i:i + 45])
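
A fixed 45-character cut can leave a very short last piece (a 52-character line becomes 45 + 7). As a rough sketch of a more balanced variant, the hypothetical helper below (split_evenly and its limit parameter are names invented here, not part of the answer) works out how many pieces are needed and then makes them roughly equal in length. It still ignores word boundaries.

```python
import math

def split_evenly(text, limit=45):
    """Cut text into pieces of at most `limit` characters,
    keeping the pieces as close to equal length as possible."""
    if len(text) <= limit:
        return [text]
    pieces = math.ceil(len(text) / limit)   # how many pieces are needed
    size = math.ceil(len(text) / pieces)    # length of each piece (the last may be shorter)
    return [text[i:i + size] for i in range(0, len(text), size)]

if __name__ == "__main__":
    for part in split_evenly("あ" * 52):
        print(len(part), part)
```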

Answered By: Ayla

Answer 2

I ran into odd behavior when I tried to read and write the file with a single open, so my solution does the following: read every line into a list, go through the list to find the lines longer than 45 characters, find a comma near the middle of each, and replace the line with the text before and after that comma. Once done, write the list back to the file.

fileLines = []

def findCommaNearMiddle(line):
    length = len(line)
    middle = length // 2
    # check positions on either side of the middle until a comma is found
    distance = 0
    while distance <= middle:
        # the right-hand index can run past the end when the length is even
        if middle + distance < length and line[middle + distance] == '、':
            return middle + distance
        elif line[middle - distance] == '、':
            return middle - distance
        distance += 1
    return -1  # the line contains no comma

with open("test.txt", "r", encoding="utf8") as file:
    fileLines = file.read().split('\n')

i = 0
while i < len(fileLines):  # the list grows as lines are split
    line = fileLines[i]
    if len(line) > 45:
        middleComma = findCommaNearMiddle(line)
        if middleComma != -1:  # leave the line alone if it has no comma
            fileLines[i] = line[:middleComma]
            fileLines.insert(i + 1, line[middleComma + 1:])  # +1 to get rid of comma
    i += 1

with open("test.txt", "w", encoding="utf8") as file:
    file.write('\n'.join(fileLines))

If you want to be able to split on characters other than ‘、’, change the comparison in the two if statements to test against several characters, for example line[middle+distance] in '、。' instead of line[middle+distance] == '、'.
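
As a sketch of that multi-character idea, the helper below collects every position that holds one of the allowed break characters and picks the one closest to the middle. The names BREAK_CHARS, find_break_near_middle and split_line are made up for illustration, and the set of break characters is an assumption. Unlike the code above, it keeps the punctuation mark at the end of the first half instead of dropping it.

```python
BREAK_CHARS = "、。"  # assumed set of characters where a break is acceptable

def find_break_near_middle(line, break_chars=BREAK_CHARS):
    """Return the index of the break character closest to the middle, or -1."""
    middle = len(line) // 2
    # ignore the final character: breaking there would leave an empty second half
    candidates = [i for i, ch in enumerate(line[:-1]) if ch in break_chars]
    if not candidates:
        return -1
    return min(candidates, key=lambda i: abs(i - middle))

def split_line(line, limit=45):
    """Split a line longer than `limit` into two parts at the chosen break."""
    if len(line) <= limit:
        return [line]
    idx = find_break_near_middle(line)
    if idx == -1:
        return [line]  # nothing to break on, leave the line unchanged
    return [line[:idx + 1], line[idx + 1:]]

if __name__ == "__main__":
    sample = "クラスで一番、明るくて、優しくて、運動神経がよくて、しかも、頭もよくて、みんなその子と友達になりたがる。"
    for part in split_line(sample):
        print(part)
```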

Answered By: Gannon

Answer 3

You can locate a comma near the middle of the sentence (here, the last comma before the midpoint) and split the sentence at that comma.

with open("test.txt", encoding="utf8") as file:
    for line in file:
        size = len(line)
        if size > 45:
            # Find the last comma before the middle of the line
            middle = size // 2
            comma_index = line.rfind("、", 0, middle)  # rfind() returns the last occurrence in line[0:middle], or -1
            if comma_index == -1:  # If there is no comma before the middle, split at the middle
                split_index = middle
            else:
                split_index = comma_index + 1  # Split after the comma

            # Split the line at the split_index
            first_line = line[:split_index].strip()
            second_line = line[split_index:].strip()
            print(first_line)
            print(second_line)
        else:
            print(line.strip())
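
Searching only to the left of the midpoint can pick a comma far from the middle when a much closer one sits just to the right (subtitle 11 in the question is such a case). A minimal sketch of a two-sided search, assuming a hypothetical helper named closest_comma, compares the last comma before the middle with the first comma at or after it:

```python
def closest_comma(line, comma="、"):
    """Return the index of the comma nearest the middle of line, or -1 if none."""
    middle = len(line) // 2
    before = line.rfind(comma, 0, middle)  # last comma left of the middle, or -1
    after = line.find(comma, middle)       # first comma at or right of the middle, or -1
    if before == -1:
        return after
    if after == -1:
        return before
    # both exist: take whichever is fewer characters away from the middle
    return before if middle - before <= after - middle else after

if __name__ == "__main__":
    sample = "だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。"
    idx = closest_comma(sample)
    if idx != -1:
        print(sample[:idx + 1])  # first half, comma kept
        print(sample[idx + 1:])  # second half
```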

Answered By: Giorgi
