Split text into chunks while keeping words intact

Question:

I have a bunch of text samples. Each sample has a different length, but all of them are longer than 200 characters. I need to split each sample into substrings of approximately 50 characters. To do so, I found this approach:

import re

def chunkstring(string, length):
    return re.findall('.{%d}' % length, string)

However, this cuts the text in the middle of words. For example, the phrase "I have <…> icecream. <…>" can be split into "I have <…> icec" and "ream. <…>".

This is the sample text:

This paper proposes a method that allows non-parallel many-to-many
voice conversion by using a variant of a generative adversarial
network called StarGAN.

I get this result:

['This paper proposes a method that allows non-paral',
 'lel many-to-many voice conversion by using a varia',
 'nt of a generative adversarial network called Star']

But ideally I would like to get something similar to this result:

['This paper proposes a method that allows non-parallel',
 'many-to-many voice conversion by using a variant',
 'of a generative adversarial network called StarGAN.']

How could I adjust the above-given code to get the desired result?

Asked By: Fluxy

Answers:

You can use .{0,50}\S* so that, after matching up to 50 characters, the pattern keeps consuming any further non-whitespace characters (\S) until the current word ends.

I specified 0 as the lower bound since otherwise you’d risk missing the last substring.

EDIT:

To exclude the trailing empty chunk, use .{1,50}\S*, which forces the pattern to match at least one character.

If you also want to automatically strip the surrounding spaces, use \s*(.{1,50}\S*).
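As a minimal sketch of the last pattern applied to the sample text from the question: the helper name chunk_words is hypothetical and only wraps re.findall. Because the pattern contains one capturing group, findall returns just the group, without the leading whitespace. Note that . does not match newlines by default, so this assumes the text is a single line.

```python
import re

def chunk_words(text, length=50):
    # Hypothetical helper (not part of the original answer).
    # Match up to `length` characters, then finish the current word with \S*.
    # The leading \s* sits outside the group, so it is dropped from the result.
    pattern = r'\s*(.{1,%d}\S*)' % length
    return re.findall(pattern, text)

text = ("This paper proposes a method that allows non-parallel many-to-many "
        "voice conversion by using a variant of a generative adversarial "
        "network called StarGAN.")

chunks = chunk_words(text, 50)
for chunk in chunks:
    print(chunk)
```

Chunks can be slightly longer than 50 characters, since the pattern always finishes the word it is in rather than backing up to the previous space.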

Answered By: horcrux

To me this sounds like a task for the built-in textwrap module. An example using your data:

import textwrap
text = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
print(textwrap.fill(text,55))

Output:

This paper proposes a method that allows non-parallel
many-to-many voice conversion by using a variant of a
generative adversarial network called StarGAN.

You will probably need a few trials to find the value that best suits your needs. If you need a list of strs, use textwrap.wrap, i.e. textwrap.wrap(text,55).
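One detail that may matter here: by default textwrap is allowed to break a line after a hyphen inside a compound word, so with a narrower width a word such as "non-parallel" can end up split across two chunks. The sketch below, which assumes a width of 50 to match the question, compares the default with break_on_hyphens=False (a documented textwrap option).

```python
import textwrap

text = ("This paper proposes a method that allows non-parallel many-to-many "
        "voice conversion by using a variant of a generative adversarial "
        "network called StarGAN.")

# Default behaviour: hyphenated words may be broken after a hyphen.
default_chunks = textwrap.wrap(text, width=50)

# With break_on_hyphens=False, hyphenated words are kept whole.
chunks = textwrap.wrap(text, width=50, break_on_hyphens=False)

for chunk in chunks:
    print(chunk)
```

Unlike the regex approach, the width here is an upper bound: a chunk never exceeds 50 characters unless a single word is itself longer than that.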

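Answered By: Daweo

def nearestDelimiter(txt, cur):
    # Return the index of the closest delimiter at or before position cur,
    # or 0 if there is none.
    delimiters = " ;:.!?-—"
    i = cur
    while i >= 0:
        if txt[i] in delimiters:
            return i
        i -= 1
    return 0


def splitText(sentence, chunkLength):
    cursor = 0             # start of the current chunk
    curlng = chunkLength   # tentative end of the current chunk
    lst = []
    while curlng < len(sentence):
        cut = nearestDelimiter(sentence, curlng)
        # If no delimiter lies inside the current window, cut at the window
        # edge instead, so that the loop always makes progress.
        if cut > cursor:
            curlng = cut
        # The delimiter itself starts the next chunk; strip() removes it
        # only when it is whitespace.
        lst.append(sentence[cursor:curlng].strip())
        cursor = curlng
        curlng = min(cursor + chunkLength, len(sentence))
    lst.append(sentence[cursor:curlng].strip())
    return lst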


txt = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."

cvv = splitText(txt,50)
for cv in cvv:
    print(cv)

Answered By: Sergey Stretovich