Split text into chunks without breaking words
Question:
I have a bunch of text samples. Each sample has a different length, but all of them are longer than 200 characters. I need to split each sample into substrings of approximately 50 characters. To do so, I found this approach:
import re
def chunkstring(string, length):
    return re.findall('.{%d}' % length, string)
However, it breaks words apart. For example, the phrase "I have <…> icecream. <…>" can be split into "I have <…> icec" and "ream. <…>".
This is the sample text:
This paper proposes a method that allows non-parallel many-to-many
voice conversion by using a variant of a generative adversarial
network called StarGAN.
I get this result:
['This paper proposes a method that allows non-paral',
'lel many-to-many voice conversion by using a varia',
'nt of a generative adversarial network called Star']
But ideally I would like to get something similar to this result:
['This paper proposes a method that allows non-parallel',
'many-to-many voice conversion by using a variant',
'of a generative adversarial network called StarGAN.']
How could I adjust the above-given code to get the desired result?
Answers:
You can use .{0,50}\S* to keep matching any further non-space characters (\S). I specified 0 as the lower bound since otherwise you'd risk missing the last substring.
EDIT: to exclude the trailing empty chunk, use .{1,50}\S* instead, which forces each match to contain at least one character. If you also want to automatically strip the surrounding spaces, use \s*(.{1,50}\S*).
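A minimal sketch of the final pattern in action (assuming the backslashes that forum formatting tends to eat: \S for non-space, \s for whitespace):

```python
import re

text = ("This paper proposes a method that allows non-parallel "
        "many-to-many voice conversion by using a variant of a "
        "generative adversarial network called StarGAN.")

# Match up to 50 characters, then extend to the end of the current
# word with \S*; the leading \s* outside the group eats the space
# between chunks so it is not captured.
chunks = re.findall(r'\s*(.{1,50}\S*)', text)
for chunk in chunks:
    print(chunk)
```

Each chunk now ends on a word boundary rather than mid-word.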
This sounds like a task for the built-in textwrap module. Example using your data:
import textwrap
text = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
print(textwrap.fill(text,55))
output
This paper proposes a method that allows non-parallel
many-to-many voice conversion by using a variant of a
generative adversarial network called StarGAN.
You will probably need a few trials to find the width that suits your needs best. If you need a list of strs, use textwrap.wrap instead, i.e. textwrap.wrap(text, 55).
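For completeness, textwrap.wrap returns the chunks as a list rather than as a single joined string (textwrap.fill is just "\n".join of the same lines):

```python
import textwrap

text = ("This paper proposes a method that allows non-parallel "
        "many-to-many voice conversion by using a variant of a "
        "generative adversarial network called StarGAN.")

# wrap() returns a list of lines, each at most 55 characters wide,
# breaking only between words by default.
lines = textwrap.wrap(text, 55)
print(lines)
```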
Another option is a small helper that backtracks from the cut position to the nearest delimiter:
def nearestDelimiter(txt, cur):
    # Walk backwards from position cur until a delimiter is found
    delimiters = " ;:.!?-—"
    if txt[cur] in delimiters:
        return cur
    i = cur
    while i >= 0:
        if txt[i] in delimiters:
            return i
        i -= 1
    return 0

def splitText(sentence, chunkLength):
    cursor = 0
    curlng = chunkLength
    lst = []
    while curlng < len(sentence):
        # Pull the cut point back to the nearest delimiter so words stay whole
        curlng = nearestDelimiter(sentence, curlng)
        substr = sentence[cursor:curlng].strip()
        cursor = curlng
        curlng = cursor + chunkLength if cursor + chunkLength < len(sentence) else len(sentence)
        lst.append(substr)
    # Append whatever remains after the last cut
    lst.append(sentence[cursor:curlng].strip())
    return lst

txt = "This paper proposes a method that allows non-parallel many-to-many voice conversion by using a variant of a generative adversarial network called StarGAN."
cvv = splitText(txt, 50)
for cv in cvv:
    print(cv)