Python: Recursively split strings when longer than max allowed characters, by the last occurrence of a delimiter found before max allowed characters

Question:

I have a text transcript of dialogue, consisting of strings of variable length. The string lengths can be anywhere from a few characters to thousands of characters.

I want Python to transform the text so that every line is at most n characters long. To make the partitioning natural, I want to recursively partition the lines at the last occurrence of any of the delimiters . , ? ! within that limit. For example, let’s assume that the 72-character string below is above a threshold of 36 characters:

This, is a long, long string. It is around(?) 72 characters! Pretty cool

Since the string is longer than 36 characters, the function should recursively partition the string by the last occurrence of any delimiter within 36 characters. Recursively meaning that if the resulting partitioned strings are longer than 36 characters, they should also be split according to the same rule. In this case, it should result in a list like:

['This, is a long, long string. ', 'It is around(?) 72 characters! ', 'Pretty cool']

The list items are 30, 31, and 11 characters long, respectively. None were allowed to be over 36 characters long. Note that the partitions in this example do not occur at a , delimiter, because those weren’t the last delimiters within the 36-character threshold.

The partition sequence would’ve been something like:

'This, is a long, long string. It is around(?) 72 characters! Pretty cool'           #  72
['This, is a long, long string. ', 'It is around(?) 72 characters! Pretty cool']     #  30 + 42
['This, is a long, long string. ', 'It is around(?) 72 characters! ', 'Pretty cool']  #  30 + 31 + 11

In the odd situation that there are no delimiters in the string or in the resulting recursive partitions, the strings should be wrapped to a maximum of 36 characters using something like textwrap.wrap(). For a string without delimiters, that would produce a list like:

['There are no delimiters here so I am', ' partitioned at 36 characters'] # 36 + 29
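As a quick check of that fallback, here is a minimal sketch of what the standard-library call returns for this input. Note that with its default settings textwrap.wrap() drops the whitespace at each break, so the second piece comes out at 28 characters rather than 29:

```python
import textwrap

text = 'There are no delimiters here so I am partitioned at 36 characters'

# Wrap on word boundaries so that no piece exceeds 36 characters.
pieces = textwrap.wrap(text, width=36)

print(pieces)
# ['There are no delimiters here so I am', 'partitioned at 36 characters']
print([len(p) for p in pieces])
# [36, 28]
```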

I’ve tried to work out a Python function to achieve this, but it has been difficult. I spent a long time with ChatGPT and couldn’t get it to work despite many prompts.

Is there a Python module function that can achieve this already, or alternatively can you suggest a function that will solve this problem?


NB: Character count online tool: https://www.charactercountonline.com/

Asked By: P A N


Answers:

You can use rfind to get the last occurrence of a delimiter in the first n characters of a string.

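def partition(s, n):
    # Base case: the string already fits.
    if len(s) <= n: return [s]
    # Index of the last delimiter within the first n characters (-1 if none).
    idx = max(s.rfind(c, 0, n) for c in ['.', ',', '?', '!'])
    # Cut after the delimiter and the space assumed to follow it, then recurse.
    return [s] if idx == -1 else [s[0:idx+2], *partition(s[idx+2:], n)]

print(partition('This, is a long, long string. It is around(?) 72 characters! Pretty cool', 36))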
Answered By: Unmitigated
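
The function above returns the string unsplit when no delimiter is found, and its idx+2 slice can yield a piece of n+1 characters when the delimiter sits at index n-1. The following is a minimal sketch of one possible variant, not part of the answer above, that adds the textwrap.wrap() fallback the question asks for and never exceeds n. The names partition_with_fallback and DELIMITERS are hypothetical, and it makes one simplifying assumption: once no delimiter is found within reach, the whole remainder is wrapped on word boundaries.

```python
import textwrap

DELIMITERS = '.,?!'  # hypothetical constant: the delimiters named in the question


def partition_with_fallback(s, n, delimiters=DELIMITERS):
    """Split s into pieces of at most n characters (illustrative helper)."""
    pieces = []
    while len(s) > n:
        # Last delimiter among the first n characters, or -1 if there is none.
        idx = max(s.rfind(c, 0, n) for c in delimiters)
        if idx == -1:
            # Assumption: with no delimiter in reach, wrap the whole remainder.
            pieces.extend(textwrap.wrap(s, n))
            return pieces
        cut = idx + 1
        # Keep one trailing space with the piece only if it still fits in n.
        if cut < n and s[cut:cut + 1] == ' ':
            cut += 1
        pieces.append(s[:cut])
        s = s[cut:]
    if s:
        pieces.append(s)
    return pieces


result = partition_with_fallback(
    'This, is a long, long string. It is around(?) 72 characters! Pretty cool', 36)
print(result)
# ['This, is a long, long string. ', 'It is around(?) 72 characters! ', 'Pretty cool']
```

A loop is used instead of recursion so that very long transcripts do not run into Python's recursion limit; the splitting rule itself is the same.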