How to split a document into approximately 500-word chunks in Python using spaCy?

Question:

I’m trying to split up some long documents for use with an OpenAI back end, so they need to be broken into chunks of roughly 500 words/3000 characters or fewer. I’m using spaCy to break these chunks into sentences rather than tokens or characters, so that when the documents are queried by the AI later there is enough context for an answer, and the documents remain human-readable if fetched.

I’ve used spaCy’s sentencizer to split my text document into sentences, and I’m now trying to write these to a new list using a for/while loop, the idea being that once the first list reaches 3000 characters, the loop moves on and writes to the next list.

fullText = """ Multiple sentences of text go here. Here are some example sentences for testing the code with. 

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec feugiat nisl nibh, imperdiet tincidunt nunc pellentesque a. Suspendisse potenti. Praesent tristique nisi nec leo vestibulum, nec lacinia arcu finibus. Phasellus pulvinar lacus et felis viverra congue. Pellentesque bibendum et ipsum eget vestibulum. Aenean non nisl molestie, consequat purus in, bibendum dolor. Donec volutpat, leo a ultricies lobortis, velit libero porttitor massa, fermentum maximus sem arcu in magna. In suscipit posuere scelerisque. Aenean id egestas dolor. Sed molestie sit amet dolor vitae gravida. Donec nec odio nisl. Nam bibendum consequat eros et finibus. Suspendisse eget lacus sed sapien finibus molestie a eu nisi. Vivamus rutrum mi sit amet urna finibus vestibulum vel ac ante. Maecenas dapibus velit at ex rhoncus gravida. Phasellus non arcu vitae orci mattis finibus sit amet consectetur dolor.

Morbi a commodo lorem, eu aliquam odio. Mauris consectetur eros lacus, in malesuada risus eleifend et. Duis non lobortis mauris. Mauris mattis eu sapien et condimentum. Duis eu est sodales nunc venenatis maximus id et turpis. Phasellus sodales ac neque et interdum. Donec consequat augue eros, sed sodales sem malesuada eget. Pellentesque vitae tincidunt lorem. Nulla laoreet arcu eu varius pharetra. Duis tincidunt enim libero, non consectetur libero aliquet cursus. Mauris in posuere urna. Vivamus nec elementum mauris. Aliquam tempor rhoncus suscipit.

Vivamus justo mauris, euismod et dui ut, hendrerit semper elit. Vivamus interdum erat non dui ultrices sollicitudin. Interdum et malesuada fames ac ante ipsum primis in faucibus. Fusce arcu nunc, mattis in eleifend et, facilisis vel ligula. Quisque hendrerit sodales finibus. Proin ullamcorper sapien vitae diam lacinia, et feugiat lacus varius. Fusce massa justo, suscipit eget blandit at, tempor non mauris. Nulla ac felis laoreet, aliquam lacus fermentum, accumsan ex. Donec pretium, risus non luctus aliquam, tortor nisi fringilla lacus, in aliquet leo orci ac libero. Integer cursus quam lectus, in ullamcorper justo rhoncus id. Phasellus malesuada lacinia augue, vitae aliquam magna laoreet at. Suspendisse non vehicula nisl. In suscipit sem felis, sed elementum metus porttitor quis. Aenean sit amet libero at dui pellentesque varius vel eu est. Quisque urna mauris, vestibulum et eros ut, lobortis pretium.
"""
import spacy
from spacy.lang.en import English

empty_string = []
empty_string_2 = []
def sentenceTokenize(fullText):
    nlp = English()  # just the language with no pipeline
    nlp.add_pipe("sentencizer")
    doc = nlp(fullText)
    for sent in doc.sents:
        while len(empty_string) <= 3000:
            empty_string.append(sent.text)
        else:
            empty_string_2.append(sent.text)
    return empty_string, empty_string_2

I’m fairly sure I’ve gotten something slightly awry here, because my current outcome from this function is 3000 characters’ worth of the first sentence written to the first list, and then the entire document written to the second list.

I’d be very grateful for any guidance as to where I’m going wrong, or indeed if I’m just approaching this from entirely the wrong direction in the first place!

Asked By: howlieT


Answers:

The behavior you described

3000 characters’ worth of the first sentence written to the first list, and then the entire document written to the second list.

is exactly what you coded here:

while len(empty_string) <= 3000:
    empty_string.append(sent.text)
else:
    empty_string_2.append(sent.text)

What this actually does: len(empty_string) counts the items in the list, not characters, so the while loop appends the first sentence over and over until the list holds 3001 copies of it. A while loop’s else clause runs every time the loop exits without a break, which here is on every pass of the for loop, so every sentence (the first one included) also gets appended to the second list, i.e. the entire document.
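
A quick way to see the item-count versus character-count difference (a standalone snippet just for illustration, not part of the fix):

chunk = []
chunk.append("a fairly long sentence")
print(len(chunk))                   # 1  -- number of items in the list
print(sum(len(s) for s in chunk))   # 22 -- number of characters across them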

From what I understand you want something more like this:

current_chunk = []   # sentences in the chunk being built
chunk_chars = 0      # running character count for that chunk
for sent in doc.sents:
    # close off the chunk just before it would pass 3000 characters
    if current_chunk and chunk_chars + len(sent.text) > 3000:
        string_list.append(" ".join(current_chunk))
        current_chunk = []   # rebind instead of clear(): clear() would empty the appended list too
        chunk_chars = 0
    current_chunk.append(sent.text)
    chunk_chars += len(sent.text)
if current_chunk:   # keep the final partial chunk
    string_list.append(" ".join(current_chunk))

This keeps a running character count, closes off a chunk just before it would go past 3000 characters, and starts a fresh list for the next one. Note that it rebinds current_chunk instead of calling clear(): string_list holds a reference to that same list, so clear() would also empty the chunk you just saved.
Then you can return string_list and have all your ~3000-character chunks in one variable, to access however you like.

I also renamed the variables to make their usage clearer in this case 🙂
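
Putting it all together, here is a minimal end-to-end sketch. The function name chunk_text and the max_chars parameter are just placeholder choices, and joining sentences back together with a single space is an assumption about your input:

from spacy.lang.en import English

def chunk_text(full_text, max_chars=3000):
    nlp = English()              # blank English pipeline, no trained components
    nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection
    doc = nlp(full_text)
    chunks, current_chunk, chunk_chars = [], [], 0
    for sent in doc.sents:
        # start a new chunk before this sentence would push us past the limit
        if current_chunk and chunk_chars + len(sent.text) > max_chars:
            chunks.append(" ".join(current_chunk))
            current_chunk, chunk_chars = [], 0
        current_chunk.append(sent.text)
        chunk_chars += len(sent.text)
    if current_chunk:            # keep the final partial chunk
        chunks.append(" ".join(current_chunk))
    return chunks

chunks = chunk_text(fullText)
print([len(c) for c in chunks])  # each entry should be roughly <= 3000

One caveat: a single sentence longer than max_chars will still come through as an oversized chunk, so you may want to hard-split any such sentence separately.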

Answered By: Sn3nS