How to split input text into equal-sized token chunks (not character length) and then concatenate the summarization results for Hugging Face transformers

Question:

I am using the methodology below to summarize texts longer than the model's 1024-token limit.

The current method splits the text in half. I took this from another user's post and modified it slightly.

What I want to do, instead of splitting in half, is split the whole text into equal 1024-token chunks, get a summarization of each of them, and then at the end concatenate them in the correct order and write the result to a file. How can I do this tokenization and get the correct output?

Splitting the text with split(" ") doesn't work the same way as tokenization; it produces a different count.
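
For example, a quick check with the model's tokenizer (I am assuming the facebook/bart-large-cnn tokenizer, the same checkpoint the pipeline below loads) shows how the two counts differ:

from transformers import AutoTokenizer

# compare the whitespace word count with the model's subword token count
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
sample = "Summarization models count subword tokens, not whitespace-separated words."
print(len(sample.split()))                                      # number of words
print(len(tokenizer.encode(sample, add_special_tokens=False)))  # number of tokens, usually larger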

import logging
from transformers import pipeline

# read the full article to summarize
with open("TextFile1.txt", "r") as f:
    ARTICLE = f.read()

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

counter = 1

def summarize_text(text: str, max_len: int) -> str:
    global counter
    try:
        #logging.warning("max_len " + str(max_len))
        summary = summarizer(text, min_length=30, do_sample=False)
        # save each piece that was actually summarized, numbered in order
        with open('parsed_' + str(counter) + '.txt', 'w') as f:
            f.write(text)
        counter += 1
        return summary[0]["summary_text"]
    except IndexError:
        logging.warning("Sequence length too large for model, cutting text in half and calling again")
        # fall back to summarizing each half separately and joining the results
        half = len(text) // 2
        return summarize_text(text=text[:half], max_len=max_len) + " " + summarize_text(text=text[half:], max_len=max_len)

gg = summarize_text(ARTICLE, 1024)

with open('summarized.txt', 'w') as f:
    f.write(gg)
Asked By: MonsterMMORPG

Answers:

I like splitting text using nltk. You can also do it with spacy and the quality is better, but it takes a bit longer. nltk and spacy let you cut the text into sentences, which is better because the pieces are more coherent. You want to cut to fewer than 1024 tokens to be on the safe side; 512 should be better, and it's what the original BERT uses, so it shouldn't be too bad. You just summarize the summaries at the end. Here's an example:

import nltk
from nltk.tokenize import sent_tokenize

# nltk.download('punkt')  # run once if the punkt sentence-tokenizer data is not installed yet

def split_in_segments(text):
    tokens = 0
    mystring = list()
    segments = []
    for sent in sent_tokenize(text):
        # count whitespace-separated words as a rough proxy for model tokens
        newtokens = len(sent.split())
        tokens += newtokens
        mystring.append(str(sent).strip())
        if tokens > 512:
            segments.append(" ".join(mystring))
            mystring = []
            tokens = 0
    if mystring:
        segments.append(" ".join(mystring))
    return segments

def summarize_4_plotly(text):
    # summarizer is the "summarization" pipeline defined in the question
    segments = split_in_segments(text)
    summarylist = summarizer(segments, max_length=100, min_length=30, do_sample=False)
    # summarize the concatenated per-segment summaries into one final summary
    summary = summarizer(" ".join([s['summary_text'] for s in summarylist]), max_length=120, min_length=30, do_sample=False)
    return summary

summarize_4_plotly(text)  # text is the full article to summarize
Answered By: Wilmer E. Henao
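
If you want chunks based on the model's actual token count, as the question asks, rather than word counts, a minimal sketch (not part of the answer above; it reuses the ARTICLE variable and summarizer pipeline from the question and assumes the matching facebook/bart-large-cnn tokenizer) could look like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

def split_by_tokens(text, chunk_size=900):
    # encode once, slice the token ids into fixed-size windows, then decode
    # each window back to text; 900 leaves headroom below the 1024-token
    # limit for the special tokens the pipeline adds
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    return [tokenizer.decode(chunk, skip_special_tokens=True) for chunk in chunks]

segments = split_by_tokens(ARTICLE)
summaries = summarizer(segments, max_length=100, min_length=30, do_sample=False)

# chunks are summarized and joined in their original order
with open('summarized.txt', 'w') as f:
    f.write(" ".join(s["summary_text"] for s in summaries))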