python-pptx: Getting odd splits when extracting text from slides

Question:

I’m using the “Extract all text from slides in presentation” example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html to extract text from some PowerPoint slides.

from pptx import Presentation

prs = Presentation(path_to_presentation)

# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []

for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                text_runs.append(run.text)

It seems to be working fine, except that I’m getting odd splits in some of the text_runs. Things that I’d expect to be grouped together are being split up, with no obvious pattern that I can detect. For example, sometimes the slide title is split into two parts, and sometimes it isn’t.

I’ve discovered that I can eliminate the odd splits by retyping the text on the slide but that doesn’t scale.

I can’t, or at least don’t want to, merge the two parts of the split text together, because sometimes the second part of the text has been merged with a different text run. For example, on the slide deck’s title slide, the title will be split in two, with the second part of the title merged with the title slide’s subtitle text.

Any suggestions on how to eliminate the odd / unwanted splits? Or is this behavior more-or-less to be expected when reading text from a PowerPoint?

Asked By: BobInBaltimore


Answers:

I’d say it’s definitely to be expected. PowerPoint will split runs whenever it pleases, perhaps to highlight a misspelled word or just if you pause in typing or go in to fix a typo or something.

The only thing that can be said for sure about a run is that all the characters it contains share the same character formatting. There’s no guarantee, for example, that the run is what one might call “greedy”, including as many characters as possible that do share the same character formatting.
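If the run boundaries don’t matter to you and you only need the text, one way to sidestep the splits is to collect text per paragraph instead of per run. This is a minimal sketch under that assumption; paragraph_texts is a hypothetical helper name, not part of python-pptx, and it only uses the same attributes as the quickstart example:

```python
def paragraph_texts(prs):
    """Return one string per non-empty paragraph, joining that paragraph's runs."""
    texts = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for paragraph in shape.text_frame.paragraphs:
                # Joining the runs hides wherever PowerPoint chose to split them.
                text = "".join(run.text for run in paragraph.runs)
                if text:
                    texts.append(text)
    return texts
```

You’d call it as paragraph_texts(Presentation(path_to_presentation)). A title that was split into two runs comes back as a single string, because both runs live in the same paragraph.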

If you want to reconstruct that “greedy” coherence in the runs, it will be up to you, perhaps with an algorithm like this:

last_run = None
for run in paragraph.runs:
    if last_run is None:
        last_run = run
        continue
    if has_same_formatting(run, last_run):
        last_run = combine_runs(last_run, run)
        continue
    last_run = run

That leaves you to implement has_same_formatting() and combine_runs(). There’s a certain advantage here, because runs can contain differences you don’t care about, like a dirty attribute or whatever, and you can pick and choose which ones matter to you.

A start of an implementation of has_same_formatting() would be:

def has_same_formatting(run, run_2):
    font, font_2 = run.font, run_2.font
    if font.bold != font_2.bold:
        return False
    if font.italic != font_2.italic:
        return False
    # ---same with color, size, type-face, whatever you want---
    return True
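A slightly fuller version might loop over a list of font attributes instead of spelling out each comparison. This is only a sketch: the attribute list is my assumption about what matters, and it deliberately leaves out color and hyperlinks, which you would need to compare separately if they matter for your slides.

```python
# Font attributes to compare; trim or extend this tuple to suit your needs.
FONT_ATTRS = ("bold", "italic", "underline", "size", "name")


def has_same_formatting(run, run_2, attrs=FONT_ATTRS):
    """Return True if both runs agree on every font attribute listed in attrs."""
    font, font_2 = run.font, run_2.font
    return all(getattr(font, attr) == getattr(font_2, attr) for attr in attrs)
```

Note that an attribute that is None (inherited) on one run and explicitly set on the other counts as a difference here, even if the rendered result looks the same.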

combine_runs(base, suffix) would look something like this:

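def combine_runs(base, suffix):
    # Append the suffix text to the base run, then remove the suffix run's XML element
    base.text = base.text + suffix.text
    r_to_remove = suffix._r
    r_to_remove.getparent().remove(r_to_remove)
    # Return the merged run so that `last_run = combine_runs(...)` keeps a valid run
    return base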
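If you only want to read the text and would rather not modify the presentation’s XML, the same idea can be applied to plain strings. A rough sketch, where merged_run_texts is a hypothetical name and same_formatting is whatever comparison function you pass in (for example has_same_formatting):

```python
def merged_run_texts(paragraph, same_formatting):
    """Return the paragraph's text as a list of strings, one per formatting group.

    Consecutive runs for which same_formatting(previous, current) is true are
    joined into one string; the presentation itself is left untouched.
    """
    merged = []
    last_run = None
    for run in paragraph.runs:
        if last_run is not None and same_formatting(last_run, run):
            merged[-1] += run.text
        else:
            merged.append(run.text)
        last_run = run
    return merged
```

Because nothing is removed from the paragraph, this is safe to run on a file you also intend to save later.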
Answered By: scanny

@TheGreat – Here’s my final code block. I’m not sure how thoroughly tested it is. As I mention elsewhere, IIRC something else came up and I never really got back to this "in my spare time" project.

import sys

try:
    import pptx
except ImportError:
    print("Error when trying to import the pptx module to bobs_useful_functions.py.")
    print("Please install a current version of the python-pptx library.")
    sys.exit(1)
try:
    import pptx.exc
except ImportError:
    print("Error when trying to import the pptx.exc module to bobs_useful_functions.py.")
    print("Please install a current version of the python-pptx library.")
    sys.exit(1)

from pptx import Presentation
from pptx.exc import PackageNotFoundError

def read_text_from_powerpoint(path_to_presentation, only_first_slide=True):
    # Adapted from an example at https://python-pptx.readthedocs.io/en/latest/user/quickstart.html
    # and the StackOverflow question "python-pptx Extract text from slide titles".
    #
    # Note: Using the "runs" method described in the python-pptx QuickStart example occasionally
    #       resulted in breaks in the text read from the slide, for no obvious reason.

    try:
        prs = Presentation(path_to_presentation)

        # text_runs will be populated with a list of strings: one per shape on the
        # first slide, and one per text run on any later slides
        text_runs = []

        slide_counter = 0
        for slide in prs.slides:
            slide_counter += 1
            if slide_counter == 1:
                # Read whole shapes on the first slide to avoid the run-level splits
                for shape in slide.shapes:
                    if not shape.has_text_frame:
                        continue
                    text_runs.append(shape.text)
            else:
                if only_first_slide:
                    break
                else:
                    for shape in slide.shapes:
                        if not shape.has_text_frame:
                            continue
                        for paragraph in shape.text_frame.paragraphs:
                            for run in paragraph.runs:
                                text_runs.append(run.text)

        if only_first_slide:
            # This assumes the first string in "text_runs" is the title, which in turn assumes
            # the first slide HAS a title.
            title = ''.join(text_runs[:1])  # Basically, convert from a one-element list to a string
            # Join with a space between the elements of 'text_runs'.  For the first slide, this would
            # be what's typically thought of as the slide subtitle, plus any notes or comments also on
            # the first slide.
            subtitle = ' '.join(text_runs[1:])
            output = [title, subtitle]
        else:
            output = text_runs

    except PackageNotFoundError:
        print("\nWARNING: Unable to open the presentation:\n    %s" % path_to_presentation)
        print("The presentation may be password protected.")
        # Note that this output text is treated as a flag value.
        # For that reason, be EXTREMELY careful about changing this output text.
        output = ['PackageNotFoundError - Possible password-protected PowerPoint']

    return output
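One caveat with the code above is that it assumes the first text shape on the slide is the title. If that turns out not to hold for your decks, a possible alternative is to ask the slide for its title placeholder directly. This is an untested sketch, not part of the function above; slide_title_text is a hypothetical helper name, and it relies on slide.shapes.title being None when the slide has no title placeholder:

```python
def slide_title_text(slide):
    """Return the text of the slide's title placeholder, or None if it has none."""
    title_shape = slide.shapes.title
    if title_shape is None:
        return None
    # text_frame.text returns the text of the whole frame, regardless of run splits
    return title_shape.text_frame.text
```

For the first slide you’d call it as slide_title_text(prs.slides[0]).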
Answered By: BobInBaltimore