Extracting PowerPoint background images using python-pptx

Question:

I have several PowerPoint files that I need to go through programmatically and extract images from. The images then need to be converted into OpenCV format for later processing/analysis. I have done this successfully for pictures in the .pptx, using:

pic_list = []
for slide in presentation.slides:
    for shape in slide.shapes:
        if 'Picture' in shape.name:
            pic_list.append(shape)

for extraction, and:

img = cv2.imdecode(np.frombuffer(page[i].image.blob, np.uint8), cv2.IMREAD_COLOR)

for python-pptx Picture to OpenCV conversion. However, I am having a lot of trouble extracting and manipulating the backgrounds in a similar fashion.

slide.background

is sufficient to extract a "_Background" object, but I have not found a good way to convert it into an OpenCV object the way I do with Pictures. Does anyone know how to do this? I am using python-pptx for extraction, but am not averse to other packages if it’s not possible with that package.

Asked By: tq343


Answers:

After a fair bit of work I found the answer: you don’t. As far as I can tell, there is no way to directly extract the backgrounds with either python-pptx or Aspose. A .pptx file, as it turns out, is a ZIP archive that can be unzipped with 7-Zip, and it keeps its backgrounds disassembled: pictures live in ppt/media, while text and formatting live in ppt/slideLayouts and ppt/slideMasters. The pieces are only put together by the PowerPoint renderer. This means that to extract the backgrounds as displayed, you basically need to run PowerPoint and capture images of the slides after removing text, pictures, etc. from the foreground.
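If all you need is the raw picture file behind a picture-filled background (not the composed background as rendered), a rough sketch along these lines might help. It is not part of what I originally did: it reads the archive with the standard library only, looks for a `<p:bg>` element with a picture fill in each slide, layout and master, and follows the relationship file to the image under ppt/media. The function name `background_image_blobs` is made up for illustration, and the sketch ignores solid, gradient and theme-referenced backgrounds.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}
R_EMBED = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"
REL_TAG = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"
PART_DIRS = ("ppt/slides/", "ppt/slideLayouts/", "ppt/slideMasters/")


def background_image_blobs(pptx_file):
    """Hypothetical helper: map each part that has a picture-filled
    background to a (media path, raw image bytes) tuple."""
    found = {}
    with zipfile.ZipFile(pptx_file) as zf:
        names = set(zf.namelist())
        for part in sorted(names):
            if not part.startswith(PART_DIRS) or not part.endswith(".xml"):
                continue
            root = ET.fromstring(zf.read(part))
            # Only picture fills are handled: <p:bg><p:bgPr><a:blipFill><a:blip>
            blip = root.find("p:cSld/p:bg/p:bgPr/a:blipFill/a:blip", NS)
            if blip is None or R_EMBED not in blip.attrib:
                continue
            # Relationships for ppt/x/y.xml are stored in ppt/x/_rels/y.xml.rels
            part_dir = posixpath.dirname(part)
            rels_name = posixpath.join(
                part_dir, "_rels", posixpath.basename(part) + ".rels"
            )
            if rels_name not in names:
                continue
            for rel in ET.fromstring(zf.read(rels_name)).iter(REL_TAG):
                if rel.get("Id") != blip.get(R_EMBED):
                    continue
                # Targets are relative to the part, e.g. ../media/image1.png
                target = posixpath.normpath(
                    posixpath.join(part_dir, rel.get("Target"))
                )
                if target in names:
                    found[part] = (target, zf.read(target))
    return found
```

The bytes in each tuple are in the same form as `image.blob` in the question, so they should be usable with the same `cv2.imdecode(np.frombuffer(...))` call. Keep in mind that a slide without its own background inherits one from its layout or master, so the image you want may be listed under one of those parts instead.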

I did not need to do this, as I just needed to extract text from the backgrounds. This can be done by reading the slideLayouts and slideMasters XML files with BeautifulSoup and looking for the <a:t> tag. The code to do this is pretty simple:

import zipfile
with zipfile.ZipFile(pptx_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_directory)

This will extract the .pptx into its component files.

import os
from glob import glob
layouts = glob(os.path.join(extraction_directory, 'ppt', 'slideLayouts', '*.xml'))
masters = glob(os.path.join(extraction_directory, 'ppt', 'slideMasters', '*.xml'))
files = layouts + masters

This gets you the paths for slide layouts/masters.
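If you would rather not unpack the archive to disk, the same paths can probably be listed straight from the .pptx. This is only a sketch using the standard library; the helper name `layout_and_master_parts` is made up for illustration.

```python
import fnmatch
import zipfile


def layout_and_master_parts(pptx_file):
    """Hypothetical helper: list the slide-layout and slide-master XML
    members of a .pptx without extracting anything."""
    patterns = ("ppt/slideLayouts/*.xml", "ppt/slideMasters/*.xml")
    with zipfile.ZipFile(pptx_file) as zf:
        return sorted(
            name
            for name in zf.namelist()
            # Relationship files end in .rels, so they do not match *.xml
            if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
        )
```

Member names inside a ZIP archive always use forward slashes, so this avoids platform-specific path separators entirely.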

from bs4 import BeautifulSoup
text_list = []
for file in files:
    with open(file, encoding='utf-8') as f:
        data = f.read()
    bs_data = BeautifulSoup(data, "xml")  # the "xml" parser requires lxml
    for a_t in bs_data.find_all('a:t'):
        if a_t.contents:  # skip empty <a:t/> elements
            text_list.append(str(a_t.contents[0]))

This will get you the actual text from the XMLs.
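For anyone who cannot install BeautifulSoup and lxml, the same text extraction can likely be done with the standard library alone. The sketch below swaps in xml.etree.ElementTree for BeautifulSoup and reads directly from the archive; the function name `layout_and_master_text` is made up, and `<a:t>` is matched by its full DrawingML namespace rather than by the `a:` prefix.

```python
import zipfile
import xml.etree.ElementTree as ET

# <a:t> written with its full namespace, as ElementTree expects
A_T = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"
PART_DIRS = ("ppt/slideLayouts/", "ppt/slideMasters/")


def layout_and_master_text(pptx_file):
    """Hypothetical helper: collect the text of every non-empty <a:t>
    element found in slide layouts and slide masters."""
    texts = []
    with zipfile.ZipFile(pptx_file) as zf:
        for name in sorted(zf.namelist()):
            # Skips the .rels files too, since they do not end in .xml
            if not name.startswith(PART_DIRS) or not name.endswith(".xml"):
                continue
            root = ET.fromstring(zf.read(name))
            for node in root.iter(A_T):
                if node.text:
                    texts.append(node.text)
    return texts
```

Calling `layout_and_master_text(pptx_path)` should give roughly the same list as `text_list` above. Expect it to include placeholder prompts such as the default "Click to edit" strings alongside any real background text, so some filtering may be needed.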

Hopefully this will be useful to someone else in the future.

Answered By: tq343