How to extract pictures pasted as enhanced metafiles from Word documents in Python?

Question:

I want to automatically extract images from a Word document. The images are Excel charts pasted as pictures (enhanced metafile) into the document.

After some quick research, I tried the following method:

import docx2txt as d2t


def extract_images_from_docx(path_to_file, images_folder, get_text=False):
    # docx2txt writes the images it extracts into images_folder and returns the document text
    text = d2t.process(path_to_file, images_folder)

    if get_text:
        return text

path_to_file = './Report.docx'
images_folder = './Img/'

extract_images_from_docx(path_to_file, images_folder, False)

However, this method does NOT work: no images are extracted. I am almost sure that this is due to the format of the pictures. Indeed, when I pasted a normal PNG image into a Word document, I was able to get it with the above code.
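
One way to check the format hypothesis is to look at which file types are actually stored inside the document. The sketch below assumes the pictures are stored under word/media/ in the .docx archive, which is where embedded images normally live; the function name count_media_extensions is hypothetical and only used for illustration.

```python
import os
import zipfile
from collections import Counter


def count_media_extensions(docx_path):
    """Count the files under word/media/ in a .docx, grouped by extension."""
    with zipfile.ZipFile(docx_path) as archive:
        return Counter(
            os.path.splitext(name)[1].lower()
            for name in archive.namelist()
            # Member names inside a zip archive always use forward slashes
            if name.startswith('word/media/') and not name.endswith('/')
        )
```

Calling count_media_extensions('./Report.docx') and printing the result would show whether the pasted charts ended up as .emf files rather than as .png or .jpg.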

I have also tried converting the document to PDF and extracting the images from there, with no result:

from docx2pdf import convert

convert('./Report.docx')
convert('./Report.docx', './Report.pdf')

import fitz  # PyMuPDF


def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.page_count):
        for image in doc.get_page_images(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.save(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf('./Report.pdf')
write_pixmaps_to_pngs(pixmaps)

So, does anyone know of a way to automatically extract Excel charts pasted as enhanced metafiles into a Word document?

Thank you in advance for your help!

Answers:

The crazy thing is that .docx files are secretly .zip files. I've been able to successfully extract images from a .docx using the zipfile module. The images should live in the word/media directory of the extracted archive. I don't know whether the enhanced metafiles live there too, but that's my best guess. Here's something to get you started:

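import os
import zipfile

# Placeholder: name of the document, without the .docx extension
input_docx = 'NAME_OF_DOCX'

# A .docx is a zip archive, so unpack it into a folder
with zipfile.ZipFile(f'{input_docx}.docx') as archive:
    archive.extractall('extracted_docx')

# Embedded images are expected under word/media
for file in os.listdir(os.path.join('extracted_docx', 'word', 'media')):
    if file.endswith('.emf'):
        # do something with the file
        pass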

(untested, but should work)
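
As a variation on the same idea, the .emf files can be read straight out of the archive without unpacking everything first. This is only a sketch: it assumes the pasted metafiles really are stored as .emf files under word/media/, which is a guess, and the function name extract_emf_images is hypothetical.

```python
import os
import zipfile


def extract_emf_images(docx_path, output_dir):
    """Copy every .emf file under word/media/ of a .docx into output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Member names inside a zip archive always use forward slashes
            if name.startswith('word/media/') and name.lower().endswith('.emf'):
                target = os.path.join(output_dir, os.path.basename(name))
                with open(target, 'wb') as out:
                    out.write(archive.read(name))
                saved.append(target)
    return saved
```

For the document from the question, this could be called as extract_emf_images('./Report.docx', './Img/'). The files are written unchanged, so they are still in EMF format afterwards.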

Answered By: Brock Brown