How to extract image from table in MS Word document with docx library?
Question:
I am working on a program that needs to extract two images from a MS Word document to use them in another document. I know where the images are located (first table in the document), but when I try to extract any information from the table (even just plain text), I get empty cells.
Here is the Word document that I want to extract the images from. I want to extract the ‘Rentel’ images from the first page (first table, row 0 and 1, column 2).
I have tried the following code:
from docxtpl import DocxTemplate
source_document = DocxTemplate("Source document.docx")
# It doesn't really matter which rows or columns I use for the cells, everything is empty
print(source_document.tables[0].cell(0,0).text)
Which just gives me empty lines…
I have read in this discussion and this one that the problem might be that the content is “contained in a wrapper element that Python Docx cannot read”. They suggest altering the source document, but I want to be able to select any document that was previously created with the same template as a source document (those documents contain the same problem, and I cannot change every document separately). So a Python-only solution is really the only way I can think of to solve the problem.
Since I only want those two specific images, extracting every image from the xml by unzipping the Word file doesn’t really suit my needs, unless I know which image name I need to extract from the unzipped Word file folders.
I really want this to work as it is part of my thesis (and I’m just an electromechanical engineer, so I don’t know that much about software).
[EDIT]: Here is the xml code for the first image (source_document.tables[0].cell(0,2)._tc.xml) and here it is for the second image (source_document.tables[0].cell(1,2)._tc.xml). I noticed, however, that taking (0,2) as row and column value gives me all the rows in column 2 within the first “visible” table. Cell (1,2) gives me all the rows in column 2 within the second “visible” table.
If the problem isn’t directly solvable with Python Docx, is it possible to search for the image name or ID within the XML code and then add the image to the other document with Python Docx using this ID/name?
Answers:
Well, the first thing that jumps out is that both of the cells (w:tc elements) you posted contain a nested table. This is perhaps unusual, but certainly a valid composition. Maybe they did that so they could include a caption in a cell below the image or something.
To access the nested table you’d have to do something like:
outer_cell = source_document.tables[0].cell(0,2)
nested_table = outer_cell.tables[0]
inner_cell_1 = nested_table.cell(0, 0)
print(inner_cell_1.text)
# ---etc....---
I’m not sure that solves your whole problem, but it strikes me that this is two or more questions in the end, the first being: “Why isn’t my table cell showing up?” and the second perhaps being “How do I get an image out of a table cell?” (once you’ve actually found the cell in question).
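If it is not obvious which cell holds the nested table, it can help to list every cell together with its nesting depth. The following is only a minimal sketch: iter_cells is a hypothetical helper name, and it relies on nothing more than the rows, cells and tables attributes that python-docx tables and cells expose. As far as I know, cell.text only covers the paragraphs directly inside a cell, which would explain why the outer cells print as empty. Merged cells may be yielded more than once.

```python
def iter_cells(table, depth=0):
    """Yield (depth, cell) for every cell in *table*, descending into nested tables.

    Hypothetical helper: works on any object that exposes .rows, row.cells
    and cell.tables, as python-docx tables and cells do.
    """
    for row in table.rows:
        for cell in row.cells:
            yield depth, cell
            # A cell can itself contain tables; visit them one level deeper
            for nested_table in cell.tables:
                yield from iter_cells(nested_table, depth + 1)


# Usage with python-docx (assumes a local file named "Source document.docx"):
# from docx import Document
# document = Document("Source document.docx")
# for depth, cell in iter_cells(document.tables[0]):
#     print("  " * depth + repr(cell.text))
```

The indentation of the printed output then shows at which level the text or image cell actually lives.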
For the people who have the same problem, this is the code that helped me solve it:
First I extract the nested cell from the table using the following method:
@staticmethod
def get_nested_cell(table, outer_row, outer_column, inner_row, inner_column):
    """
    Returns a cell of the table that is nested inside a cell of the outer *table*
    :argument
        table: [docx.Table] outer table from which to get the nested table
        outer_row: [int] row of the outer table in which the nested table is
        outer_column: [int] column of the outer table in which the nested table is
        inner_row: [int] row in the nested table from which to get the nested cell
        inner_column: [int] column in the nested table from which to get the nested cell
    :return
        inner_cell: [docx.Cell] nested cell
    """
    # Get the outer cell that contains the nested table
    outer_cell = table.cell(outer_row, outer_column)
    # Take the first table nested inside that cell
    nested_table = outer_cell.tables[0]
    inner_cell = nested_table.cell(inner_row, inner_column)
    return inner_cell
Using this cell, I can get the xml code and extract the image from that xml code. Note:
- I didn’t set the image width and height because I wanted the size to stay the same
- In the replace_logos_from_source method I know that the table where I want to get the logos from is ‘tables[0]’ and that the nested table is in outer_row and outer_column ‘0’, so I just filled these values into the get_nested_cell call without adding extra arguments to replace_logos_from_source
def replace_logos_from_source(self, source_document, target_document, inner_row, inner_column):
    """
    Copy the employer or client logo from the *source_document* to the *target_document*. Since the
    logos are placed in nested tables, the source and target cells at *inner_row* and
    *inner_column* are first extracted from the nested table.
    :argument
        source_document: [DocxTemplate] document from which to extract the image
        target_document: [DocxTemplate] document to which to add the extracted image
        inner_row: [int] row in the nested table from which to get the image
        inner_column: [int] column in the nested table from which to get the image
    :return
        Nothing
    """
    # Get the target and source cell. The logos are in 'tables[0]' and the nested
    # table is in outer_row and outer_column '0', so these values are hard-coded
    # instead of being passed as extra arguments.
    target_cell = self.get_nested_cell(target_document.tables[0], 0, 0, inner_row, inner_column)
    source_cell = self.get_nested_cell(source_document.tables[0], 0, 0, inner_row, inner_column)
    # Get the xml code of the inner cell
    inner_cell_xml = source_cell._tc.xml
    # Get the image from the xml code
    image_stream = self.get_image_from_xml(source_document, inner_cell_xml)
    # Add the image to the target cell
    paragraph = target_cell.paragraphs[0]
    if image_stream is not None:  # An image exists in the source cell
        run = paragraph.add_run()
        run.add_picture(image_stream)
    else:
        # No image found: copy the source cell text instead
        paragraph.add_run(source_cell.text)
# Imports needed by this method (place them at the top of the module)
from io import BytesIO
from xml.dom import minidom

@staticmethod
def get_image_from_xml(source_document, xml_code):
    """
    Returns the first image referenced in the *xml_code* as a stream
    :argument
        source_document: [DocxTemplate] document that contains the image
        xml_code: [string] xml code from which to extract the image
    :return
        image_stream: [BytesIO stream] the image that was found
        None if no image exists in the xml_code
    """
    # Parse the xml code for the blip (the element that references the picture)
    xml_parser = minidom.parseString(xml_code)
    items = xml_parser.getElementsByTagName('a:blip')
    # Check if an image exists
    if items:
        # Extract the rId (relationship ID) of the image
        rId = items[0].attributes['r:embed'].value
        # Look up the image part through the document part's relationships
        source_document_part = source_document.part
        image_part = source_document_part.related_parts[rId]
        image_bytes = image_part.blob
        # Wrap the image bytes in a BytesIO stream so they can be passed to add_picture()
        image_stream = BytesIO(image_bytes)
        return image_stream
    # If no image exists
    else:
        return None
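The lookup above matches the literal prefix 'a:blip' and only uses the first hit. As an alternative, here is a minimal sketch that matches on the namespace URIs instead and returns every rId in the cell. It uses only the standard library; find_image_rids is a hypothetical name, the sample xml is a simplified made-up fragment, and the sketch assumes the xml string declares its namespaces on the root element.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by DrawingML pictures and by relationship attributes
NS = {
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}


def find_image_rids(xml_code):
    """Return the r:embed relationship IDs of all a:blip elements, in document order."""
    root = ET.fromstring(xml_code)
    blip_tag = "{%s}blip" % NS["a"]
    embed_attr = "{%s}embed" % NS["r"]
    rids = []
    for blip in root.iter(blip_tag):
        rid = blip.get(embed_attr)
        if rid is not None:  # skip linked (non-embedded) pictures
            rids.append(rid)
    return rids


# Simplified, made-up cell xml with two embedded pictures
SAMPLE_XML = (
    '<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" '
    'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    '<w:p><w:r><a:blip r:embed="rId7"/></w:r></w:p>'
    '<w:p><w:r><a:blip r:embed="rId9"/></w:r></w:p>'
    '</w:tc>'
)
print(find_image_rids(SAMPLE_XML))  # ['rId7', 'rId9']
```

Each returned rId can then be looked up in related_parts in the same way as in get_image_from_xml.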
To call replace_logos_from_source, I used:
# Replace the employer and client logos
self.replace_logos_from_source(self.source_document, self.template_doc, 0, 2) # Employer logo
self.replace_logos_from_source(self.source_document, self.template_doc, 1, 2) # Client logo
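The question also asked how to know which image file inside the unzipped document belongs to an ID. A rough sketch of that lookup with only the standard library is shown here. It assumes the main document part is word/document.xml, so that its relationships are stored in word/_rels/document.xml.rels (typical for Word files, but not guaranteed). image_targets is a hypothetical name and the archive in the demo is a made-up stand-in for a real .docx file.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
RELS_PATH = "word/_rels/document.xml.rels"  # assumption: main part is word/document.xml


def image_targets(docx_file):
    """Map relationship IDs to image paths inside the archive."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read(RELS_PATH))
    targets = {}
    for rel in root.iter("{%s}Relationship" % RELS_NS):
        if rel.get("Type") != IMAGE_TYPE or rel.get("TargetMode") == "External":
            continue
        # Relative targets are resolved against the "word/" folder
        path = posixpath.normpath(posixpath.join("word", rel.get("Target")))
        targets[rel.get("Id")] = path.lstrip("/")
    return targets


# Demo with a made-up in-memory archive instead of a real document
RELS_SAMPLE = (
    '<Relationships xmlns="%s">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/styles" Target="styles.xml"/>'
    '<Relationship Id="rId7" Type="%s" Target="media/image1.png"/>'
    '</Relationships>' % (RELS_NS, IMAGE_TYPE)
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(RELS_PATH, RELS_SAMPLE)
    archive.writestr("word/media/image1.png", b"fake image bytes")
buffer.seek(0)
print(image_targets(buffer))  # {'rId7': 'word/media/image1.png'}
```

Combined with the rId found in the cell xml, this gives the file name to pull out of the archive, which may be enough when only the raw image file is needed.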