Read From PowerPoint Table in Python?
Question:
I am using the python-pptx module to automatically update values in a PowerPoint file. I am able to extract all the text in the file using the code below:
from pptx import Presentation

prs = Presentation(path_to_presentation)

# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []

for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                text_runs.append(run.text)
This code extracts all the text in the file except text that is inside a PowerPoint table, and I would like to update some of those values. I tried to adapt the code from this question: Reading text values in a PowerPoint table using pptx? but could not get it to work. Any ideas? Thanks.
Answers:
Your code will miss more text than just tables; it won’t see text in shapes that are part of groups, for example.
For tables, you’ll need to do a couple of things:
Test the shape to see if the shape’s .HasTable property is true (in python-pptx the property is spelled has_table). If so, you can work with the shape’s .Table object (table in python-pptx) to extract the text. Conceptually, and very aircode:
For r = 1 to tbl.rows.count
    For c = 1 to tbl.columns.count
        tbl.cell(r,c).Shape.Textframe.Text ' is what you're after
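For python-pptx specifically, the same idea might be sketched as follows. This is a minimal sketch, not the library's own recipe: the helper names iter_shapes, iter_shape_text and extract_all_text are hypothetical, and the group test assumes that only group shapes expose a .shapes attribute (comparing shape.shape_type against MSO_SHAPE_TYPE.GROUP is the more explicit alternative).

```python
def iter_shapes(shapes):
    """Yield every shape, descending into group shapes recursively."""
    for shape in shapes:
        # Assumption: only group shapes carry a nested .shapes collection.
        children = getattr(shape, "shapes", None)
        if children is not None:
            yield from iter_shapes(children)
        else:
            yield shape


def iter_shape_text(shape):
    """Yield the text of one shape: table cells if it holds a table, else text-frame runs."""
    if getattr(shape, "has_table", False):
        for row in shape.table.rows:
            for cell in row.cells:
                yield cell.text
    elif getattr(shape, "has_text_frame", False):
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                yield run.text


def extract_all_text(path):
    """Collect text from plain shapes, grouped shapes and tables in a .pptx file."""
    from pptx import Presentation  # imported lazily; requires python-pptx

    texts = []
    for slide in Presentation(path).slides:
        for shape in iter_shapes(slide.shapes):
            texts.extend(iter_shape_text(shape))
    return texts
```

Calling extract_all_text with the path to a presentation would return one flat list of strings; shapes that hold neither a table nor a text frame (pictures, for example) contribute nothing.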
This works for me:
def access_table():
    slide = prs.slides[0]  # first slide
    table = slide.shapes[2].table  # the shape index may be anything from 0..n
    for r in table.rows:
        s = ""
        for c in r.cells:
            s += c.text_frame.text + " | "
            # to write:
            # c.text_frame.text = "example"
        print(s)
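Since the goal in the question is to update values, the write direction can be sketched as well. The helper name update_table_cells and the replacements mapping are hypothetical. The sketch relies on cell.text being assignable in python-pptx; assigning to it replaces the whole cell content, so character formatting applied to individual runs inside the cell may not survive.

```python
def update_table_cells(table, replacements):
    """Replace the text of cells whose stripped text is a key in `replacements`.

    `table` is expected to be a python-pptx table (shape.table).
    Returns the number of cells that were changed.
    """
    changed = 0
    for row in table.rows:
        for cell in row.cells:
            key = cell.text.strip()
            if key in replacements:
                cell.text = replacements[key]  # overwrites the whole cell content
                changed += 1
    return changed
```

After the update, calling prs.save() with an output file name of your choice writes the modified presentation to disk.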
How to Extract All of the Text out of Tables Inside of a Slide-show Presentation
The following code extracts text from tables in a slide-show presentation. Text in the presentation outside of tables is omitted, but you can modify my code to capture text from non-table objects as well.
from pptx import Presentation
def get_tables_from_presentation(pres):
    r"""
    The input parameter `pres` should receive
    an object returned by `pptx.Presentation()`

    EXAMPLE:
    ```
    import pptx
    p = r"C:\Users\user\Desktop\power_point_pres.pptx"
    pres = pptx.Presentation(p)
    tables = get_tables_from_presentation(pres)
    ```
    """
    tables = list()
    for slide in pres.slides:
        for shp in slide.shapes:
            if shp.has_table:
                tables.append(shp.table)
    return tables
def iter_to_nonempty_table_cells(tbl):
    """
    :param tbl: 'pptx.table.Table'
         input table is NOT modified
    :return: iterator over the stripped text of each non-empty cell
    """
    for ridx in range(len(tbl.rows)):
        for cidx in range(len(tbl.columns)):
            cell = tbl.cell(ridx, cidx)
            txt = cell.text.strip()
            # keep every cell that still has text after stripping whitespace
            if len(txt) > 0:
                yield txt
# establish read path (raw string, so the backslashes are not treated as escapes)
in_file_path = r"C:\Users\user\Desktop\power_point_pres.pptx"

# Open slide-show presentation
pres = Presentation(in_file_path)

# extract tables from slide-show presentation
tables = get_tables_from_presentation(pres)

for tbl in tables:
    it = iter_to_nonempty_table_cells(tbl)
    # separate the cell texts so adjacent cells do not run together
    print(" | ".join(it))
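If the row and column structure should be preserved rather than flattened into one string, a table can be turned into a list of lists. This is only a sketch: the helper names table_to_rows and rows_to_text are hypothetical, and the code uses nothing beyond table.rows, row.cells and cell.text.

```python
def table_to_rows(table):
    """Return the table as a list of rows, each a list of stripped cell strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def rows_to_text(rows, sep=" | "):
    """Format the nested list as one line of text per table row."""
    return "\n".join(sep.join(row) for row in rows)
```

Unlike the generator above, this keeps empty cells, so every row has the same number of entries and positions still line up with the table's columns.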
A Note About One of the Other Answers to This Question
Someone else posted a semi-useful answer to this question written in pseudo-code. They wrote the following:
For r = 1 to tbl.rows.count
    For c = 1 to tbl.columns.count
        tbl.cell(r,c).Shape.Textframe.Text
The problem is that this is not Python.
In Python, writing For r = 1 to 10 is a syntax error.
Instead, we would write something like one of the following:
for r in range(1, 11):
    print(r)

from itertools import count, takewhile

for r in takewhile(lambda k: k <= 10, count(1)):
    print(r)
Additionally, the row indices start at r = 0, not r = 1. The upper-left corner of the table is tbl.cell(0,0), not tbl.cell(1,1).
There is no such thing as .count for the rows attribute or the columns attribute, so For r = 1 to tbl.rows.count cannot work as written. To get the number of rows or columns, use len(tbl.rows) and len(tbl.columns).
tbl.cell(r,c).Shape won’t work either, because objects instantiated from the class pptx.table._Cell have no attribute named Shape.
cell objects have the following attributes:
fill
is_merge_origin
is_spanned
margin_bottom
margin_left
margin_right
margin_top
merge
part
span_height
span_width
split
text
text_frame
vertical_anchor
A fix is shown below:
# ----------------------------------------
# BEGIN SYNTACTICALLY INCORRECT CODE
# ----------------------------------------
# For r = 1 to tbl.rows.count
#     For c = 1 to tbl.columns.count
#         tbl.cell(r,c).Shape.Textframe.Text
# ----------------------------------------
# END SYNTACTICALLY INCORRECT CODE
# ----------------------------------------
# BEGIN SYNTACTICALLY CORRECT CODE
# ----------------------------------------
for r in range(len(tbl.rows)):
    for c in range(len(tbl.columns)):
        print(tbl.cell(r, c).text)
# ----------------------------------------
# END SYNTACTICALLY CORRECT CODE
# ----------------------------------------
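When the row and column numbers themselves are needed, for example to report where a value was found, the zero-based indices can be carried along. A sketch with the hypothetical helper name find_cells, assuming that len() works on tbl.rows and tbl.columns as it does in current python-pptx releases:

```python
def find_cells(tbl, wanted):
    """Return zero-based (row, column) positions of cells whose stripped text equals `wanted`."""
    hits = []
    for r in range(len(tbl.rows)):
        for c in range(len(tbl.columns)):
            if tbl.cell(r, c).text.strip() == wanted:
                hits.append((r, c))
    return hits
```

A hit in the upper-left corner is reported as (0, 0), which matches the indexing that tbl.cell() expects.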
A Note About your Original Code
The continue keyword
In your original source code, you have the following for-loop:
for shape in slide.shapes:
    if not shape.has_text_frame:
        continue
Taken on its own, that fragment does not do anything. The continue keyword means "skip the rest of the loop body and move on to the next iteration". If no code follows the continue before the end of the loop body, the loop would have moved on anyway, so the continue is redundant. In your full code, however, the loop over shape.text_frame.paragraphs follows the continue inside the same loop body, so there the continue does real work: it skips every shape that has no text frame. That is also why the table text is never reached: the shape that holds a table has no text frame of its own.
To understand more about continue, consider the following example:
for k in [1, 2, 3, 4, 5]:
    print("For k ==", k, "we have k % 2 ==", k % 2)
    if not k % 2 == 0:
        continue
    print("For k ==", k, "we got past the `continue`")
The output is:
For k == 1 we have k % 2 == 1
For k == 2 we have k % 2 == 0
For k == 2 we got past the `continue`
For k == 3 we have k % 2 == 1
For k == 4 we have k % 2 == 0
For k == 4 we got past the `continue`
For k == 5 we have k % 2 == 1
The following three pieces of code all print the exact same messages, regardless of the use of the continue keyword:
for k in [1, 2, 3, 4, 5]:
    print(k)

for k in [1, 2, 3, 4, 5]:
    print(k)
    continue

for k in [1, 2, 3, 4, 5]:
    print(k)
    if float(k)//1 % 2 == 0:
        continue
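By contrast, continue changes the result as soon as code follows it in the loop body. The sketch below uses two hypothetical functions to show that a continue guard is equivalent to nesting the rest of the body under the opposite condition:

```python
def evens_with_continue(values):
    """Collect even numbers, using `continue` as a guard clause."""
    result = []
    for k in values:
        if k % 2 != 0:
            continue  # skip the append below for odd numbers
        result.append(k)
    return result


def evens_with_if(values):
    """Same result, with the append nested under the opposite condition."""
    result = []
    for k in values:
        if k % 2 == 0:
            result.append(k)
    return result


print(evens_with_continue([1, 2, 3, 4, 5]))  # [2, 4]
print(evens_with_if([1, 2, 3, 4, 5]))        # [2, 4]
```

The guard-clause form keeps the main work of the loop at one indentation level, which is the pattern the question's code uses to skip shapes without a text frame.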