Read From PowerPoint Table in Python?

Question

I am using the python-pptx module to automatically update values in a PowerPoint file. I can extract all the text in the file using the code below:

from pptx import Presentation
prs = Presentation(path_to_presentation)
# text_runs will be populated with a list of strings,
# one for each text run in presentation
text_runs = []
for slide in prs.slides:
  for shape in slide.shapes:
    if not shape.has_text_frame:
      continue
  for paragraph in shape.text_frame.paragraphs:
    for run in paragraph.runs:
      text_runs.append(run.text)

This code extracts all the text in the file except text inside tables, and some of the values I would like to update are in tables. I tried to adapt the code from this question: Reading text values in a PowerPoint table using pptx? but could not get it to work. Any ideas? Thanks.

Asked By: tdognuts

Answer 1

Your code will miss more text than just tables; it won’t see text in shapes that are part of groups, for example.

For tables, you’ll need to do a couple of things:

Test the shape to see if the shape’s .HasTable property is true (in python-pptx, shape.has_table). If so, you can work with the shape’s .Table object (shape.table in python-pptx) to extract the text. Conceptually, and very aircode:

For r = 1 to tbl.rows.count
   For c = 1 to tbl.columns.count
      tbl.cell(r,c).Shape.Textframe.Text ' is what you're after
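In Python, the same idea (tables plus shapes nested in groups) might look like the sketch below. It assumes pptx-style objects: group shapes expose a nested `.shapes` collection, table shapes report `has_table` and expose `.table`, and cells have a `.text` property. The helper name `iter_shape_text` is made up for illustration, and the demo at the bottom runs on stand-in objects rather than a real presentation.

```python
from types import SimpleNamespace


def iter_shape_text(shapes):
    """Yield text from pptx-style shapes, descending into groups and tables.

    Assumption: the objects follow python-pptx naming (`shapes`, `has_table`,
    `table`, `has_text_frame`, `text_frame`). This is a sketch, not a
    traversal API provided by the library.
    """
    for shape in shapes:
        nested = getattr(shape, "shapes", None)
        if nested is not None:
            # Group shape: recurse into its member shapes.
            yield from iter_shape_text(nested)
        elif getattr(shape, "has_table", False):
            # Table: visit every cell, row by row.
            for row in shape.table.rows:
                for cell in row.cells:
                    yield cell.text
        elif getattr(shape, "has_text_frame", False):
            # Ordinary text box or placeholder.
            yield shape.text_frame.text


# Demo on stand-in objects (not real pptx objects), to show the traversal.
def _cell(text):
    return SimpleNamespace(text=text)


table_shape = SimpleNamespace(
    has_table=True,
    table=SimpleNamespace(rows=[
        SimpleNamespace(cells=[_cell("a"), _cell("b")]),
        SimpleNamespace(cells=[_cell("c"), _cell("d")]),
    ]),
)
text_shape = SimpleNamespace(
    has_table=False, has_text_frame=True, text_frame=SimpleNamespace(text="hello")
)
group_shape = SimpleNamespace(shapes=[text_shape, table_shape])

found = list(iter_shape_text([group_shape]))
print(found)
```

With a real file you would pass `slide.shapes` for each slide in `prs.slides`. Detecting groups by the presence of a `.shapes` attribute is a shortcut; comparing `shape.shape_type` against the group shape type is the more explicit check.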

Answered By: Steve Rindsberg

Answer 2

This works for me:

    def access_table(prs):
        # prs is the object returned by pptx.Presentation(path)
        slide = prs.slides[0]  # first slide
        table = slide.shapes[2].table  # the table's shape index varies (0..n)
        for r in table.rows:
            s = ""
            for c in r.cells:
                s += c.text_frame.text + " | "
                # to write:
                # c.text_frame.text = "example"
            print(s)
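Since the question is about updating values and not only reading them, a write path might look like the following sketch. `set_cell_text` is a hypothetical helper; it assumes a pptx-style table with a zero-based `cell(row_idx, col_idx)` method whose cells have a writable `text` property. The demo uses a stand-in table, not a real pptx object.

```python
from types import SimpleNamespace


def set_cell_text(table, row_idx, col_idx, new_text):
    """Replace the text of one cell and return the previous text.

    Assumes a pptx-style table: `table.cell(row, col)` with zero-based
    indices, returning an object with a readable and writable `text`.
    """
    cell = table.cell(row_idx, col_idx)
    old_text = cell.text
    cell.text = str(new_text)
    return old_text


# Demo on a stand-in table (not a real pptx object), to show the call shape.
_cells = {
    (0, 0): SimpleNamespace(text="Q1"),
    (0, 1): SimpleNamespace(text="100"),
}
fake_table = SimpleNamespace(cell=lambda r, c: _cells[(r, c)])

previous = set_cell_text(fake_table, 0, 1, 125)
print(previous, "->", fake_table.cell(0, 1).text)
```

With a real presentation you would call `prs.save(...)` afterwards to write the change to disk. Assigning to `cell.text` replaces the whole cell contents, so run-level formatting inside the cell may be lost.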

Answered By: German Lopez

Answer 3

How to Extract All of the Text out of Tables Inside of a Slide-show Presentation

The following code extracts text from tables in a slide-show presentation. Text in the presentation outside of tables is omitted, but you can modify my code to capture text from non-table objects as well.

from pptx import Presentation

def get_tables_from_presentation(pres):
   r"""
   The input parameter `pres` should receive
   an object returned by `pptx.Presentation()`

   EXAMPLE:
       ```
       import pptx
       p = r"C:\Users\user\Desktop\power_point_pres.pptx"
       pres = pptx.Presentation(p)

       tables = get_tables_from_presentation(pres)
       ```
   """
   tables = list()
   for slide in pres.slides:
      for shp in iter(slide.shapes):
         if shp.has_table:
            table = shp.table
            tables.append(table)
   return tables


def iter_to_nonempty_table_cells(tbl):
   """
   :param tbl: 'pptx.table.Table'
          input table is NOT modified

   :return: iterator over the stripped text of non-empty cells
   """
   for ridx in range(len(tbl.rows)):
      for cidx in range(len(tbl.columns)):
         cell = tbl.cell(ridx, cidx)
         txt = cell.text.strip()
         # keep any non-empty cell, including single-character ones
         if txt:
            yield txt


# establish read path
# raw string, so that \U is not read as a unicode escape
in_file_path = r"C:\Users\user\Desktop\power_point_pres.pptx"

# Open slide-show presentation
pres = Presentation(in_file_path)

# extract tables from slide-show presentation
tables = get_tables_from_presentation(pres)

for tbl in tables:
   it = iter_to_nonempty_table_cells(tbl)
   # separate the cells so their text does not run together
   print(" | ".join(it))
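The generator above flattens each table and skips empty cells, so the row and column position of a value is lost. If you need to know where a value sits (for example, to update it later), a row-preserving variant might look like this sketch. `table_to_rows` is a hypothetical helper; it assumes a pptx-style table with iterable `rows`, each row having `cells`, each cell having `text`. The demo runs on stand-in objects.

```python
from types import SimpleNamespace


def table_to_rows(tbl):
    """Return the table's text as a list of rows, each a list of cell strings.

    Assumes a pptx-style table: iterable `tbl.rows`, each row with `.cells`,
    each cell with `.text`. Empty cells are kept so column positions line up.
    """
    return [[cell.text.strip() for cell in row.cells] for row in tbl.rows]


# Demo on a stand-in table (not a real pptx object).
def _row(*texts):
    return SimpleNamespace(cells=[SimpleNamespace(text=t) for t in texts])


fake_tbl = SimpleNamespace(
    rows=[_row("Region", "Sales"), _row("North", " 42 "), _row("South", "")]
)

grid = table_to_rows(fake_tbl)
for row in grid:
    print(" | ".join(row))
```

Here `grid[r][c]` lines up with `tbl.cell(r, c)`, which makes it straightforward to locate a cell before writing to it.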

A Note About One of the Other Answers to This Question

Someone else posted a semi-useful answer to this question written in pseudo-code. They wrote the following:

For r = 1 to tbl.rows.count
  For c = 1 to tbl.columns.count
     tbl.cell(r,c).Shape.Textframe.Text

The problem is that this is not Python.

In Python, it is a syntax error to write For r = 1 to 10.
Instead, we would write something like the following:

for r in range(1, 11):
   print(r)

or, in a more roundabout way:

from itertools import count, takewhile
for r in takewhile(lambda k: k <= 10, count(1)):
   print(r)

Additionally, the row indices start at r = 0, not r = 1.

The upper-left corner of the table is tbl.cell(0, 0), not tbl.cell(1, 1).

There is no .count attribute on tbl.rows or tbl.columns, so (For r = 1 to tbl.rows.count) cannot work as written. Use len(tbl.rows) and len(tbl.columns) instead.

tbl.cell(r,c).Shape won’t work, because objects instantiated from the class pptx.table._Cell have no attribute named Shape.

Cell objects have the following attributes:

fill
is_merge_origin
is_spanned
margin_bottom
margin_left
margin_right
margin_top
merge
part
span_height
span_width
split
text
text_frame
vertical_anchor

A fix is shown below:

# ----------------------------------------
# BEGIN SYNTACTICALLY INCORRECT CODE
# ----------------------------------------
# For r = 1 to tbl.rows.count
#   For c = 1 to tbl.columns.count
#      tbl.cell(r,c).Shape.Textframe.Text
# ----------------------------------------
# END SYNTACTICALLY INCORRECT CODE
# BEGIN SYNTACTICALLY CORRECT CODE
# ----------------------------------------
for r in range(len(tbl.rows)):
    for c in range(len(tbl.columns)):
        print(tbl.cell(r, c).text)
# ----------------------------------------
# END SYNTACTICALLY CORRECT CODE
# ----------------------------------------

A Note About your Original Code

The `continue` keyword

In your original source code, you have the following for-loop:

for shape in slide.shapes:
    if not shape.has_text_frame:
      continue

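That for-loop does not do anything.

The continue keyword simply means "skip the rest of the loop body and move on to the next item." However, there is no code after your continue and before the end of the loop body. That is, the loop would have moved on anyway without you having to write continue. Because the paragraph loop that follows is indented at the same level as for shape, it is not part of that loop body: it runs once per slide, after the shape loop has finished, and only ever looks at the last shape on the slide.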

To understand more about continue consider the following example:

for k in [1, 2, 3, 4, 5]:
    print("For k ==", k, "we have k % 2 == ", k % 2)
    if not k % 2 == 0:
        continue
    print("For k ==", k, "we got past the `continue`")

The output is:

For k == 1 we have k % 2 ==  1
For k == 2 we have k % 2 ==  0
For k == 2 we got past the `continue`
For k == 3 we have k % 2 ==  1
For k == 4 we have k % 2 ==  0
For k == 4 we got past the `continue`
For k == 5 we have k % 2 ==  1

The following three pieces of code all print the exact same messages, regardless of the use of the continue keyword:

for k in [1, 2, 3, 4, 5]:
    print(k)

for k in [1, 2, 3, 4, 5]:
    print(k)
    continue

for k in [1, 2, 3, 4, 5]:
    print(k)
    if float(k)//1 % 2 == 0:
        continue
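Applied to the loop from the question, the fix is to indent the paragraph loop so that it sits inside the shape loop, after the `continue`. A sketch is shown below; the function name `collect_run_text` is made up for illustration, and the demo uses stand-in objects rather than a real presentation.

```python
from types import SimpleNamespace


def collect_run_text(slides):
    """Collect the text of every run in every text-bearing shape."""
    text_runs = []
    for slide in slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue  # skips the paragraph loop below for this shape
            # Nested inside the shape loop, so it runs once per shape.
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    text_runs.append(run.text)
    return text_runs


# Demo on stand-in objects (not real pptx objects).
def _text_shape(*runs):
    para = SimpleNamespace(runs=[SimpleNamespace(text=r) for r in runs])
    return SimpleNamespace(
        has_text_frame=True, text_frame=SimpleNamespace(paragraphs=[para])
    )


picture = SimpleNamespace(has_text_frame=False)
slide = SimpleNamespace(
    shapes=[_text_shape("Hello, ", "world"), picture, _text_shape("Bye")]
)

runs = collect_run_text([slide])
print(runs)
```

Now the `continue` has work to skip: for a shape without a text frame, the paragraph loop is bypassed and the next shape is examined.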

Answered By: Samuel Muldoon
