python-docx row.cells returns a "merged" cell multiple times

Question:

I’m using the python-docx library and need to read data from tables in a document.

Although I’m able to read the data using the following code,

from docx import Document

document = Document(path_to_your_docx)
tables = document.tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

I get duplicate values wherever a cell’s contents span merged cells: the same value is returned once for each cell that is merged into it. I cannot simply delete duplicate values, since there might be multiple unmerged cells with the same value. How should I go about fixing this?

For reference, I was directed to ask the question here from this GitHub issue.

Thank you.

Asked By: movinghands

Answers:

If you want to get each merged cell exactly once, you can add the following code:

from docx import Document


def iter_unique_cells(row):
    """Generate each distinct cell in `row` once, skipping repeats of a merged cell."""
    prior_tc = None
    for cell in row.cells:
        this_tc = cell._tc
        if this_tc is prior_tc:
            continue
        prior_tc = this_tc
        yield cell


document = Document(path_to_your_docx)
for table in document.tables:
    for row in table.rows:
        for cell in iter_unique_cells(row):
            for paragraph in cell.paragraphs:
                print(paragraph.text)

Seeing the same cell once for each "grid" cell it occupies is the expected behavior. If row cells were not uniform across rows, e.g. if each row in a 3 x 3 table did not necessarily contain 3 cells, it would cause problems elsewhere. For example, accessing row.cells[2] in a three-column table would raise an exception if a merged cell was present in that row.
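To make the "grid" idea concrete, here is a toy model in plain Python. It is not python-docx's implementation; `expand_row` and the `(cell, grid_span)` pairs are made up for illustration. It only shows why a cell spanning two columns shows up twice while the row still has one entry per column.

```python
def expand_row(spans):
    """Expand (cell, grid_span) pairs into one entry per grid column."""
    grid = []
    for cell, grid_span in spans:
        # A cell spanning N columns fills N grid positions with the same object.
        grid.extend([cell] * grid_span)
    return grid


# Hypothetical 3-column row whose first cell is merged across two columns.
merged = object()
plain = object()
row = expand_row([(merged, 2), (plain, 1)])

print(len(row))          # 3
print(row[0] is row[1])  # True
print(row[2] is plain)   # True
```

Because every row expands to the same length, indexing by column position stays valid even when some cells are merged.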

At the same time, it could be useful to have an alternate accessor, perhaps Row.iter_unique_cells(), that didn’t guarantee uniformity across rows. That might be a feature worth requesting.
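The row-level helper only compares neighbours within one row, so a cell merged vertically could still appear once per row it spans. Below is a minimal sketch of a table-level variant. It assumes python-docx reports such a cell through the same underlying `_tc` element in each row, and `iter_unique_table_cells` is a hypothetical name, not part of the library.

```python
def iter_unique_table_cells(table):
    """Yield each distinct cell in `table` once, in row-major order.

    Uses only `table.rows`, `row.cells` and the private `cell._tc`
    attribute, i.e. the same access pattern as the row-level helper.
    """
    seen = {}  # id(tc) -> tc; holding the reference keeps the id from being reused
    for row in table.rows:
        for cell in row.cells:
            tc = cell._tc
            if id(tc) in seen:
                # Already yielded via an earlier grid position (same row or a row above).
                continue
            seen[id(tc)] = tc
            yield cell
```

With a real document, you would iterate `for cell in iter_unique_table_cells(table)` in place of the nested row and cell loops. Since it depends on the private `_tc` attribute, it may need adjusting between library versions.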

Answered By: scanny

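Here is a more up-to-date version, based on the issue at https://github.com/python-openxml/python-docx/issues/13. It walks the underlying table XML directly, so a horizontally merged cell is yielded only once. Note that it relies on private attributes, so it may break between library versions: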

from docx.table import _Cell


def table_itercells(table):
    """Yield one _Cell per w:tc element, row by row."""
    for tr in table._tbl.tr_lst:
        for tc in tr.tc_lst:
            yield _Cell(tc, table)

Answered By: caram