Is it possible to get an Excel document's row count without loading the entire document into memory?

Question:

I’m working on an application that processes huge Excel 2007 files, and I’m using OpenPyXL to do it. OpenPyXL has two different methods of reading an Excel file – one “normal” method where the entire document is loaded into memory at once, and one method where iterators are used to read row-by-row.

The problem is that when I’m using the iterator method, I don’t get any document meta-data like column widths and row/column count, and i really need this data. I assume this data is stored in the Excel document close to the top, so it shouldn’t be necessary to load the whole 10MB file into memory to get access to it.

So, is there a way to get ahold of the row/column count and column widths without loading the entire document into memory first?

Asked By: Hubro

||

Answers:

The solution suggested in this answer has been deprecated, and might no longer work.


Taking a look at the source code of OpenPyXL (IterableWorksheet) I’ve figured out how to get the column and row count from an iterator worksheet:

wb = load_workbook(path, use_iterators=True)
sheet = wb.worksheets[0]

row_count = sheet.get_highest_row() - 1
column_count = letter_to_index(sheet.get_highest_column()) + 1

IterableWorksheet.get_highest_column returns a string with the column letter that you can see in Excel, e.g. "A", "B", "C" etc. Therefore I’ve also written a function to translate the column letter to a zero based index:

def letter_to_index(letter):
    """Converts a column letter, e.g. "A", "B", "AA", "BC" etc. to a zero based
    column index.

    A becomes 0, B becomes 1, Z becomes 25, AA becomes 26 etc.

    Args:
        letter (str): The column index letter.
    Returns:
        The column index as an integer.
    """
    letter = letter.upper()
    result = 0

    for index, char in enumerate(reversed(letter)):
        # Get the ASCII number of the letter and subtract 64 so that A
        # corresponds to 1.
        num = ord(char) - 64

        # Multiply the number with 26 to the power of `index` to get the correct
        # value of the letter based on it's index in the string.
        final_num = (26 ** index) * num

        result += final_num

    # Subtract 1 from the result to make it zero-based before returning.
    return result - 1

I still haven’t figured out how to get the column sizes though, so I’ve decided to use a fixed-width font and automatically scaled columns in my application.

Answered By: Hubro

This might be extremely convoluted and I might be missing the obvious, but without OpenPyXL filling in the column_dimensions in Iterable Worksheets (see my comment above), the only way I can see of finding the column size without loading everything is to parse the xml directly:

from xml.etree.ElementTree import iterparse
from openpyxl import load_workbook
wb=load_workbook("/path/to/workbook.xlsx", use_iterators=True)
ws=wb.worksheets[0]
xml = ws._xml_source
xml.seek(0)

for _,x in iterparse(xml):

    name= x.tag.split("}")[-1]
    if name=="col":
        print "Column %(max)s: Width: %(width)s"%x.attrib # width = x.attrib["width"]

    if name=="cols":
        print "break before reading the rest of the file"
        break
Answered By: Markus

Adding on to what Hubro said, apparently get_highest_row() has been deprecated. Using the max_row and max_column properties returns the row and column count. For example:

    wb = load_workbook(path, use_iterators=True)
    sheet = wb.worksheets[0]

    row_count = sheet.max_row
    column_count = sheet.max_column
Answered By: dransom90

https://pythonhosted.org/pyexcel/iapi/pyexcel.sheets.Sheet.html
see : row_range() Utility function to get row range

if you use pyexcel, can call row_range get max rows.

python 3.4 test pass.

Answered By: delphisharp

Python 3

import openpyxl as xl

wb = xl.load_workbook("Sample.xlsx", enumerate)

#the 2 lines under do the same. 
sheet = wb.get_sheet_by_name('sheet') 
sheet = wb.worksheets[0]

row_count = sheet.max_row
column_count = sheet.max_column

#this works fore me.
Answered By: Anders Stamnes

Options using pandas.

  1. Gets all sheetnames with count of rows and columns.
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
for sheet in sheetnames:
    df = xl.parse(sheet)
    dimensions = df.shape
    print('sheetname', ' --> ', dimensions)
  1. Single sheet count of rows and columns.
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
df = xl.parse(sheetnames[0])   # [0] get first tab/sheet.
dimensions = df.shape
print(f'sheetname: "{sheetnames[0]}" - -> {dimensions}')

output sheetname "Sheet1" --> (row count, column count)

Answered By: Cam

I found that sometimes Openpyxl gives wrong row and column numbers. So I used pandas library. Here is the code

import pandas as pd


name_of_file =  "test.xlsx"
data = pd.read_excel(name_of_file)

total_rows = len(data)
total_columns = len(data.columns)

print(total_rows, total_columns)
Answered By: Mounesh
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.