How to increase process speed using read_excel in pandas?

Question:

I need use pd.read_excel to process every sheet in one excel file.
But in most cases,I did not know the sheet name.
So I use this to judge how many sheet in excel:

i_sheet_count=0
i=0
try:
  df.read_excel('/tmp/1.xlsx',sheetname=i)
  i_sheet_count+=1
  i+=1
else:
  i+=1
print(i_sheet_count)

During the process,I found that the process is quite slow,
So,can read_excel only read limited rows to improve the speed?
I tried nrows but did not work..still slow..

Asked By: james.peng

||

Answers:

Read all worksheets without guessing

Use sheet_name = None argument to pd.read_excel. This will read all worksheets into a dictionary of dataframes. For example:

dfs = pd.read_excel('file.xlsx', sheet_name=None)

# access 'Sheet1' worksheet
res = dfs['Sheet1']

Limit number of rows or columns

You can use parse_cols and skip_footer arguments to limit the number of columns and/or rows. This will reduce read time, and also works with sheet_name = None.

For example, the following will read the first 3 columns and, if your worksheet has 100 rows, it will read only the first 20.

df = pd.read_excel('file.xlsx', sheet_name=None, parse_cols='A:C', skip_footer=80)

If you wish to apply worksheet-specific logic, you can do so by extracting sheet_names:

sheet_names = pd.ExcelFile('file.xlsx', on_demand=True).sheet_names

dfs = {}
for sheet in sheet_names:
    dfs[sheet] = pd.read_excel('file.xlsx', sheet)

Improving performance

Reading Excel files into Pandas is naturally slower than other options (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.

One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv.

Edit 02 Nov: correct sheetname to sheet_name

Answered By: jpp

I had excel with many sheets. I wanted only those sheets whose state is visible. If you do not know about that its fine. But if you want to read your sheet names from excel then you can use this code. it took me average 3 sec time to read near about 20 sheet names. It takes quite a few attempts to get this.

file_name = r'C:Usersxyz.xlsx'
File_sheet_list = []
workbookObj = pd.ExcelFile(file_name)
LenOfWorkBook = len(workbookObj.book.worksheets)
idx = 0
for idx in range(0, LenOfWorkBook ):
  if workbookObj.book.worksheets[idx].sheet_state == "visible":
    File_sheet_list.append(workbookObj.book.worksheets[idx].title)
Answered By: Swapnil Kulkarni