Iterate through all sheets of all workbooks in a directory

Question:

I am trying to combine all spreadsheets from all workbooks in a directory into a single df. I’ve tried with glob and with os.scandir but either way I keep only getting the first sheet of all workbooks.
First attempt:

import pandas as pd
import glob

workbooks = glob.glob(r"mydirectory*.xlsx")
list = []
for file in workbooks:
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index = True)
    list.append(df)
dataframe = pd.concat(list, axis = 0)

Second attempt:

import os
import pandas as pd
df = pd.DataFrame()
path = r"mydirectory"
with os.scandir(path) as files:
    for file in files:
        data = pd.read_excel(file, sheet_name = None)
        df = df.append(data) 

I think the issue lies with the for loop but I’m too inexperienced to pin down the problem. Any help would be greatly appreciated, thx!!!

Asked By: Maradam


Answers:

If I understand what you have written correctly, you want something like this:

import pandas as pd
import glob

# list of workbooks in directory
workbooks = glob.glob(r"mydirectory*.xlsx")
frames = []

# for each file in the list
for file in workbooks:
    # ExcelFile opens the workbook once and exposes its sheet names
    with pd.ExcelFile(file) as xl_file:
        # read every sheet from the already-open workbook and concatenate them
        df = pd.concat([pd.read_excel(xl_file, sheet_name=sheet) for sheet in xl_file.sheet_names], ignore_index=True)
    # collect the per-workbook DataFrame
    frames.append(df)
# concatenate all per-workbook DataFrames into one DataFrame with a fresh index
dataframe = pd.concat(frames, axis=0, ignore_index=True)

This loops over the sheets in each Excel file and concatenates them. That inner loop is the main difference from what you had already written.

Alternative:

Without first looking up the sheet names, the dictionary returned by pd.read_excel(file, sheet_name=None) can be used directly.

import pandas as pd
import glob

# list of workbooks in directory
workbooks = glob.glob(r"mydirectory*.xlsx")
frames = []

# for each file in the list
for file in workbooks:
    # sheet_name=None returns a dict of DataFrames, one per sheet; concatenate them
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True)
    # collect the per-workbook DataFrame
    frames.append(df)
# concatenate all per-workbook DataFrames into one DataFrame with a fresh index
dataframe = pd.concat(frames, axis=0, ignore_index=True)

In short, passing sheet_name=None to pd.read_excel returns a dictionary of DataFrames keyed by sheet name, with one entry per sheet. This dictionary can then be concatenated into one DataFrame, as above, or an individual sheet’s DataFrame can be accessed through dictionary["sheet_name"].
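
To make the dictionary behaviour concrete, here is a minimal sketch that runs without any Excel files. The sheets dictionary is built by hand as a stand-in for what pd.read_excel(file, sheet_name=None) returns, and the sheet names ("Jan", "Feb") and columns are made up for illustration. It also shows one possible way to keep the sheet name as a column after concatenating, which is an addition and not part of the code above.

```python
import pandas as pd

# Stand-in for pd.read_excel("workbook.xlsx", sheet_name=None):
# a dict mapping each sheet name to that sheet's DataFrame.
# Sheet names and columns here are hypothetical.
sheets = {
    "Jan": pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]}),
    "Feb": pd.DataFrame({"item": ["c"], "qty": [3]}),
}

# An individual sheet is available by its name.
feb = sheets["Feb"]

# Passing the dict to pd.concat stacks all sheets. The dict keys become
# the outer index level, labelled "sheet" here via names=.
combined = (
    pd.concat(sheets, names=["sheet", "row"])
    .reset_index(level="sheet")   # move the sheet name into a column
    .reset_index(drop=True)       # replace the per-sheet row index with 0..n-1
)
print(combined)
```

If the sheet of origin does not matter, pd.concat(sheets, ignore_index=True) gives the same rows without the extra column.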

Answered By: Rawson