Iterate through all sheets of all workbooks in a directory
Question:
I am trying to combine all spreadsheets from all workbooks in a directory into a single df. I’ve tried with glob and with os.scandir, but either way I keep only getting the first sheet of all workbooks.
First attempt:
import pandas as pd
import glob

workbooks = glob.glob(r"mydirectory*.xlsx")
list = []
for file in workbooks:
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True)
    list.append(df)
dataframe = pd.concat(list, axis=0)
Second attempt:
import os
import pandas as pd

df = pd.DataFrame()
path = r"mydirectory"
with os.scandir(path) as files:
    for file in files:
        data = pd.read_excel(file, sheet_name=None)
        df = df.append(data)
I think the issue lies with the for loop but I’m too inexperienced to pin down the problem. Any help would be greatly appreciated, thx!!!
Answers:
If I understand what you have written correctly, you want something like this:
import pandas as pd
import glob

# list of workbooks in directory
workbooks = glob.glob(r"mydirectory*.xlsx")
l = []
# for each file in list
for file in workbooks:
    # ExcelFile object for the file allows for retrieving sheet names
    xl_file = pd.ExcelFile(file)
    # concatenate DataFrames created from each sheet in the file
    df = pd.concat(
        [pd.read_excel(file, sheet_name=sheet) for sheet in xl_file.sheet_names],
        ignore_index=True,
    )
    # append to list
    l.append(df)
# concatenate all file DataFrames to one DataFrame
dataframe = pd.concat(l, axis=0)
This loops through the sheets within each Excel file and concatenates them. That inner loop is the only difference from what you had already written.
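As a variation on the loop above, here is a minimal sketch that opens each workbook only once and records where every row came from. The function name combine_workbooks and the extra columns "workbook" and "sheet" are my own illustrative choices, not something from your code, and it assumes every file matching *.xlsx in the directory is a readable workbook.

```python
from pathlib import Path

import pandas as pd


def combine_workbooks(directory):
    """Concatenate every sheet of every .xlsx file found in `directory`."""
    frames = []
    for path in sorted(Path(directory).glob("*.xlsx")):
        # Open the workbook once; the context manager closes the file handle.
        with pd.ExcelFile(path) as xl_file:
            for sheet in xl_file.sheet_names:
                df = xl_file.parse(sheet)
                # Illustrative bookkeeping columns (hypothetical names).
                df["workbook"] = path.name
                df["sheet"] = sheet
                frames.append(df)
    # pd.concat raises on an empty list, so handle the "no files" case.
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

You would call it as dataframe = combine_workbooks("mydirectory"). Keeping the workbook and sheet names makes it easy to check that every sheet really ended up in the result.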
Alternative:
Alternatively, without needing to first find the sheet names, the dictionary created by pd.read_excel(file, sheet_name=None) can be used.
import pandas as pd
import glob

# list of workbooks in directory
workbooks = glob.glob(r"mydirectory*.xlsx")
l = []
# for each file in list
for file in workbooks:
    # concatenate the dictionary of DataFrames from pd.read_excel
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True)
    l.append(df)
# concatenate all file DataFrames to one DataFrame
dataframe = pd.concat(l, axis=0)
A good explanation/example of the use of sheet_name=None can be found here. In short, it makes pd.read_excel return a dictionary of DataFrames, one per sheet, keyed by sheet name. This can then be concatenated to one DataFrame, as above, or an individual sheet’s DataFrame can be accessed through dictionary["sheet_name"].
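To make the dictionary behaviour concrete, here is a small sketch that uses a hand-built dictionary as a stand-in for what pd.read_excel(file, sheet_name=None) returns. The sheet names "Sheet1" and "Sheet2" and the "value" column are made up for illustration. It also shows that if you leave out ignore_index, pd.concat keeps the dictionary keys as an index level, so the sheet name can be turned into a column.

```python
import pandas as pd

# Stand-in for pd.read_excel(file, sheet_name=None): sheet name -> DataFrame.
# The sheet names and data here are invented for the example.
sheets = {
    "Sheet1": pd.DataFrame({"value": [1, 2]}),
    "Sheet2": pd.DataFrame({"value": [3]}),
}

# An individual sheet is a plain dictionary lookup.
first = sheets["Sheet1"]

# Given a dict, pd.concat uses the keys as the outer level of the index.
combined = pd.concat(sheets, names=["sheet", "row"])

# Move the sheet name out of the index into an ordinary column.
flat = combined.reset_index(level="sheet").reset_index(drop=True)
print(flat)
```

Whether you keep the sheet name or drop it with ignore_index=True depends on whether you need to trace rows back to their sheet later.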