Using Pandas to pd.read_excel() for multiple worksheets of the same workbook
Question:
I have a large spreadsheet file (.xlsx) that I’m processing using python pandas. It happens that I need data from two tabs (sheets) in that large file. One of the tabs has a ton of data and the other is just a few square cells.
When I use pd.read_excel() on any worksheet, it looks to me like the whole file is loaded (not just the worksheet I’m interested in). So when I use the method twice (once for each sheet), I effectively have to suffer the whole workbook being read in twice (even though we’re only using the specified sheet).
How do I only load specific sheet(s) with pd.read_excel()?
Answers:
Try pd.ExcelFile:
xls = pd.ExcelFile('path_to_file.xls')
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')
As noted by @HaPsantran, the entire Excel file is read in during the ExcelFile() call (there doesn’t appear to be a way around this). This merely saves you from having to read the same file in each time you want to access a new sheet.
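As a minimal sketch of the "open once, parse many" idea, the block below builds a small throwaway workbook and reads two sheets through a single pd.ExcelFile handle. The file name, sheet names and data are made up for illustration, and it assumes the openpyxl package is installed for .xlsx support.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# The path, sheet names and contents are hypothetical.
workbook_path = Path(tempfile.mkdtemp()) / "demo.xlsx"
with pd.ExcelWriter(workbook_path) as writer:
    pd.DataFrame({"a": range(5), "b": range(5, 10)}).to_excel(
        writer, sheet_name="Big", index=False
    )
    pd.DataFrame({"key": ["x"], "value": [1]}).to_excel(
        writer, sheet_name="Small", index=False
    )

# Open the workbook once, then parse each sheet from the same handle.
# Using ExcelFile as a context manager closes the file afterwards.
with pd.ExcelFile(workbook_path) as xls:
    big_df = pd.read_excel(xls, sheet_name="Big")
    small_df = pd.read_excel(xls, sheet_name="Small")

print(big_df.shape, small_df.shape)
```

The with block is optional; it only makes sure the file handle is released once both sheets have been parsed.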
Note that the sheet_name argument to pd.read_excel() can be the name of the sheet (as above), an integer specifying the sheet number (e.g. 0, 1, etc.), a list of sheet names or indices, or None. If a list is provided, it returns a dictionary where the keys are the sheet names/indices and the values are the data frames. The default is to simply return the first sheet (i.e. sheet_name=0).
If None is specified, all sheets are returned, as a {sheet_name: dataframe} dictionary.
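The four forms of sheet_name described above can be compared side by side. This is a sketch against a throwaway workbook; the sheet names "First" and "Second" and the data are invented for the example, and openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-sheet workbook used only for this demonstration.
path = Path(tempfile.mkdtemp()) / "sheets.xlsx"
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"n": [1, 2]}).to_excel(writer, sheet_name="First", index=False)
    pd.DataFrame({"n": [3]}).to_excel(writer, sheet_name="Second", index=False)

by_name = pd.read_excel(path, sheet_name="Second")       # a single DataFrame
by_position = pd.read_excel(path, sheet_name=0)          # a single DataFrame
some = pd.read_excel(path, sheet_name=["First", 1])      # dict keyed by "First" and 1
everything = pd.read_excel(path, sheet_name=None)        # dict keyed by sheet name

print(type(by_name).__name__, type(some).__name__, list(everything))
```

Note that with a list, the dictionary keys are whatever was passed in (names or integers), not necessarily the sheet names.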
You can also use the index for the sheet:
xls = pd.ExcelFile('path_to_file.xls')
sheet1 = xls.parse(0)
will give the first worksheet. For the second worksheet:
sheet2 = xls.parse(1)
You could also specify the sheet name as a parameter:
data_file = pd.read_excel('path_to_file.xls', sheet_name="sheet_name")
will load only the sheet "sheet_name".
There are a few options:
Read all sheets directly into an ordered dictionary.
import pandas as pd
# for pandas version >= 0.21.0
sheet_to_df_map = pd.read_excel(file_name, sheet_name=None)
# for pandas version < 0.21.0
sheet_to_df_map = pd.read_excel(file_name, sheetname=None)
Read the first sheet directly into dataframe
df = pd.read_excel('excel_file_path.xls')
# this will read the first sheet into df
Read the Excel file and get a list of sheets. Then choose and load the sheets.
xls = pd.ExcelFile('excel_file_path.xls')
# Now you can list all sheets in the file
xls.sheet_names
# ['house', 'house_extra', ...]
# to read just one sheet to dataframe:
df = pd.read_excel(xls, sheet_name="house")
Read all sheets and store it in a dictionary. Same as first but more explicit.
# to read all sheets to a map
sheet_to_df_map = {}
for sheet_name in xls.sheet_names:
    sheet_to_df_map[sheet_name] = xls.parse(sheet_name)
# you can also use sheet_index [0,1,2..] instead of sheet name.
Thanks @ihightower for pointing out the way to read all sheets, and @toto_tico and @red-headphone for pointing out the version issue.
sheetname : string, int, mixed list of strings/ints, or None, default 0
Deprecated since version 0.21.0: Use sheet_name instead.
Yes, unfortunately it will always load the full file. If you’re doing this repeatedly, it is probably best to extract the sheets to separate CSVs and then load them separately. You can automate that process with d6tstack, which also adds features like checking whether all the columns are equal across all sheets or multiple Excel files.
import d6tstack
c = d6tstack.convert_xls.XLStoCSVMultiSheet('multisheet.xlsx')
c.convert_all() # ['multisheet-Sheet1.csv','multisheet-Sheet2.csv']
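If you would rather not add a dependency, the same one-time extraction can be sketched with pandas alone. This is an illustrative alternative, not d6tstack's implementation; the workbook is generated on the fly, the output naming merely mimics the pattern shown above, and openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical workbook created here so the example runs on its own.
out_dir = Path(tempfile.mkdtemp())
workbook_path = out_dir / "multisheet.xlsx"
with pd.ExcelWriter(workbook_path) as writer:
    pd.DataFrame({"a": [1, 2, 3]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"key": ["x"], "value": [1]}).to_excel(
        writer, sheet_name="Sheet2", index=False
    )

# Pay the cost of opening the workbook once and write one CSV per sheet.
csv_paths = []
with pd.ExcelFile(workbook_path) as xls:
    for sheet in xls.sheet_names:
        csv_path = out_dir / f"{workbook_path.stem}-{sheet}.csv"
        xls.parse(sheet).to_csv(csv_path, index=False)
        csv_paths.append(csv_path)

# Later runs can load just the small sheet without touching the workbook.
small_df = pd.read_csv(csv_paths[1])
print([p.name for p in csv_paths])
```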
pd.read_excel('filename.xlsx')
reads the first sheet of the workbook by default.
pd.read_excel('filename.xlsx', sheet_name='sheetname')
reads the specified sheet of the workbook, and
pd.read_excel('filename.xlsx', sheet_name=None)
reads all the worksheets into a dictionary of DataFrames keyed by sheet name (an OrderedDict in older pandas versions).
If the Excel file is saved in the same folder as your Python program (so a relative path works), you only need to give the file name along with the sheet name or number.
Example:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_excel("wt_vs_ht.xlsx", "Sheet2")
print(data)
x = data.Height
y = data.Weight
plt.plot(x, y, 'x')
plt.show()
If you are interested in reading all sheets and merging them together, a short way to do it is:
sheet_to_df_map = pd.read_excel('path_to_file.xls', sheet_name=None)
mdf = pd.concat(sheet_to_df_map, axis=0, ignore_index=True)
This will combine all the sheets into a single data frame, mdf.
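One thing to be aware of is that ignore_index=True discards the dictionary keys, so the merged frame no longer records which sheet each row came from. If that matters, a small variation is to tag each frame before concatenating. The dictionary below is a stand-in for the result of pd.read_excel(..., sheet_name=None), with invented data and a hypothetical "sheet" column name.

```python
import pandas as pd

# Stand-in for pd.read_excel(..., sheet_name=None): DataFrames keyed by sheet name.
sheet_to_df_map = {
    "Sheet1": pd.DataFrame({"month": ["January", "February"], "year": [2000, 2001]}),
    "Sheet2": pd.DataFrame({"month": ["March"], "year": [2002]}),
}

# Add the originating sheet name as a column, then stack the frames row-wise.
merged = pd.concat(
    [df.assign(sheet=name) for name, df in sheet_to_df_map.items()],
    ignore_index=True,
)

print(merged)
```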
If:
- you want multiple, but not all, worksheets, and
- you want a single df as an output
Then you can pass a list of worksheet names, which you could populate manually:
import pandas as pd
path = "C:\\Path\\To\\Your\\Data\\"  # backslashes doubled so the closing quote is not escaped
file = "data.xlsx"
sheet_lst_wanted = ["01_SomeName","05_SomeName","12_SomeName"] # tab names from Excel
### import and compile data ###
# read all sheets from list into an ordered dictionary
dict_temp = pd.read_excel(path+file, sheet_name= sheet_lst_wanted)
# concatenate the ordered dict items into a dataframe
df = pd.concat(dict_temp, axis=0, ignore_index=True)
OR
A bit of automation is possible if your desired worksheets have a common naming convention that also allows you to differentiate from unwanted sheets:
# substitute following block for the sheet_lst_wanted line in above block
import xlrd
# note: xlrd 2.0 and later reads only .xls files; for .xlsx, pd.ExcelFile(path+file).sheet_names gives the same list
# string common to only worksheets you want
str_like = "SomeName"
### create list of sheet names in Excel file ###
xls = xlrd.open_workbook(path+file, on_demand=True)
sheet_lst = xls.sheet_names()
### create list of sheets meeting criteria ###
sheet_lst_wanted = []
for s in sheet_lst:
    # note: following conditional statement based on my sheets ending with the string defined in str_like
    if s.endswith(str_like):
        sheet_lst_wanted.append(s)
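The filtering step itself does not depend on any Excel library, so it can be sketched as a plain function over a list of names. The helper name select_sheets and the sample list are hypothetical; in practice the list would come from something like pd.ExcelFile(path + file).sheet_names.

```python
def select_sheets(sheet_names, suffix):
    """Return the sheet names that end with the given suffix, in workbook order."""
    return [name for name in sheet_names if name.endswith(suffix)]


# Example sheet names, invented for illustration.
all_sheets = ["01_SomeName", "02_Other", "05_SomeName", "Summary", "12_SomeName"]
sheet_lst_wanted = select_sheets(all_sheets, "SomeName")

print(sheet_lst_wanted)
```

The resulting list can be passed directly as the sheet_name argument of pd.read_excel().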
There are various options depending on the use case:
- If one doesn’t know the sheet names.
- If the sheet names are not relevant.
- If one knows the names of the sheets.
Below we will look closely at each of the options.
See the Notes section for information such as finding out the sheet names.
Option 1
If one doesn’t know the sheet names
# Read all sheets in your File
df = pd.read_excel('FILENAME.xlsx', sheet_name=None)
# Prints all the sheet names (the keys of the returned dictionary)
print(df.keys())
Then, depending on the sheet one wants to read, one can load each of them into its own dataframe, such as
sheet1_df = pd.read_excel('FILENAME.xlsx', sheet_name=SHEET1NAME)
sheet2_df = pd.read_excel('FILENAME.xlsx', sheet_name=SHEET2NAME)
Option 2
If the name is not relevant and all one cares about is the position of the sheet, say one wants only the first sheet:
# Read all sheets in your File
df = pd.read_excel('FILENAME.xlsx', sheet_name=None)
sheet1 = list(df.keys())[0]
Then, using that sheet name, one can load the sheet into its own dataframe, such as
sheet1_df = pd.read_excel('FILENAME.xlsx', sheet_name=sheet1)
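When only the position matters, two shorter routes may be worth considering: pass the integer position directly, or reuse the dictionary that was already loaded instead of reading the file again. The sketch below uses a throwaway workbook with invented sheet names and assumes openpyxl is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical workbook with two sheets, created only for this example.
workbook_path = Path(tempfile.mkdtemp()) / "positions.xlsx"
with pd.ExcelWriter(workbook_path) as writer:
    pd.DataFrame({"v": [10, 20]}).to_excel(writer, sheet_name="Alpha", index=False)
    pd.DataFrame({"v": [30]}).to_excel(writer, sheet_name="Beta", index=False)

# Route 1: ask for the first sheet by position (0-based).
first_df = pd.read_excel(workbook_path, sheet_name=0)

# Route 2: if all sheets are already loaded, take the first entry of the dict.
all_sheets = pd.read_excel(workbook_path, sheet_name=None)
first_name = next(iter(all_sheets))
first_from_dict = all_sheets[first_name]

print(first_name, first_df.shape)
```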
Option 3
Here we will consider the case where one knows the name of the sheets.
For the examples, one will consider that there are three sheets named Sheet1, Sheet2, and Sheet3. The content in each is the same, and looks like this
0 1 2
0 85 January 2000
1 95 February 2001
2 105 March 2002
3 115 April 2003
4 125 May 2004
5 135 June 2005
With this, depending on one’s goals, there are multiple approaches:
- Store everything in the same dataframe. One approach would be to concat the sheets as follows
sheets = ['Sheet1', 'Sheet2', 'Sheet3']
df = pd.concat([pd.read_excel('FILENAME.xlsx', sheet_name = sheet) for sheet in sheets], ignore_index = True)
[Out]:
0 1 2
0 85 January 2000
1 95 February 2001
2 105 March 2002
3 115 April 2003
4 125 May 2004
5 135 June 2005
6 85 January 2000
7 95 February 2001
8 105 March 2002
9 115 April 2003
10 125 May 2004
11 135 June 2005
12 85 January 2000
13 95 February 2001
14 105 March 2002
15 115 April 2003
16 125 May 2004
17 135 June 2005
Basically, this is how pandas.concat works: by default it stacks the dataframes along the row axis (Source).
- Store each sheet in a different dataframe (let’s say, df1, df2, …)
sheets = ['Sheet1', 'Sheet2', 'Sheet3']
for i, sheet in enumerate(sheets):
    globals()['df' + str(i + 1)] = pd.read_excel('FILENAME.xlsx', sheet_name = sheet)
[Out]:
# df1
0 1 2
0 85 January 2000
1 95 February 2001
2 105 March 2002
3 115 April 2003
4 125 May 2004
5 135 June 2005
# df2
0 1 2
0 85 January 2000
1 95 February 2001
2 105 March 2002
3 115 April 2003
4 125 May 2004
5 135 June 2005
# df3
0 1 2
0 85 January 2000
1 95 February 2001
2 105 March 2002
3 115 April 2003
4 125 May 2004
5 135 June 2005
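As an alternative to writing into globals(), the sheets can be kept in a dictionary and unpacked into named variables only where needed. This is a sketch under assumptions: the workbook is generated here with made-up column names ("value", "month", "year"), and openpyxl is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

sheets = ["Sheet1", "Sheet2", "Sheet3"]

# Hypothetical sample data, written to three identical sheets.
sample = pd.DataFrame(
    {
        "value": [85, 95, 105, 115, 125, 135],
        "month": ["January", "February", "March", "April", "May", "June"],
        "year": [2000, 2001, 2002, 2003, 2004, 2005],
    }
)
workbook_path = Path(tempfile.mkdtemp()) / "three_sheets.xlsx"
with pd.ExcelWriter(workbook_path) as writer:
    for sheet in sheets:
        sample.to_excel(writer, sheet_name=sheet, index=False)

# One call returns a dict of DataFrames keyed by the requested sheet names.
frames = pd.read_excel(workbook_path, sheet_name=sheets)

# Unpack explicitly if separate variables are convenient.
df1, df2, df3 = (frames[name] for name in sheets)

print(df2.head())
```

A dictionary keeps the frames discoverable (frames.keys()) and avoids creating variables whose names exist only at runtime.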
Notes:
- If one wants to know the sheet names, one can use the ExcelFile class as follows
sheets = pd.ExcelFile('FILENAME.xlsx').sheet_names
[Out]: ['Sheet1', 'Sheet2', 'Sheet3']
- In this case one is assuming that the file FILENAME.xlsx is in the same directory as the script one is running.
- If the file is in a folder of the current directory called Data, one way would be to use r'./Data/FILENAME.xlsx' to create a variable, such as path, as follows
path = r'./Data/FILENAME.xlsx'
df = pd.read_excel(path, sheet_name=None)
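The same relative path can also be built with pathlib, which handles the separator for the current operating system. A small sketch, using the folder and file names from the note above; pd.read_excel() accepts such path objects as well as strings.

```python
from pathlib import Path

# Folder "Data" under the current working directory, as in the note above.
data_dir = Path("Data")
workbook_path = data_dir / "FILENAME.xlsx"

print(workbook_path.as_posix())
```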
- This might be a relevant read.
df = pd.read_excel('FileName.xlsx', 'SheetName')
This will read sheet SheetName from file FileName.xlsx.
You can read all the sheets using the following lines
import pandas as pd
file_instance = pd.ExcelFile('your_file.xlsx')
main_df = pd.concat([pd.read_excel(file_instance, sheet_name=name) for name in file_instance.sheet_names], axis=0)