Python Loop through Excel sheets, place into one df
Question:
I have an Excel file foo.xlsx with about 40 sheets sh1, sh2, etc. Each sheet has the format:
area  cnt  name\nparty1  name\nparty2
blah  9    5             5
word  3    7             5
In each sheet I want to rename the variables with the format name\nparty to only have the party as a label. Example output:
area   cnt  party1  party2  sheet
bacon  9    5       5       sh1
spam   3    7       5       sh1
eggs   2    18      4       sh2
I am reading in the file with:
book = pd.ExcelFile(path)
And then wondering if I need to do:
for f in filelist:
    df = pd.ExcelFile.parse(book,sheetname=??)
    'more operations here'
    # only change column names 2 and 3
    for i, col in enumerate(df):
        if i>=2 and i<=3:
            new_col_name = col.split("\n")[-1]
            df[new_col_name] =
Or something like that?
Answers:
The read_excel method of pandas lets you read all sheets in at once if you set the keyword parameter sheet_name=None (in some older versions of pandas this was called sheetname). This returns a dictionary: the keys are the sheet names, and the values are the sheets as dataframes.
Using this, we can simply loop through the dictionary and:
- Add an extra column to each dataframe containing the relevant sheet name
- Use the rename method to rename our columns: by using a lambda, we simply take the final entry of the list obtained by splitting each column name any time there is a new line. If there is no new line, the column name is unchanged.
- Append to a list, to be combined at the end.
Once this is done, we combine all the sheets into one with pd.concat. Then we reset the index and all should be well. Note: if you have parties present on one sheet but not others, this will still work but will fill any missing columns for each sheet with NaN.
import pandas as pd

# sheet_name=None returns a dict of {sheet name: DataFrame}
sheets_dict = pd.read_excel('Book1.xlsx', sheet_name=None)

all_sheets = []
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    # keep only the part of each column name after the last newline
    sheet = sheet.rename(columns=lambda x: x.split('\n')[-1])
    all_sheets.append(sheet)

full_table = pd.concat(all_sheets)
full_table.reset_index(inplace=True, drop=True)

print(full_table)
Prints:
    area  cnt  party1  party2   sheet
0  bacon    9       5       5  Sheet1
1   spam    3       7       5  Sheet1
2   eggs    2      18       4  Sheet2
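The renaming step and the NaN filling can be checked without an Excel file. The sketch below assumes a hand-built dictionary standing in for what read_excel returns with sheet_name=None; the sheet names, the values, and the missing party2 column on the second sheet are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(path, sheet_name=None):
# a dict mapping sheet names to DataFrames.
sheets_dict = {
    "sh1": pd.DataFrame({"area": ["bacon", "spam"], "cnt": [9, 3],
                         "name\nparty1": [5, 7], "name\nparty2": [5, 5]}),
    # This sheet deliberately lacks the party2 column.
    "sh2": pd.DataFrame({"area": ["eggs"], "cnt": [2], "name\nparty1": [18]}),
}

all_sheets = []
for name, sheet in sheets_dict.items():
    # Keep only the text after the last newline; names without one are unchanged.
    renamed = sheet.rename(columns=lambda c: c.split("\n")[-1])
    # assign returns a copy, so the frames in sheets_dict are not modified.
    all_sheets.append(renamed.assign(sheet=name))

full_table = pd.concat(all_sheets, ignore_index=True)
print(full_table)
```

Here the party2 cell for the sh2 row should come out as NaN, since that sheet has no such column.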
Consider the following code, which also uses the pandas library. It reads only a single sheet and uses the DataFrame's iterrows():
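import pandas as pd

def read_csv():
    filename = "file.xlsx"
    sheet_name = "Sheet Name"
    df = pd.read_excel(filename, sheet_name=sheet_name)
    # Replace NaN with None (cast to object so None is kept in numeric columns)
    df = df.astype(object).where(pd.notnull(df), None)
    data = []
    for index, row in df.iterrows():
        # Take values as row[COLUMN_NAME] and append them to data,
        # e.g. data.append({'column': row['column']}).
        # Here the whole row is appended as a dict, as a placeholder.
        data.append(row.to_dict())
    return data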
It’s not entirely related to the question asked; just posting it for anybody who needs it.
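If the goal of the loop is only a list of row dictionaries, a loop-free variant may be enough. This is a minimal sketch on a made-up in-memory frame rather than an Excel file, and it assumes that casting to object dtype is acceptable, so that None is not coerced back to NaN in numeric columns.

```python
import pandas as pd

# Made-up data standing in for a sheet read with pd.read_excel.
df = pd.DataFrame({"area": ["bacon", "spam"], "cnt": [9.0, float("nan")]})

# Cast to object first so that None survives in the numeric column.
cleaned = df.astype(object).where(df.notna(), None)

# One dictionary per row, keyed by column name.
data = cleaned.to_dict(orient="records")
print(data)
```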
Sometimes, if the Excel file is really large, it’s better to read the sheets in one by one instead of reading the entire file into memory. You can do this using ExcelFile:
with pd.ExcelFile('foo.xlsx') as f:
    sheets = f.sheet_names
    for sht in sheets:
        df = f.parse(sht)
        # do something with df
That said, if the task is to concatenate all sheets into a single frame, there’s also a one-liner available:
joined_df = pd.concat(pd.read_excel('foo.xlsx', sheet_name=None).values(), ignore_index=True)
Or, for OP’s specific case, pass in names to overwrite the column names of each sheet (instead of operating on each sheet) and concatenate them all.
joined_df = (
    pd.concat(pd.read_excel('foo.xlsx', names=['area','cnt','party1','party2'], sheet_name=None))
    .rename_axis(['Sheet', None]).reset_index(level=0)
    .reset_index(drop=True)
)
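To see why the rename_axis / reset_index chain works: when pd.concat is given a dictionary, its keys become the outer level of a MultiIndex on the result. The sketch below uses a hand-built dictionary in place of the read_excel call, with made-up values, so it can be run without a workbook.

```python
import pandas as pd

# Hypothetical stand-in for
# pd.read_excel('foo.xlsx', names=[...], sheet_name=None).
cols = ["area", "cnt", "party1", "party2"]
sheets = {
    "sh1": pd.DataFrame([["bacon", 9, 5, 5], ["spam", 3, 7, 5]], columns=cols),
    "sh2": pd.DataFrame([["eggs", 2, 18, 4]], columns=cols),
}

stacked = pd.concat(sheets)           # index is (sheet name, row number)
joined_df = (
    stacked
    .rename_axis(["Sheet", None])     # name the outer level so it becomes a column
    .reset_index(level=0)             # move the sheet name out of the index
    .reset_index(drop=True)           # discard the per-sheet row numbers
)
print(joined_df)
```

Note that with this approach the Sheet column ends up first rather than last.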