Python Loop through Excel sheets, place into one df

Question

I have an excel file foo.xlsx with about 40 sheets sh1, sh2, etc. Each sheet has the format:

area      cnt   name\nparty1   name\nparty2
blah      9         5               5
word      3         7               5

In each sheet I want to rename the columns of the form name\nparty (the header contains a newline) so that only the party remains as the label. Example output:

area      cnt    party1    party2     sheet
bacon     9         5         5        sh1
spam      3         7         5        sh1
eggs      2         18        4        sh2

I am reading in the file with:

book = pd.ExcelFile(path)

And then wondering if I need to do:

for f in filelist:
    df = pd.ExcelFile.parse(book,sheetname=??)
    'more operations here'
    # only change column names 2 and 3
    for i, col in enumerate(df):
        if i>=2 and i<=3:
            new_col_name = col.split("\n")[-1]
            df[new_col_name] =

Or something like that?

Asked By: Yolo_chicken

Answer 1

The read_excel method of pandas lets you read all sheets in at once if you set the keyword parameter sheet_name=None (in some older versions of pandas this was called sheetname). This returns a dictionary – the keys are the sheet names, and the values are the sheets as dataframes.

Using this, we can simply loop through the dictionary and:

Add an extra column to the dataframes containing the relevant sheetname
Use the rename method to rename our columns. The lambda splits each column name on the newline character ('\n') and keeps the final piece; if the name contains no newline, it is left unchanged.
Append to a list, to be combined at the end.

Once this is done, we combine all the sheets into one with pd.concat. Then we reset the index and all should be well. Note: if you have parties present on one sheet but not others, this will still work but will fill any missing columns for each sheet with NaN.

import pandas as pd

sheets_dict = pd.read_excel('Book1.xlsx', sheet_name=None)

all_sheets = []
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    sheet = sheet.rename(columns=lambda x: x.split('\n')[-1])
    all_sheets.append(sheet)

full_table = pd.concat(all_sheets)
full_table.reset_index(inplace=True, drop=True)

print(full_table)

Prints:

    area  cnt  party1  party2   sheet
0  bacon    9       5       5  Sheet1
1   spam    3       7       5  Sheet1
2   eggs    2      18       4  Sheet2
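
The same steps can be tried without an Excel file. This is only a minimal sketch: fake_sheets is a hypothetical, hand-built stand-in for the dictionary that read_excel returns with sheet_name=None, and Sheet2 deliberately lacks party2 to show the NaN fill mentioned above.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(..., sheet_name=None)
fake_sheets = {
    "Sheet1": pd.DataFrame({
        "area": ["bacon", "spam"],
        "cnt": [9, 3],
        "name\nparty1": [5, 7],
        "name\nparty2": [5, 5],
    }),
    "Sheet2": pd.DataFrame({
        "area": ["eggs"],
        "cnt": [2],
        "name\nparty1": [18],
    }),
}

all_sheets = []
for name, sheet in fake_sheets.items():
    # Keep only the text after the last newline in each header
    sheet = sheet.rename(columns=lambda x: x.split("\n")[-1])
    # Record which sheet each row came from
    sheet["sheet"] = name
    all_sheets.append(sheet)

# Missing columns (party2 on Sheet2) are filled with NaN
full_table = pd.concat(all_sheets, ignore_index=True)
print(full_table)
```

Passing ignore_index=True to pd.concat is an alternative to calling reset_index afterwards.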

Answered By: asongtoruin

Answer 2

Consider the following code, which also uses the pandas library.

It reads only a single sheet and iterates over the rows with the DataFrame’s iterrows():

import pandas as pd

def read_excel_rows():
    filename = "file.xlsx"
    sheet_name = "Sheet Name"
    df = pd.read_excel(filename, sheet_name=sheet_name)
    # Replace NaN with None (cast to object first so None is not coerced back to NaN)
    df = df.astype(object).where(pd.notnull(df), None)
    data = []
    for index, row in df.iterrows():
        # Individual values are available as row[COLUMN_NAME];
        # here the whole row is stored as a dict
        data.append(row.to_dict())
    return data

It’s not entirely related to the question asked; I’m just posting it for anybody who needs it.

Answered By: Haribk

Answer 3

Sometimes, if the Excel file is really large, it’s better to read the sheets in one by one instead of reading the entire file into memory. You can do so using ExcelFile:

with pd.ExcelFile('foo.xlsx') as f:
    sheets = f.sheet_names
    for sht in sheets:
        df = f.parse(sht)
        # do something with df
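
As one possible way to fill in the "do something" step for the question's task, here is a rough sketch. The helper names clean_sheet and combine_sheets are hypothetical, and it assumes the party headers contain a newline as described in the question.

```python
import pandas as pd

def clean_sheet(df, sheet_name):
    """Keep the text after the last newline in each header and tag the rows."""
    # rename returns a new frame, so the caller's df is left untouched
    out = df.rename(columns=lambda c: str(c).split("\n")[-1])
    out["sheet"] = sheet_name
    return out

def combine_sheets(path):
    """Parse one sheet at a time and stack the cleaned frames."""
    frames = []
    with pd.ExcelFile(path) as f:
        for sht in f.sheet_names:
            frames.append(clean_sheet(f.parse(sht), sht))
    return pd.concat(frames, ignore_index=True)
```

Calling combine_sheets('foo.xlsx') would then return a single frame with a sheet column.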

That said, if the task is to concatenate all sheets into a single frame, there’s also a one-liner available:

joined_df = pd.concat(pd.read_excel('foo.xlsx', sheet_name=None).values(), ignore_index=True)

Or, for OP’s specific case, pass in names to overwrite the column names of each sheet (instead of operating on each sheet) and concatenate them all. This assumes every sheet has the same four columns in the same order.

joined_df = (
    pd.concat(pd.read_excel('foo.xlsx', names=['area','cnt','party1','party2'], sheet_name=None))
    .rename_axis(['Sheet', None]).reset_index(level=0)
    .reset_index(drop=True)
)
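
To see what that chain does, here is a small sketch using in-memory frames. The sheets dictionary is a hypothetical stand-in for the result of read_excel with sheet_name=None, cut down to two columns for brevity.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(..., names=[...], sheet_name=None)
sheets = {
    "sh1": pd.DataFrame({"area": ["bacon", "spam"], "party1": [5, 7]}),
    "sh2": pd.DataFrame({"area": ["eggs"], "party1": [18]}),
}

# Concatenating a dict uses its keys as the outer index level,
# so the index becomes (sheet name, original row number)
stacked = pd.concat(sheets)

joined_df = (
    stacked.rename_axis(["Sheet", None])  # name the outer level "Sheet"
    .reset_index(level=0)                 # move the sheet name into a column
    .reset_index(drop=True)               # discard the per-sheet row numbers
)
print(joined_df)
```

The sheet name ends up as the first column because reset_index inserts the former index level at the front.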

Answered By: cottontail
