How to iterate through a folder with multiple excel workbooks containing multiple worksheets create a single data frame?

Question:

The following function is what I have come up with to iterate through multiple excel files to store the data into a single data frame. However, only the data from the final file is being stored in the final data frame. What should I do to get the data from all the files to be stored in the same df and then exported to a csv file ?

def excel_to_df(folder, start_row, end_row, start_col, end_col):
    # loop through all excel files in the folder
    for file in os.listdir(folder):
        exact_file_path = folder +  "\\" + file 
        print(exact_file_path)
        # check if file is an excel file
        if file.endswith('xlsx'):
            # create workbook and its worksheets for each file
            wb = openpyxl.load_workbook(exact_file_path)
            ws = wb.worksheets

        # create a list to store the dataframes
        df_list = []

        # iterate over the worksheets
        for worksheet in ws:
            # get the name of the worksheet
            name = worksheet.title
            # create an empty list to store the values
            cell_values = []

            # iterate over the rows and columns in the range

            for row in worksheet.iter_rows(min_row = row_min, max_row = row_max,
                                            min_col = col_min, max_col = col_max):


                                            # append the cell values to the list
                                            cell_values.append([cell.value for cell in row])

                                            # create a dataframe from the cell values and the worksheet name
                                            df = pd.DataFrame(cell_values, columns=range(start_col, end_col+1), index=[name]*len(cell_values))

                                            # append the df to the list
                                            df_list.append(df)


                                            # concatenate the list of dataframes into a single dataframe
                                            df = pd.concat(df_list)
                                            # save the output to a csv file
                                            df.to_csv('test.csv', index=True)

                                            return df
Asked By: VRN NRV

||

Answers:

Your immediate problem is that you’re creating df_list inside the loop so that each time the loop starts over, it will overwrite whatever was already in it. Additionally, your return is at the end of (and inside) the loop so it doesn’t ever get to the second element. When it gets to the return it gives you what it has and stops running. You just need to rearrange it, like this:

def excel_to_df(folder, start_row, end_row, start_col, end_col):
    
    # create a list to store the dataframes
    df_list = []
    # loop through all excel files in the folder
    for file in os.listdir(folder):
        exact_file_path = folder +  "\\" + file 
        print(exact_file_path)
        # check if file is an excel file
        if file.endswith('xlsx'):
            # create workbook and its worksheets for each file
            wb = openpyxl.load_workbook(exact_file_path)
            ws = wb.worksheets
        else:
            # if the file doesn't end with xlsx then don't try to open it as though it is
            next
        # iterate over the worksheets
        for worksheet in ws:
            # get the name of the worksheet
            name = worksheet.title
            # create an empty list to store the values
            cell_values = []

            # iterate over the rows and columns in the range

            for row in worksheet.iter_rows(min_row = row_min, max_row = row_max,
                                            min_col = col_min, max_col = col_max):


                                            # append the cell values to the list
                                            cell_values.append([cell.value for cell in row])

            # create a dataframe from the cell values and the worksheet name
            df = pd.DataFrame(cell_values, columns=range(start_col, end_col+1), index=[name]*len(cell_values))

            # append the df to the list
            df_list.append(df)


    # concatenate the list of dataframes into a single dataframe
    df = pd.concat(df_list)
    # save the output to a csv file
    df.to_csv('test.csv', index=True)

    return df

As an aside, is there a reason you’re manually creating a DF instead of just using pd.read_excel? If not, I’d recommend getting rid of your for row loop and just use pd.read_excel(exact_file_path, sheet_name=worksheet.title)

Answered By: Dean MacGregor
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.