Extracting necessary columns from multiple Excel files in Python

Question

I am trying to extract and combine selected columns from 19 Excel files into a single Excel file. I am able to extract the required columns from a single file with the code below.

import pandas as pd
import openpyxl

file = pd.read_excel("Shift Handover To A - 05-25-2021.xlsx", "25th May")

dataframe=pd.DataFrame(file[["S No", "Issue Reported By", "Shift", "Severity", "ServiceDesk Ticket #", "Issue Description", "Issue Type", "System Component", "Server Type", "Date and Time of the occurrence", "DT Observed", "Action Taken", "Worked By", "DT Action Taken", "Date and Time Resolution", "Current Stus"]])

# selecting rows based on condition
rslt_df = dataframe.loc[dataframe['Current Stus'] == 'In-Progress' ]

rslt_df.to_excel('output.xlsx')

I am trying to apply it to all files with the code below:

import os
import pandas as pd
cwd = os.path.abspath('')
import openpyxl
files = os.listdir(cwd)

for file in files:
    if file.startswith('Shift'):
        file = pd.read_excel(os.path.join(cwd, file))
dataframe=pd.DataFrame(file[["S No", "Issue Reported By", "Shift", "Severity", "ServiceDesk Ticket #", "Issue Description", "Issue Type", "System Component", "Server Type", "Date and Time of the occurrence", "DT Observed", "Action Taken", "Worked By", "DT Action Taken", "Date and Time Resolution", "Current Stus"]])

# selecting rows based on condition
rslt_df = dataframe.loc[dataframe['Current Stus'] == 'In-Progress' ]

#print(rslt_df)
rslt_df.to_excel('output.xlsx')

But I am receiving a TypeError for the line dataframe=pd.DataFrame(file…..
"TypeError: string indices must be integers"
What could be wrong?

Asked By: Pavan Chakravarthy

Answer 1

The problem with your code is in these lines:

for file in files:
    if file.startswith('Shift'):
        file = pd.read_excel(os.path.join(cwd, file))
dataframe=pd.DataFrame(file[["S No", ... "Current Stus"]])

You use 'file' as the loop variable (for file in files) and also assign the result of pd.read_excel to it inside the loop. When the loop ends, if the last name in files does not start with 'Shift', then file is still a string, so file[["S No", … "Current Stus"]] raises the TypeError.

Just use a different name for the DataFrame.
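
To see why the name clash matters, here is a minimal sketch that reproduces the error without any Excel files. The file names and the dictionary standing in for the DataFrame returned by pd.read_excel are hypothetical, chosen only for illustration.

```python
# Hypothetical directory listing; the last entry does not start with "Shift".
files = ["Shift Handover To A - 05-25-2021.xlsx", "notes.txt"]

for file in files:
    if file.startswith("Shift"):
        # Stand-in for pd.read_excel(...). Rebinding the loop variable here
        # only lasts until the next iteration assigns the next file name.
        file = {"S No": [1, 2]}

# After the loop, `file` is whatever the last iteration left behind.
leftover_type = type(file).__name__

try:
    file[["S No"]]  # indexing a str with a list
except TypeError as exc:
    error_message = str(exc)

print(leftover_type)
print(error_message)
```

Even when every name matches, reusing one variable keeps only the last file that was read, so combining all 19 files needs a separate name plus some form of accumulation.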

Answered By: IoaTzimas

Answer 2

You can amend your code as follows:

Define an empty DataFrame and accumulate the result of each loop iteration with .append(). (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, pd.concat does the same job.)

There is no need to call pd.DataFrame after the loop. You can just select the columns you want and assign the result back with dataframe = dataframe[["S No", ...]].

files = os.listdir(cwd)

dataframe = pd.DataFrame()
for file in files:
    if file.startswith('Shift'):
        file_read = pd.read_excel(os.path.join(cwd, file))
        dataframe = dataframe.append(file_read) 

dataframe = dataframe[["S No", "Issue Reported By", "Shift", "Severity", "ServiceDesk Ticket #", "Issue Description", "Issue Type", "System Component", "Server Type", "Date and Time of the occurrence", "DT Observed", "Action Taken", "Worked By", "DT Action Taken", "Date and Time Resolution", "Current Stus"]]

# selecting rows based on condition
rslt_df = dataframe.loc[dataframe['Current Stus'] == 'In-Progress' ]

#print(rslt_df)
rslt_df.to_excel('output.xlsx')
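
On pandas 2.0 and later, where DataFrame.append is no longer available, the same accumulate-then-filter idea can be sketched with pd.concat. This is an illustration under assumptions, not the answerer's code: the helper name combine_in_progress is hypothetical, the column list is shortened, and small in-memory frames stand in for the results of pd.read_excel.

```python
import pandas as pd

# Shortened, illustrative subset of the question's columns.
COLUMNS = ["S No", "Shift", "Severity", "Current Stus"]


def combine_in_progress(frames, columns=COLUMNS, status="In-Progress"):
    """Stack the frames, keep the selected columns, and filter by status."""
    # Select the wanted columns from each frame, then stack them row-wise.
    combined = pd.concat([frame[columns] for frame in frames], ignore_index=True)
    # Keep only the rows whose status column matches.
    return combined.loc[combined["Current Stus"] == status]


# Stand-ins for frames that would come from pd.read_excel(...) in the loop.
frame_a = pd.DataFrame({
    "S No": [1, 2],
    "Shift": ["A", "A"],
    "Severity": ["High", "Low"],
    "Current Stus": ["In-Progress", "Closed"],
    "Extra": [0, 0],  # a column that is not selected
})
frame_b = pd.DataFrame({
    "S No": [3],
    "Shift": ["B"],
    "Severity": ["Low"],
    "Current Stus": ["In-Progress"],
    "Extra": [0],
})

result = combine_in_progress([frame_a, frame_b])
print(result)
```

In the real script, the list passed in would be built inside the loop (for example, by appending each pd.read_excel result to a plain Python list), and result.to_excel('output.xlsx') would write the combined file. Note that pd.concat raises a ValueError when given an empty list, so it is worth checking that at least one file matched.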

Answered By: SeaBean

Answer 3

Hi, suppose I want to apply a condition like rslt_df = dataframe.loc[dataframe['Current Stus'] == 'In-Progress'] to two columns, using a different condition on the second column. How do I do that?
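
One common way to filter on two columns at once is to combine boolean masks with & (and) or | (or), wrapping each comparison in parentheses. The sketch below uses a small made-up frame; the choice of "Severity" as the second column and the values in it are assumptions for illustration only.

```python
import pandas as pd

# Made-up data standing in for the combined Excel contents.
dataframe = pd.DataFrame({
    "Severity": ["High", "Low", "High"],
    "Current Stus": ["In-Progress", "In-Progress", "Closed"],
})

in_progress = dataframe["Current Stus"] == "In-Progress"
high_severity = dataframe["Severity"] == "High"

# Rows that satisfy both conditions.
both = dataframe.loc[in_progress & high_severity]

# Rows that satisfy at least one of the conditions.
either = dataframe.loc[in_progress | high_severity]

print(both)
print(either)
```

The parentheses matter when the comparisons are written inline, because & and | bind more tightly than == in Python.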

Answered By: oorja
