How to convert a worksheet to a Data frame in Pandas?
Question:
I am trying to read different worksheets from an Excel workbook in Python with Pandas. When I read the entire workbook and then apply .merge(), only the first worksheet is read and the others are ignored. I also tried to read each worksheet individually, but I suspect they were not converted to data frames successfully, because when I apply .merge() I get the following error: ValueError: Invalid file path or buffer object type: <class 'pandas.core.frame.DataFrame'>
This is what I have done so far:
This code converts the workbook to a data frame, but only the data of the first worksheet is processed:
import pandas as pd
import pypyodbc
from datetime import date
#sql extractor
start_date = date.today()
retrieve_values = "[DEV].[CS].[QT_KPIExport] @start_date='{start_date:%Y-%m-%d}'".format(
start_date=start_date)
connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="X",pwd="xxx", Trusted_Connection="No")
data_frame_sql = pd.read_sql(retrieve_values, connection)
#Read the entire workbook
wb_data = pd.ExcelFile(r"C:\Users\Dev\Testing\Daily_Data\NSN-Daily Data Report.xlsx")
#Convert to a dataframe the entire workbook
data_frame_excel = pd.read_excel(wb_data,index_col=None,na_values=['NA'],parse_cols="J")
#apply merge
merged_df = data_frame_sql.merge(data_frame_excel,how="inner",on="sectorname")
This code tries to read the different worksheets and convert them to data frames with no success…yet! (check the answer below)
data_frame_sql = pd.read_sql(retrieve_values, connection)
#Method 1: Tried to parse worksheet 2
#Read the entire workbook and select the specific worksheet
wb_data = pd.ExcelFile(r"C:\Users\Dev\Testing\Daily_Data\NSN-Daily Data Report.xlsx", sheetname="Sheet-2")
data_frame_excel = pd.read_excel(wb_data,index_col=None,na_values=['NA'],parse_cols="J")
#apply merge
merged_df = data_frame_sql.merge(data_frame_excel,how="inner",on="sectorname")
#No success... the data of the first sheet is read
#Method 2: Tried to parse worksheet 2
#Read the entire workbook
wb_data = pd.ExcelFile(r"C:\Users\Dev\Testing\Daily_Data\NSN-Daily Data Report.xlsx")
data_frame_excel = pd.read_excel(wb_data,index_col=None,na_values=['NA'],parse_cols="J")
#select one specific sheet
ws_sheet_2 = wb_data.parse("Sheet-2")
#apply merge
merged_df = data_frame_sql.merge(ws_sheet_2,how="inner",on="sectorname")
# No success.... ValueError: Invalid file path or buffer object type: <class 'pandas.core.frame.DataFrame'>
Any help or advice is greatly appreciated.
Answers:
I found a solution that did the trick.
#Method 1: Add the sheetname once you have read the entire workbook
#Read the entire workbook
wb_data = pd.ExcelFile(r"C:\Users\Dev\Testing\Daily_Data\NSN-Daily Data Report.xlsx")
#Select your sheetname to read
data_frame_excel = pd.read_excel(wb_data,index_col=None,na_values=['NA'],parse_cols="J",sheetname="Sheet-2")
#apply merge
merged_df = data_frame_sql.merge(data_frame_excel,how="inner",on="sectorname")
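For readers on a more recent pandas release, where the keyword is spelled sheet_name, the same idea can be sketched roughly as follows. This is only an illustration: the workbook is built in memory so the snippet runs on its own, the sheet names, columns, and values are made-up placeholders rather than the asker's data, and openpyxl is assumed to be installed.

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (placeholder data, not the asker's file).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"sectorname": ["A1", "A2"], "kpi": [1, 2]}).to_excel(
        writer, sheet_name="Sheet-1", index=False)
    pd.DataFrame({"sectorname": ["B1", "B2"], "kpi": [3, 4]}).to_excel(
        writer, sheet_name="Sheet-2", index=False)
buffer.seek(0)

# Hypothetical stand-in for the SQL result in the question.
data_frame_sql = pd.DataFrame(
    {"sectorname": ["B1", "B2", "C1"], "region": ["N", "S", "E"]})

# Open the workbook once, then name the sheet when reading it.
wb_data = pd.ExcelFile(buffer, engine="openpyxl")
data_frame_excel = pd.read_excel(wb_data, sheet_name="Sheet-2", na_values=["NA"])

# Both operands are now DataFrames, so merge works as expected.
merged_df = data_frame_sql.merge(data_frame_excel, how="inner", on="sectorname")
print(wb_data.sheet_names)
print(merged_df)
```

The point is that the sheet is selected in read_excel (or ExcelFile.parse), not in the ExcelFile constructor.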
You can get all worksheets from a workbook into a dictionary by using the sheetname=None argument with the read_excel method. Key/value pairs will be ws name/dataframe.
ws_dict = pd.read_excel('excel_file.xlsx', sheetname=None)
Note that the sheetname argument was deprecated in favor of sheet_name in pandas 0.21, and current pandas releases accept only sheet_name.
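Once the sheets are in a dictionary, one possible way to use them in a merge is to stack them into a single frame first. The sketch below assumes every sheet has the same columns; the dictionary is built by hand as a stand-in for what read_excel returns, and all names and values are hypothetical.

```python
import pandas as pd

# Stand-in for ws_dict = pd.read_excel(path, sheet_name=None):
# keys are sheet names, values are DataFrames (placeholder data).
ws_dict = {
    "Sheet-1": pd.DataFrame({"sectorname": ["A1", "A2"], "kpi": [1, 2]}),
    "Sheet-2": pd.DataFrame({"sectorname": ["B1", "B2"], "kpi": [3, 4]}),
}

# Stack all sheets; the dict keys become an index level named "sheet",
# which is then moved into a regular column.
all_sheets = (
    pd.concat(ws_dict, names=["sheet", "row"])
    .reset_index(level="sheet")
    .reset_index(drop=True)
)

# Hypothetical stand-in for the SQL result in the question.
data_frame_sql = pd.DataFrame({"sectorname": ["A2", "B1"], "region": ["N", "S"]})

merged_df = data_frame_sql.merge(all_sheets, how="inner", on="sectorname")
print(merged_df)
```

Keeping the sheet name as a column makes it possible to tell afterwards which worksheet each merged row came from.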
To read .xlsx files in Pandas, for a document with multiple sheets, specify the sheet name and use a different engine.
Step 1 (install the openpyxl package):
! pip install openpyxl
Step 2 (use the openpyxl engine):
data_df = pd.read_excel(<ARCHIVE_PATH>, sheet_name=<sheet_name>, engine='openpyxl')
Here is the official documentation.
Another solution using openpyxl directly:
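from openpyxl import load_workbook
wb = load_workbook(ARCHIVE_PATH)
ws = wb[<sheet-name>]
#Option 1: every row becomes data, columns are numbered from 0
data_df = pd.DataFrame(ws.values)
#Option 2: use the first row of the sheet as the column names
df_tm = ws.values
coluna_tm = next(df_tm)[0:]
df = pd.DataFrame(df_tm, columns=coluna_tm)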
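The header-row variant of the openpyxl approach can be tried without a file on disk. The following is a minimal sketch, assuming openpyxl is installed; the workbook is created in memory and its sheet name and contents are placeholders.

```python
import pandas as pd
from openpyxl import Workbook

# Build a tiny workbook in memory (placeholder data) instead of loading a file.
wb = Workbook()
ws = wb.active
ws.title = "Sheet-2"
ws.append(["sectorname", "kpi"])  # header row
ws.append(["B1", 3])
ws.append(["B2", 4])

rows = ws.values        # generator that yields one tuple per worksheet row
header = next(rows)     # consume the first row to use as column names
df = pd.DataFrame(rows, columns=list(header))
print(df)
```

Because ws.values is a generator, calling next() on it removes the header row from the data that is passed to the DataFrame constructor.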