Reading Excel into a Python DataFrame starting from row 5 and including headers
Question:
I have an Excel workbook that runs some VBA on opening, which refreshes a pivot table and does some other things.
I then want to import the results of the pivot table refresh into a DataFrame in Python for further analysis.
import xlrd
wb = xlrd.open_workbook(r'C:\Users\cb\Machine_Learning\cMap_Joins.xlsm')
The refreshing and opening of the file works fine. But how do I select the data from the first sheet, starting at, say, row 5 (including the header) and going down to the last record n?
Answers:
You can use pandas' ExcelFile.parse method to read Excel sheets (see the IO docs):
import pandas as pd
xls = pd.ExcelFile(r'C:\Users\cb\Machine_Learning\cMap_Joins.xlsm')
df = xls.parse('Sheet1', skiprows=4, index_col=None, na_values=['NA'])
skiprows=4 ignores the first 4 rows, so reading starts at row index 4 (the fifth row), which then supplies the header. parse accepts several other options as well.
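To see what an integer skiprows does without needing the workbook, here is a minimal sketch that uses pd.read_csv on in-memory text as a stand-in, since skiprows is documented the same way there. The sample rows and column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: four leading lines to discard, then the header, then two records.
raw = "\n".join([
    "report title",
    "generated by macro",
    "note",
    "another note",
    "id,value",
    "1,10",
    "2,20",
])

# skiprows=4 drops lines 0-3; the next remaining line is read as the header.
df = pd.read_csv(io.StringIO(raw), skiprows=4)
print(list(df.columns))
print(len(df))
```

The fifth line ends up as the column names and everything below it becomes data, which is the "row 5 including header" behaviour the question asks about.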
The accepted answer is outdated (as discussed in its comments).
The preferred option now is pd.read_excel(). For example:
df = pd.read_excel(r'C:\Users\cb\Machine_Learning\cMap_Joins.xlsm', skiprows=[0, 1, 2, 3])
The other answers skip the header together with the first 4 rows. To keep a header that sits in the first row, pass skiprows a range that starts at 1, so row index 0 is read as the header and rows 1 to 4 are skipped:
df = pd.read_excel('Book1.xlsx', skiprows=range(1, 5))
or
with pd.ExcelFile('Book1.xlsx') as f:
    df = f.parse('Sheet1', skiprows=range(1, 5))
Either should do the job.
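The two forms of skiprows in this thread pick different header rows, which is easy to mix up. Below is a small pure-Python model of the row selection, not pandas itself; apply_skiprows is a hypothetical helper written only to illustrate the rule (an int skips that many leading rows, a list-like skips those 0-based row indices, and the first surviving row becomes the header).

```python
def apply_skiprows(rows, skiprows):
    """Hypothetical helper: return (header, records) after dropping skipped rows."""
    if isinstance(skiprows, int):
        skipped = set(range(skiprows))  # int: skip that many leading rows
    else:
        skipped = set(skiprows)         # list-like: skip these 0-based indices
    kept = [row for i, row in enumerate(rows) if i not in skipped]
    return kept[0], kept[1:]

# Placeholder rows named after their 0-based position in the sheet.
rows = ["r0", "r1", "r2", "r3", "r4", "r5", "r6"]

header_a, records_a = apply_skiprows(rows, 4)            # header is r4
header_b, records_b = apply_skiprows(rows, range(1, 5))  # header is r0
print(header_a, records_a)
print(header_b, records_b)
```

Which form fits depends on where the header actually sits: skiprows=4 if the header is the fifth row, skiprows=range(1, 5) if the header is the first row and the next four rows are to be dropped.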