How can I read a range('A5:B10') and place these values into a dataframe using openpyxl
Question:
Being able to define the ranges in a manner similar to excel, i.e. ‘A5:B10’ is important to what I need so reading the entire sheet to a dataframe isn’t very useful.
So what I need to do is read the values from multiple ranges in the Excel sheet to multiple different dataframes.
valuerange1 = ['a5:b10']
valuerange2 = ['z10:z20']
df = pd.DataFrame(values from valuerange)
df = pd.DataFrame(values from valuerange1)
or
df = pd.DataFrame(values from ['A5:B10'])
I have searched but either I have done a very poor job of searching or everyone else has gotten around this problem but I really can’t.
Thanks.
Answers:
Using openpyxl
Since you have indicated, that you are looking into a very user friendly way to specify the range (like the excel-syntax) and as Charlie Clark already suggested, you can use openpyxl.
The following utility function takes a workbook and a column/row range and returns a pandas DataFrame:
from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
import re
def load_workbook_range(range_string, ws):
col_start, col_end = re.findall("[A-Z]+", range_string)
data_rows = []
for row in ws[range_string]:
data_rows.append([cell.value for cell in row])
return pd.DataFrame(data_rows, columns=get_column_interval(col_start, col_end))
Usage:
wb = load_workbook(filename='excel-sheet.xlsx',
read_only=True)
ws = wb.active
load_workbook_range('B1:C2', ws)
Output:
B C
0 5 6
1 8 9
Pandas only Solution
Given the following data in an excel sheet:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
You can load it with the following command:
pd.read_excel('excel-sheet.xlsx')
If you were to limit the data being read, the pandas.read_excel method offers a number of options. Use the parse_cols
, skiprows
and skip_footer
to select the specific subset that you want to load:
pd.read_excel(
'excel-sheet.xlsx', # name of excel sheet
names=['B','C'], # new column header
skiprows=range(0,1), # list of rows you want to omit at the beginning
skip_footer=1, # number of rows you want to skip at the end
parse_cols='B:C' # columns to parse (note the excel-like syntax)
)
Output:
B C
0 5 6
1 8 9
Some notes:
The API of the read_excel method is not meant to support more complex selections. In case you require a complex filter it is much easier (and cleaner) to load the whole data into a DataFrame and use the excellent slicing and indexing mechanisms provided by pandas.
The most easiest way is to use pandas for getting the range of values from excel.
import pandas as pd
#if you want to choose single range, you can use the below method
src=pd.read_excel(r'August.xlsx',usecols='A:C',sheet_name='S')
#if you have multirange, which means a dataframe with A:S and as well some other range
src=pd.read_excel(r'August.xlsx',usecols='A:C,G:I',sheet_name='S')
If you want to use particular range, for example "B3:E5", you can use the following structure.
src=pd.read_excel(r'August.xlsx',usecols='B:E',sheet_name='S',header=2)[0:2]
Being able to define the ranges in a manner similar to excel, i.e. ‘A5:B10’ is important to what I need so reading the entire sheet to a dataframe isn’t very useful.
So what I need to do is read the values from multiple ranges in the Excel sheet to multiple different dataframes.
valuerange1 = ['a5:b10']
valuerange2 = ['z10:z20']
df = pd.DataFrame(values from valuerange)
df = pd.DataFrame(values from valuerange1)
or
df = pd.DataFrame(values from ['A5:B10'])
I have searched but either I have done a very poor job of searching or everyone else has gotten around this problem but I really can’t.
Thanks.
Using openpyxl
Since you have indicated, that you are looking into a very user friendly way to specify the range (like the excel-syntax) and as Charlie Clark already suggested, you can use openpyxl.
The following utility function takes a workbook and a column/row range and returns a pandas DataFrame:
from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
import re
def load_workbook_range(range_string, ws):
col_start, col_end = re.findall("[A-Z]+", range_string)
data_rows = []
for row in ws[range_string]:
data_rows.append([cell.value for cell in row])
return pd.DataFrame(data_rows, columns=get_column_interval(col_start, col_end))
Usage:
wb = load_workbook(filename='excel-sheet.xlsx',
read_only=True)
ws = wb.active
load_workbook_range('B1:C2', ws)
Output:
B C
0 5 6
1 8 9
Pandas only Solution
Given the following data in an excel sheet:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
You can load it with the following command:
pd.read_excel('excel-sheet.xlsx')
If you were to limit the data being read, the pandas.read_excel method offers a number of options. Use the parse_cols
, skiprows
and skip_footer
to select the specific subset that you want to load:
pd.read_excel(
'excel-sheet.xlsx', # name of excel sheet
names=['B','C'], # new column header
skiprows=range(0,1), # list of rows you want to omit at the beginning
skip_footer=1, # number of rows you want to skip at the end
parse_cols='B:C' # columns to parse (note the excel-like syntax)
)
Output:
B C
0 5 6
1 8 9
Some notes:
The API of the read_excel method is not meant to support more complex selections. In case you require a complex filter it is much easier (and cleaner) to load the whole data into a DataFrame and use the excellent slicing and indexing mechanisms provided by pandas.
The most easiest way is to use pandas for getting the range of values from excel.
import pandas as pd
#if you want to choose single range, you can use the below method
src=pd.read_excel(r'August.xlsx',usecols='A:C',sheet_name='S')
#if you have multirange, which means a dataframe with A:S and as well some other range
src=pd.read_excel(r'August.xlsx',usecols='A:C,G:I',sheet_name='S')
If you want to use particular range, for example "B3:E5", you can use the following structure.
src=pd.read_excel(r'August.xlsx',usecols='B:E',sheet_name='S',header=2)[0:2]