Skip specific set of columns when reading excel frame – pandas
Question:
I know beforehand what columns I don’t need from an excel file and I’d like to avoid them when reading the file to improve the performance. Something like this:
import pandas as pd
df = pd.read_excel('large_excel_file.xlsx', skip_cols=['col_a', 'col_b',...,'col_zz'])
There is nothing related to this in the documentation. Is there any workaround for this?
Answers:
You can use the following technique. Let the columns we don’t want(want to skip) are 2 5 8, then find all reamining columns we DO WANT TO KEEP as cols
such that:
In [7]: cols2skip = [2,5,8]
In [8]: cols = [i for i in range(10) if i not in cols2skip]
In [9]: cols
Out[9]: [0, 1, 3, 4, 6, 7, 9]
and then we can use those remaining columns(which we DO WANT TO KEEP) using usecols
:
df = pd.read_excel(filename, usecols=cols)
If your version of pandas allows (check first if you can pass a function to usecols), I would try something like:
import pandas as pd
df = pd.read_excel('large_excel_file.xlsx', usecols=lambda x: 'Unnamed' not in x,)
This should skip all columns without header names. You could substitute ‘Unnamed’ with a list of column names you do not want.
If you’re using read_excel
, it’s certainly easier to use a function to filter columns as in MarMat‘s answer. Their answer should be changed a bit to fit the case in the OP where skip_cols
should be left out.
skip_cols = ['col_a', 'col_b',...,'col_zz']
df = pd.read_excel('Book1.xlsx', usecols=lambda x: x not in skip_cols)
You can also read only the columns of an Excel file by passing nrows=0
and use Index.difference
to extract the columns you need.
cols = pd.read_excel('Book1.xlsx', nrows=0).columns
df = pd.read_excel('Book1.xlsx', usecols=cols.difference(skip_cols))
I know beforehand what columns I don’t need from an excel file and I’d like to avoid them when reading the file to improve the performance. Something like this:
import pandas as pd
df = pd.read_excel('large_excel_file.xlsx', skip_cols=['col_a', 'col_b',...,'col_zz'])
There is nothing related to this in the documentation. Is there any workaround for this?
You can use the following technique. Let the columns we don’t want(want to skip) are 2 5 8, then find all reamining columns we DO WANT TO KEEP as cols
such that:
In [7]: cols2skip = [2,5,8]
In [8]: cols = [i for i in range(10) if i not in cols2skip]
In [9]: cols
Out[9]: [0, 1, 3, 4, 6, 7, 9]
and then we can use those remaining columns(which we DO WANT TO KEEP) using usecols
:
df = pd.read_excel(filename, usecols=cols)
If your version of pandas allows (check first if you can pass a function to usecols), I would try something like:
import pandas as pd
df = pd.read_excel('large_excel_file.xlsx', usecols=lambda x: 'Unnamed' not in x,)
This should skip all columns without header names. You could substitute ‘Unnamed’ with a list of column names you do not want.
If you’re using read_excel
, it’s certainly easier to use a function to filter columns as in MarMat‘s answer. Their answer should be changed a bit to fit the case in the OP where skip_cols
should be left out.
skip_cols = ['col_a', 'col_b',...,'col_zz']
df = pd.read_excel('Book1.xlsx', usecols=lambda x: x not in skip_cols)
You can also read only the columns of an Excel file by passing nrows=0
and use Index.difference
to extract the columns you need.
cols = pd.read_excel('Book1.xlsx', nrows=0).columns
df = pd.read_excel('Book1.xlsx', usecols=cols.difference(skip_cols))