python pandas selecting columns from a dataframe via a list of column names


I have a dataframe with a lot of columns in it. Now I want to select only certain columns. I have saved all the names of the columns that I want to select into a Python list and now I want to filter my dataframe according to this list.

I’ve been trying to do:

df_new = df[[list]]

where list includes all the column names that I want to select.

However I get the error:

TypeError: unhashable type: 'list'

When I do:

df_new = df[list]

I get the error

KeyError: not in index

Any help on this one?

Asked By: AaronDT



You can remove one []:

df_new = df[list]

Also better is use other name as list, e.g. L:

df_new = df[L]

It look like working, I try only simplify it:

L = []
for x in df.columns: 
    if not "_" in x[-3:]: 
print (L)

List comprehension:

print ([x for x in df.columns if not "_" in x[-3:]])
Answered By: jezrael

1. [] aka __getitem__()

The canonical way to select a list of columns from a dataframe is via [].

df = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [2, 2, 1, 0], 'c': [3, 3, 1, 0]})
cols = ['a', 'b']

df1 = df[cols]

Note that all column labels in cols have to also be df (otherwise KeyError: "... not in index" will be raised).

One thing to note is that when you want to assign new columns to df1 as filtered above (e.g. df1['new'] = 0), a SettingWithCopyWarning will be raised. To silence it, explicitly make a new copy:

df1 = df[cols].copy()

2. Handle KeyError: "... not in index"

To ensure cols contains only column labels that are in df, you can call isin on the columns and then filter df.

cols = ['a', 'b', 'f']
df1 = df[cols]                           # <----- error
df1 = df.loc[:, df.columns.isin(cols)]   # <----- OK

3. filter()

Another way to select a list of columns from a dataframe is via filter(). A nice thing about it is that it creates a copy (so no SettingWithCopyWarning) and only selects the column labels in cols that exist in the dataframe, so handles the KeyError under the hood.

cols = ['a', 'b', 'f']
df1 = df.filter(cols)

As can be seen from the output below, f in cols is ignored because it doesn’t exist as a column label in df.


Answered By: cottontail