Selecting/excluding sets of columns in pandas

Question:

I would like to create views or dataframes from an existing dataframe based on column selections.

For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the following, but it didn’t work:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Try to create a second dataframe df2 from df with all columns except 'B' and D
my_cols = set(df.columns)
my_cols.remove('B').remove('D')

# This returns an error ("unhashable type: set")
df2 = df[my_cols]

What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?

Answers:

You just need to convert your set to a list

import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
Answered By: tacaswell

You don’t really need to convert that into a set:

cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
Answered By: piggybox

Here’s how to create a copy of a DataFrame excluding a list of columns:

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)

But be careful! You mention views in your question, suggesting that if you changed df, you’d want df2 to change too. (Like a view would in a database.)

This method doesn’t achieve that:

>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
     A         B         C         D
0  999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
          A         C
0  0.251262 -1.980673

Note also that this is also true of @piggybox’s method. (Although that method is nice and slick and Pythonic. I’m not doing it down!!)

For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.

Answered By: LondonRob

Also have a look into the built-in DataFrame.filter function.

Minimalistic but greedy approach (sufficient for the given df):

df.filter(regex="[^BD]")

Conservative/lazy approach (exact matches only):

df.filter(regex="^(?!(B|D)$).*$")

Conservative and generic:

exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
Answered By: Frank

You can either Drop the columns you do not need OR Select the ones you need

# Using DataFrame.drop
df.drop(df.columns[[1, 2]], axis=1, inplace=True)

# drop by Name
df1 = df1.drop(['B', 'C'], axis=1)

# Select the ones you want
df1 = df[['a','d']]
Answered By: Amrita Sawant

There is a new index method called difference. It returns the original columns, with the columns passed as argument removed.

Here, the result is used to remove columns B and D from df:

df2 = df[df.columns.difference(['B', 'D'])]

Note that it’s a set-based method, so duplicate column names will cause issues, and the column order may be changed.


Advantage over drop: you don’t create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:

# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns

# does not create a copy the dataframe
subset = df.columns.difference(['B', 'D'])

df = df.drop_duplicates(subset=subset)
Answered By: IanS

In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.

As of pandas 0.20.0, usecols now accepts callables.1 This update allows more flexible options for reading columns:

skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)

The latter pattern is essentially the inverse of the traditional usecols method – only specified columns are skipped.


Given

Data in a file

import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

filename = "foo.csv"
df.to_csv(filename)

Code

skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1

Output

          A         C
0  0.062350  0.076924
1 -0.016872  1.091446
2  0.213050  1.646109
3 -1.196928  1.153497
4 -0.628839 -0.856529
...

Details

A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).

Note that for the OP’s situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.

A request was raised for a “skipcols” option in this issue and was addressed in a later issue.

Answered By: pylang

You have 4 columns A,B,C,D

Here is a better way to select the columns you need for the new dataframe:-

df2 = df1[['A','D']]

if you wish to use column numbers instead, use:-

df2 = df1[[0,3]]
Answered By: Kapil Marwaha

Another option, without dropping or filtering in a loop:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]

# or more simply include columns:
df[['A', 'B']]

# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]

# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically 
df[df.columns.difference(['C', 'D'])]
Answered By: MrE
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.