Selecting columns by list (and columns are subset of list)

Question:

I’m selecting several columns of a dataframe, by a list of the column names. This works fine if all elements of the list are in the dataframe.
But if some elements of the list are not in the DataFrame, then it will generate the error "not in index".

Is there a way to select all columns that are included in that list, even if not all elements of the list are in the dataframe? Here is some sample data which generates the above error:

import pandas as pd

df   = pd.DataFrame( [[0,1,2]], columns=list('ABC') )

lst  = list('ARB')

data = df[lst]       # error: not in index
Asked By: csander

Answers:

I think you need Index.intersection:

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})

print (df)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

lst = ['A','R','B']

print (df.columns.intersection(lst))
Index(['A', 'B'], dtype='object')

data = df[df.columns.intersection(lst)]
print (data)
   A  B
0  1  4
1  2  5
2  3  6

Another solution with numpy.intersect1d:

import numpy as np

data = df[np.intersect1d(df.columns, lst)]
print (data)
   A  B
0  1  4
1  2  5
2  3  6
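
For a version that runs on its own, a minimal sketch of the two approaches above follows. It assumes pandas and NumPy are installed; the names by_index and by_numpy are illustrative only. The column order of the two results is not something this sketch relies on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
lst = ['A', 'R', 'B']  # 'R' is not a column of df

# Keep only the requested names that are actual columns of df.
by_index = df[df.columns.intersection(lst)]

# Same idea through NumPy; intersect1d returns the sorted unique common values.
by_numpy = df[np.intersect1d(df.columns, lst)]

print(by_index.columns.tolist())
print(by_numpy.columns.tolist())
```
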
Answered By: jezrael

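A few other ways; the list comprehension is much faster. (Using & as a set intersection on an Index was deprecated in pandas 1.x and changed to an element-wise logical operation in pandas 2.0, so the first form only works on older versions.)

In [1357]: df[df.columns & lst]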
Out[1357]:
   A  B
0  1  4
1  2  5
2  3  6

In [1358]: df[[c for c in df.columns if c in lst]]
Out[1358]:
   A  B
0  1  4
1  2  5
2  3  6

Timings

In [1360]: %timeit [c for c in df.columns if c in lst]
100000 loops, best of 3: 2.54 µs per loop

In [1359]: %timeit df.columns & lst
1000 loops, best of 3: 231 µs per loop

In [1362]: %timeit df.columns.intersection(lst)
1000 loops, best of 3: 236 µs per loop

In [1363]: %timeit np.intersect1d(df.columns, lst)
10000 loops, best of 3: 26.6 µs per loop

Details

In [1365]: df
Out[1365]:
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

In [1366]: lst
Out[1366]: ['A', 'R', 'B']
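
The list comprehension can be wrapped in a small helper. This is only a sketch: select_existing is a hypothetical name, and converting the names to a set is an added assumption that keeps the membership test cheap when the list is long. The result follows the column order of df, not the order of the list.

```python
import pandas as pd


def select_existing(df, names):
    """Return the columns of df whose labels appear in names (hypothetical helper)."""
    wanted = set(names)  # assumption: labels are hashable, as column labels normally are
    return df[[c for c in df.columns if c in wanted]]


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
data = select_existing(df, ['B', 'R', 'A'])
print(data.columns.tolist())  # column order follows df: ['A', 'B']
```
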
Answered By: Zero

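Use * to unpack the list:

data = df[[*lst]]

Note that [*lst] only builds a copy of lst, so this is equivalent to df[lst]: it gives the desired result when every name in lst is a column of df, and still raises the "not in index" error otherwise.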
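
To check how the unpacking form behaves with a missing name, here is a small sketch using the sample data from the question. The variable names same_elements and raised are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2]], columns=list('ABC'))
lst = list('ARB')

# Unpacking into a new list leaves the elements unchanged.
same_elements = [*lst] == lst

try:
    df[[*lst]]
    raised = False
except KeyError:
    # 'R' is not a column, so pandas raises KeyError just as df[lst] does.
    raised = True

print(same_elements, raised)
```
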

Answered By: Avinash

Please try this:

Syntax: DataFrame[[list of columns]]

For example: df[['a','b']]

a

Out[5]: 
    a  b   c
0   1  2   3
1  12  3  44

x is the list of required columns:

x = ['a','b']

This gives you the required slice:

a[x]

Out[7]: 
    a  b
0   1  2
1  12  3

Performance:

%timeit a[x]
333 µs ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A really simple solution here is to use filter(). In your example, just type:

df.filter(lst)

and it will automatically ignore any missing columns. For more, see the documentation for filter.
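
As a minimal sketch of that call on the question's sample data (the keyword form items=lst is used here for readability; it is the same first parameter that the positional call fills):

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2]], columns=list('ABC'))
lst = list('ARB')

# filter keeps the labels from lst that exist in df and silently skips 'R'.
data = df.filter(items=lst)
print(data.columns.tolist())
```
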

As a general note, filter is a very flexible and powerful way to select specific columns. In particular, you can use regular expressions. Borrowing the sample data from @jezrael, you could type either of the following.

df.filter(regex='A|R|B')
df.filter(regex='[ARB]')

Those are trivial examples, but suppose you wanted only columns starting with those letters, then you could type:

df.filter(regex='^[ARB]')
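
The difference between the anchored and unanchored patterns only shows up with longer column names. Here is a sketch with made-up column names (they are not from the sample data above), so treat it as an illustration rather than part of the original example:

```python
import pandas as pd

# Hypothetical column names chosen to show the effect of the ^ anchor.
df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['Alpha', 'Beta', 'cAt', 'Rho', 'delta'])

anywhere = df.filter(regex='[ARB]')    # names containing A, R or B anywhere
at_start = df.filter(regex='^[ARB]')   # only names starting with A, R or B

print(anywhere.columns.tolist())
print(at_start.columns.tolist())
```

The pattern is case-sensitive, so 'delta' matches neither form, while 'cAt' matches only the unanchored one.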

FWIW, in some quick timings I find this to be faster than the list comprehension method, but I don’t think speed is really much of a concern here. Even the slowest way should be fast enough, since the speed depends only on the number of columns, not on the number of rows in the dataframe.

Honestly, all of these ways are fine and you can go with whatever is most readable to you. I prefer filter because it is simple while also giving you more options for selecting columns than a simple intersection.

Answered By: JohnE