Select multiple columns by labels in pandas

Question:

I’ve been looking through the Python documentation and the forums for ways to select columns, but every example of indexing columns is too simplistic.

Suppose I have a 10 x 10 dataframe

df = DataFrame(randn(10, 10), index=range(0,10), columns=['A', 'B', 'C', 'D','E','F','G','H','I','J'])

So far, all the documentation gives is a simple example of indexing like

subset = df.loc[:,'A':'C']

or

subset = df.loc[:,'C':]

But I get an error when I try to index multiple, non-sequential columns, like this

subset = df.loc[:,('A':'C', 'E')]

How would I index in pandas if I wanted to select columns A to C, E, and G to I? It appears that this logic will not work

subset = df.loc[:,('A':'C', 'E', 'G':'I')]

I feel that the solution is pretty simple, but I can’t get around this error. Thanks!

Asked By: Minh Mai

Answers:

Name- or Label-Based (using regular expression syntax)

df.filter(regex='[A-CEG-I]')   # does NOT depend on the column order

Note that any regular expression is allowed here, so this approach can be very general. E.g. if you wanted all columns starting with a capital or lowercase "A" you could use: df.filter(regex='^[Aa]')
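One thing worth checking before relying on this: filter matches the pattern with re.search, so an unanchored character class keeps any column whose name contains one of those letters anywhere. A minimal sketch, using made-up column names ('AB', 'DIG', 'E', 'x') that are not from the question:

```python
import pandas as pd

# Hypothetical frame with multi-character column names.
demo = pd.DataFrame([[1, 2, 3, 4]], columns=['AB', 'DIG', 'E', 'x'])

# Unanchored: keeps every name that contains any of A-C, E, or G-I.
loose = list(demo.filter(regex='[A-CEG-I]').columns)

# Anchored: keeps only names that consist of exactly one such letter.
strict = list(demo.filter(regex='^[A-CEG-I]$').columns)

print(loose)
print(strict)
```

With single-letter headers like the OP's the two patterns behave the same, so anchoring only matters once the names get longer.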

Location-Based (depends on column order)

df[ list(df.loc[:,'A':'C']) + ['E'] + list(df.loc[:,'G':'I']) ]

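Note that unlike the label-based method, this depends on the order of the columns in the DataFrame: each slice runs from the position of its first label to the position of its last, so 'A':'C' returns A, B and C only if those columns sit next to each other in that order. This is not necessarily a problem, however. For example, if your columns go ['A','C','B'], then you could replace 'A':'C' above with 'A':'B'.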
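To see how the column order affects the loc slices, here is a small sketch on a hypothetical four-column frame whose headers are deliberately out of alphabetical order (it assumes unique column labels):

```python
import pandas as pd

# Hypothetical frame whose columns are not in alphabetical order.
demo = pd.DataFrame([[1, 2, 3, 4]], columns=['A', 'C', 'B', 'D'])

# The slice runs from the position of 'A' to the position of 'C', inclusive.
a_to_c = list(demo.loc[:, 'A':'C'])

# To pick up 'B' as well, the slice has to end at the last column wanted.
a_to_b = list(demo.loc[:, 'A':'B'])

print(a_to_c)
print(a_to_b)
```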

The Long Way

And for completeness, you always have the option shown by @Magdalena of simply listing each column individually, although it becomes much more verbose as the number of columns increases:

df[['A','B','C','E','G','H','I']]   # does NOT depend on the column order

Results for any of the above methods

          A         B         C         E         G         H         I
0 -0.814688 -1.060864 -0.008088  2.697203 -0.763874  1.793213 -0.019520
1  0.549824  0.269340  0.405570 -0.406695 -0.536304 -1.231051  0.058018
2  0.879230 -0.666814  1.305835  0.167621 -1.100355  0.391133  0.317467
Answered By: JohnE

Just pick the columns you want directly:

df[['A','E','I','C']]
Answered By: Magdalena

How do I select multiple columns by labels in pandas?

Multiple label-based range slicing is not easily supported with pandas, but position-based slicing is, so let’s try that instead:

import numpy as np

loc = df.columns.get_loc   # maps a column label to its integer position
df.iloc[:, np.r_[loc('A'):loc('C')+1, loc('E'), loc('G'):loc('I')+1]]

          A         B         C         E         G         H         I
0 -1.666330  0.321260 -1.768185 -0.034774  0.023294  0.533451 -0.241990
1  0.911498  3.408758  0.419618 -0.462590  0.739092  1.103940  0.116119
2  1.243001 -0.867370  1.058194  0.314196  0.887469  0.471137 -1.361059
3 -0.525165  0.676371  0.325831 -1.152202  0.606079  1.002880  2.032663
4  0.706609 -0.424726  0.308808  1.994626  0.626522 -0.033057  1.725315
5  0.879802 -1.961398  0.131694 -0.931951 -0.242822 -1.056038  0.550346
6  0.199072  0.969283  0.347008 -2.611489  0.282920 -0.334618  0.243583
7  1.234059  1.000687  0.863572  0.412544  0.569687 -0.684413 -0.357968
8 -0.299185  0.566009 -0.859453 -0.564557 -0.562524  0.233489 -0.039145
9  0.937637 -2.171174 -1.940916 -1.553634  0.619965 -0.664284 -0.151388

Note that the +1 is added because the slices passed to np.r_ (like positional slices with iloc in general) exclude their right endpoint, whereas label slices with loc include it.
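If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: select_label_ranges is a hypothetical name, not a pandas function, and it assumes unique column labels (so get_loc returns an integer) and that none of the labels are themselves tuples. It builds the positions with range instead of np.r_, which does the same job here.

```python
import numpy as np
import pandas as pd

def select_label_ranges(df, *specs):
    """Select columns by position; each spec is a label or a (first, last) pair."""
    loc = df.columns.get_loc  # label -> integer position (unique labels assumed)
    positions = []
    for spec in specs:
        if isinstance(spec, tuple):
            first, last = spec
            # +1 so that the last label of the range is included
            positions.extend(range(loc(first), loc(last) + 1))
        else:
            positions.append(loc(spec))
    return df.iloc[:, positions]

df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
subset = select_label_ranges(df, ('A', 'C'), 'E', ('G', 'I'))
print(list(subset.columns))
```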


Comments on Other Solutions

  • filter is a nice and simple method for OP’s headers, but this might not generalise well to arbitrary column names.

  • The "location-based" solution with loc is a little closer to the ideal, but you cannot avoid creating intermediate DataFrames (that are eventually thrown out and garbage collected) to compute the final result range — something that we would ideally like to avoid.

  • Lastly, "pick your columns directly" is good advice as long as you have a manageably small number of columns to pick. It will, however, not be applicable in some cases where ranges span dozens (or possibly hundreds) of columns.
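On the intermediate-DataFrame point: one way around it, if you prefer to stay with labels, is to slice the column Index rather than the frame. A rough sketch, assuming unique column labels and the OP's A to J headers; Index.slice_indexer converts a label range into a positional slice with the end label included:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
cols = df.columns

# Only the column Index is sliced here, never the frame itself.
labels = [
    *cols[cols.slice_indexer('A', 'C')],
    'E',
    *cols[cols.slice_indexer('G', 'I')],
]
subset = df[labels]
print(labels)
```

The frame is indexed once, at the very end, with a plain list of labels.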

Answered By: cs95

One option for selecting multiple slices is with select_columns from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor
from numpy import random
random.seed(3)
df = pd.DataFrame(
            random.randn(10, 10), 
            index=range(0,10), 
            columns=['A', 'B', 'C', 'D','E','F','G','H','I','J']
            )

df.select_columns(slice('A', 'C'), 'E', slice('G', 'I'))

          A         B         C         E         G         H         I
0  1.788628  0.436510  0.096497 -0.277388 -0.082741 -0.627001 -0.043818
1 -1.313865  0.884622  0.881318  0.050034 -0.545360 -1.546477  0.982367
2 -1.185047 -0.205650  1.486148 -1.023785  0.625245 -0.160513 -0.768836
3  0.745056  1.976111 -1.244123 -0.803766 -0.923792 -1.023876  1.123978
4 -1.623285  0.646675 -0.356271 -0.596650 -0.873882  0.029714 -2.248258
5  1.013183  0.852798  1.108187  1.487543  0.845833 -1.860890 -0.602885
6  1.048148  1.333738 -0.197415 -0.674728  0.152946 -1.064195  0.437947
7 -1.024931  0.899338 -0.154507  0.483788  0.643163  0.249087 -1.395764
8 -1.370669  0.238563  0.614077  0.145063 -0.024104 -0.888657 -2.915738
9 -0.591079 -0.516417 -0.959996 -0.574708  0.679072 -0.855437 -0.300206

The caveat here is that you have to explicitly use Python’s built-in slice.

Just like the excellent chosen answer, you can use regular expressions; again, the use is explicit (via Python’s re module):

import re

df.select_columns(re.compile('[A-CEG-I]'))

          A         B         C         E         G         H         I
0  1.788628  0.436510  0.096497 -0.277388 -0.082741 -0.627001 -0.043818
1 -1.313865  0.884622  0.881318  0.050034 -0.545360 -1.546477  0.982367
2 -1.185047 -0.205650  1.486148 -1.023785  0.625245 -0.160513 -0.768836
3  0.745056  1.976111 -1.244123 -0.803766 -0.923792 -1.023876  1.123978
4 -1.623285  0.646675 -0.356271 -0.596650 -0.873882  0.029714 -2.248258
5  1.013183  0.852798  1.108187  1.487543  0.845833 -1.860890 -0.602885
6  1.048148  1.333738 -0.197415 -0.674728  0.152946 -1.064195  0.437947
7 -1.024931  0.899338 -0.154507  0.483788  0.643163  0.249087 -1.395764
8 -1.370669  0.238563  0.614077  0.145063 -0.024104 -0.888657 -2.915738
9 -0.591079 -0.516417 -0.959996 -0.574708  0.679072 -0.855437 -0.300206

You can go crazy and combine different selection options within the select_columns method.

Answered By: sammywemmy