In pandas, what would be the idiomatic way to select multiple columns based on different patterns?

Question:

I’m trying to replicate some of R4DS’s dplyr exercises using Python’s pandas, with the nycflights13.flights dataset. What I want to do is select, from that dataset:

  • Columns through year to day (inclusive);
  • All columns that end with "delay";
  • The distance and air_time columns

In the book, Hadley uses the following syntax:

library("tidyverse")
library("nycflights13")

flights_sml <- select(flights,
   year:day,
   ends_with("delay"),
   distance,
   air_time
)

In pandas, I came up with the following "solution":

import pandas as pd
from nycflights13 import flights

flights_sml = pd.concat([
    flights.loc[:, 'year':'day'],
    flights.loc[:, flights.columns.str.endswith("delay")],
    flights.distance,
    flights.air_time,
], axis=1)

Another possible implementation:

flights_sml = flights.filter(regex='year|day|month|delay$|^distance$|^air_time$', axis=1)

But I’m sure this is not the idiomatic way to write such DF-operation. I digged around, but haven’t found something that fits in this situation from pandas API.

Asked By: Pedro Vinícius

||

Answers:

You are correct. This will create multiple dataframes/series and then concatenate them together, resulting in a lot of extra work. Instead, you can create a list of the columns you want to use and then simply select those.

For example (keeping the same column order):

cols = ['year', 'month', 'day'] + [col for col in flights.columns if col.endswith('delay')] + ['distance', 'air_time']
flights_sml = flights[cols]
Answered By: Shaido

According to the dataset’s columns info, we can utilize str.contains

df.loc[:, df.columns.str.contains('year|month|day|delay|distance|air_time')]
   year  month  day  dep_delay  arr_delay  air_time  distance
0  2013      1    1        2.0       11.0     227.0      1400
1  2013      1    1        4.0       20.0     227.0      1416
2  2013      1    1        2.0       33.0     160.0      1089
3  2013      1    1       -1.0      -18.0     183.0      1576
4  2013      1    1       -6.0      -25.0     116.0       762
Answered By: Baron Legendre

One option is with select_columns from pyjanitor:

# pip install pyjanitor
import pandas as pd 
import janitor
from nycflights13 import flights

flights_sml = flights.select_columns(
                          slice('year', 'day'), 
                          '*delay', 
                          'distance', 
                          'air_time'
                       )

flights_sml.head()
   year  month  day  dep_delay  arr_delay  distance  air_time
0  2013      1    1        2.0       11.0      1400     227.0
1  2013      1    1        4.0       20.0      1416     227.0
2  2013      1    1        2.0       33.0      1089     160.0
3  2013      1    1       -1.0      -18.0      1576     183.0
4  2013      1    1       -6.0      -25.0       762     116.0
Answered By: sammywemmy
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.