How would I group, summarize and filter a DF in pandas in dplyr-fashion?

Question:

I’m currently studying pandas and I come from an R/dplyr/tidyverse background.

Pandas has a not-so-intuitive API and how would I elegantly rewrite such operation from dplyr using pandas syntax?

library("nycflights13")
library("tidyverse")

delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
Asked By: Pedro Vinícius

||

Answers:

We can write a pandas concatenation of functions and methods that results in the same dataframe/tibble:

delays = (
    flights
    .groupby('dest', as_index=False)
    .agg({
        'year': 'count',
        'distance': 'mean',
        'arr_delay': 'mean',
    })
    .rename(columns={
        'year': 'count',
        'distance': 'dist',
        'arr_delay': 'delay',
    })
    .query('count > 20 & dest != "HNL"')
    .reset_index(drop=True)
)

It’s more lengthy: Pandas’ pd.DataFrame.agg method doesn’t allow much flexibility for changing columns’ names in the method itself.

But it’s as elegant, clean and clear as pandas allows us to go.

Answered By: Pedro Vinícius

pd.DataFrame.agg method doesn’t allow much flexibility for changing columns’ names in the method itself

That’s not exactly true. You could actually rename the columns inside agg similar to in R although it is a better idea to not use count as a column name as it is also an attribute:

    delays = (
    flights
    .groupby('dest', as_index=False)
    .agg(
        count=('year', 'count'),
        dist=('distance', 'mean'),
        delay=('arr_delay', 'mean'))
    .query('count > 20 & dest != "HNL"')
    .reset_index(drop=True)
)
Answered By: Nuri Taş
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.