How would I group, summarize and filter a DF in pandas in dplyr-fashion?

Question

I’m currently studying pandas and I come from an R/dplyr/tidyverse background.

I find the pandas API less intuitive. How would I elegantly rewrite the following dplyr operation using pandas syntax?

library("nycflights13")
library("tidyverse")

delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

Asked By: Pedro Vinícius

Answer 1

We can write a chain of pandas methods that produces the same dataframe/tibble:

delays = (
    flights
    .groupby('dest', as_index=False)
    .agg({
        'year': 'count',
        'distance': 'mean',
        'arr_delay': 'mean',
    })
    .rename(columns={
        'year': 'count',
        'distance': 'dist',
        'arr_delay': 'delay',
    })
    .query('count > 20 & dest != "HNL"')
    .reset_index(drop=True)
)

It’s lengthier than the dplyr version: Pandas’ pd.DataFrame.agg method doesn’t allow much flexibility for changing columns’ names in the method itself.

But it’s as elegant, clean and clear as pandas allows us to go.
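One detail worth checking in the chain above: dplyr's n() counts rows, whereas the 'count' aggregation counts only the non-null values of the chosen column. The two agree here as long as year has no missing values. The following is a minimal sketch on a small made-up frame (the data and the names toy and per_dest are assumptions for illustration, not part of nycflights13) that compares 'count' with 'size', which is the closer analogue of n(). It also shows that 'mean' skips missing values by default, which matches na.rm = TRUE.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the flights table.
toy = pd.DataFrame({
    "dest": ["ATL", "ATL", "ATL", "BOS", "BOS"],
    "arr_delay": [5.0, np.nan, 7.0, np.nan, 3.0],
})

# 'size' counts rows per group, 'count' counts non-null values,
# and 'mean' ignores NaN by default.
per_dest = toy.groupby("dest")["arr_delay"].agg(["size", "count", "mean"])
print(per_dest)
```

On this toy frame, size and count differ for both destinations because each group contains one missing arr_delay.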

Answered By: Pedro Vinícius

Answer 2

“pd.DataFrame.agg method doesn’t allow much flexibility for changing columns’ names in the method itself”

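That’s not exactly true. You can rename the columns inside agg, much as in R, by passing keyword arguments of the form new_name=(column, function), a feature known as named aggregation. It is better not to use count as a column name, though, because count is also a DataFrame method, so delays.count returns the method rather than the column: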

delays = (
    flights
    .groupby('dest', as_index=False)
    .agg(
        count=('year', 'count'),
        dist=('distance', 'mean'),
        delay=('arr_delay', 'mean'),
    )
    .query('count > 20 & dest != "HNL"')
    .reset_index(drop=True)
)
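To make the naming caveat concrete, here is a small sketch on made-up data. The frame flights_toy, the column name n_flights, and the threshold of 1 (standing in for 20) are all assumptions chosen for illustration. It shows that a column called count is reachable only through bracket access, and that renaming it makes both access styles work.

```python
import pandas as pd

# Hypothetical toy data; the real flights table is far larger.
flights_toy = pd.DataFrame({
    "dest": ["ATL", "ATL", "ATL", "BOS", "BOS", "HNL", "HNL", "HNL", "ORD"],
    "distance": [760, 760, 760, 190, 190, 4980, 4980, 4980, 730],
    "arr_delay": [10.0, float("nan"), 20.0, 5.0, 15.0, -3.0, 0.0, 3.0, 8.0],
})

summary = (
    flights_toy
    .groupby("dest", as_index=False)
    .agg(
        count=("distance", "count"),
        dist=("distance", "mean"),
        delay=("arr_delay", "mean"),
    )
)

# Attribute access finds the DataFrame.count method, not the column.
print(callable(summary.count))
# Bracket access returns the column itself.
print(summary["count"].tolist())

# A name that does not clash with a DataFrame method (assumed name).
renamed = summary.rename(columns={"count": "n_flights"})
kept = (
    renamed
    .query('n_flights > 1 and dest != "HNL"')
    .reset_index(drop=True)
)
print(kept)
```

Inside query the name count still resolves to the column, so the original chain works as written. The clash only shows up with attribute-style access such as delays.count.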

Answered By: Nuri Taş
