How to ignore EMPTY/NULL value columns while grouping in python polars?
Question:
I have a dataframe.
df_X = pl.DataFrame({
    'last_name': ['James', 'Warner', 'Marino', 'James', 'Warner', 'Marino', 'James'],
    'first_name': ['Horn', 'Bro', 'Kach', 'Horn', 'Bro', 'Kach', 'Horn'],
    'dob': ['03/06/1990', '09/16/1990', '03/06/1990', '', '03/06/1990', '', ''],
})
I'm grouping on the last_name, first_name and dob columns to get the counts:
df_X.groupby(['last_name','first_name','dob']).agg(pl.count())
Here I would like to ignore null/empty values in the grouping columns. For example,
James Horn has two empty DOBs; those rows should not be included in the grouping operation.
Here is the expected output.
Of course, we can filter the column before passing it to the grouping:
df_X.filter(pl.col('dob') != "").groupby(['last_name','first_name','dob']).agg(pl.count())
But what if I have 10 columns to specify in the filter? I'd need to write them out one after another.
Is there another solution?
Answers:
First replace the empty strings with null values, then use drop_nulls:
group_columns = ['last_name', 'first_name', 'dob']

(
    df_X
    .with_columns(
        # Turn empty strings in the grouping columns into nulls,
        # keeping the original column names
        pl.when(pl.col(group_columns).str.lengths() == 0)
        .then(None)
        .otherwise(pl.col(group_columns))
        .keep_name()
    )
    # Drop any row with a null in a grouping column, then group
    .drop_nulls(group_columns)
    .groupby(group_columns)
    .count()
)
shape: (4, 4)
┌───────────┬────────────┬────────────┬───────┐
│ last_name ┆ first_name ┆ dob ┆ count │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ u32 │
╞═══════════╪════════════╪════════════╪═══════╡
│ Warner ┆ Bro ┆ 09/16/1990 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Marino ┆ Kach ┆ 03/06/1990 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Warner ┆ Bro ┆ 03/06/1990 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ James ┆ Horn ┆ 03/06/1990 ┆ 1 │
└───────────┴────────────┴────────────┴───────┘