How to add a new field with the counts per group criteria in python polars?
Question:
I have a small use case and here is a polars dataframe.
df_names = pl.DataFrame({'LN'['Mallesham','Bhavik','Mallesham','Bhavik','Mahesh','Naresh','Sharath','Rakesh','Mallesham'],
'FN':['Yamulla','Yamulla','Yamulla','Yamulla','Dayala','Burre','Velmala','Uppu','Yamulla'],
'SSN':['123','456','123','456','893','111','222','333','123'],
'Address':['A','B','C','D','E','F','G','H','S']})
Here I would like to group on LN,FN,SSN and create a new column in which how many number of observations for this group combination and below is the expected output.
‘Mallesham’,’Yamulla’,’123′ is appeared 3 times, hence LN_FN_SSN_count field is filled up with 3.
Answers:
You can use an expression using over
(which is like grouping, aggregating and self-joining in other libs, but without the need for the join):
df_names.with_column(pl.count().over(['LN', 'FN', 'SSN']).alias('LN_FN_SSN_count'))
┌───────────┬─────────┬─────┬─────────┬─────────────────┐
│ LN ┆ FN ┆ SSN ┆ Address ┆ LN_FN_SSN_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ u32 │
╞═══════════╪═════════╪═════╪═════════╪═════════════════╡
│ Mallesham ┆ Yamulla ┆ 123 ┆ A ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ Yamulla ┆ 456 ┆ B ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ Yamulla ┆ 123 ┆ C ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ Yamulla ┆ 456 ┆ D ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Naresh ┆ Burre ┆ 111 ┆ F ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Sharath ┆ Velmala ┆ 222 ┆ G ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Rakesh ┆ Uppu ┆ 333 ┆ H ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ Yamulla ┆ 123 ┆ S ┆ 3 │
└───────────┴─────────┴─────┴─────────┴─────────────────┘
I have a small use case and here is a polars dataframe.
df_names = pl.DataFrame({'LN'['Mallesham','Bhavik','Mallesham','Bhavik','Mahesh','Naresh','Sharath','Rakesh','Mallesham'],
'FN':['Yamulla','Yamulla','Yamulla','Yamulla','Dayala','Burre','Velmala','Uppu','Yamulla'],
'SSN':['123','456','123','456','893','111','222','333','123'],
'Address':['A','B','C','D','E','F','G','H','S']})
Here I would like to group on LN,FN,SSN and create a new column in which how many number of observations for this group combination and below is the expected output.
‘Mallesham’,’Yamulla’,’123′ is appeared 3 times, hence LN_FN_SSN_count field is filled up with 3.
You can use an expression using over
(which is like grouping, aggregating and self-joining in other libs, but without the need for the join):
df_names.with_column(pl.count().over(['LN', 'FN', 'SSN']).alias('LN_FN_SSN_count'))
┌───────────┬─────────┬─────┬─────────┬─────────────────┐
│ LN ┆ FN ┆ SSN ┆ Address ┆ LN_FN_SSN_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ u32 │
╞═══════════╪═════════╪═════╪═════════╪═════════════════╡
│ Mallesham ┆ Yamulla ┆ 123 ┆ A ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ Yamulla ┆ 456 ┆ B ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ Yamulla ┆ 123 ┆ C ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ Yamulla ┆ 456 ┆ D ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Naresh ┆ Burre ┆ 111 ┆ F ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Sharath ┆ Velmala ┆ 222 ┆ G ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Rakesh ┆ Uppu ┆ 333 ┆ H ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ Yamulla ┆ 123 ┆ S ┆ 3 │
└───────────┴─────────┴─────┴─────────┴─────────────────┘