How to get max value record per a group in polars dataframe?

Question:

I have a dataframe as:

pl.DataFrame(
[{'row_nr': 0, 'middle_name_or_initial': 'R.', 'count': 8},
 {'row_nr': 0, 'middle_name_or_initial': 'B.', 'count': 1},
 {'row_nr': 1, 'middle_name_or_initial': 'D.', 'count': 1},
 {'row_nr': 1, 'middle_name_or_initial': 'J.', 'count': 11},
 {'row_nr': 2, 'middle_name_or_initial': 'Micha', 'count': 1},
 {'row_nr': 2, 'middle_name_or_initial': 'M.', 'count': 1},
 {'row_nr': 2, 'middle_name_or_initial': 'J.', 'count': 1},
 {'row_nr': 3, 'middle_name_or_initial': 'E.', 'count': 1},
 {'row_nr': 3, 'middle_name_or_initial': 'S.', 'count': 1},
 {'row_nr': 4, 'middle_name_or_initial': 'M.', 'count': 1},
 {'row_nr': 4, 'middle_name_or_initial': 'P.', 'count': 1},
 {'row_nr': 5, 'middle_name_or_initial': 'Christopher', 'count': 15},
 {'row_nr': 5, 'middle_name_or_initial': 'Robert', 'count': 1},
 {'row_nr': 6, 'middle_name_or_initial': 'Lusi', 'count': 1},
 {'row_nr': 6, 'middle_name_or_initial': 'Luis', 'count': 1}])

Here I would like to select the observations on a condition as: on grouping row_nr and middle_name the highest count observation to be given as output per a group.

if the counts are equal i.e 1 all of these group rows will be returned.

expected output as

enter image description here

Asked By: myamulla_ciencia

||

Answers:

import polars as pl
tmp = pl.DataFrame(
[{'row_nr': 0, 'middle_name_or_initial': 'R.', 'count': 8},
 {'row_nr': 0, 'middle_name_or_initial': 'B.', 'count': 1},
 {'row_nr': 1, 'middle_name_or_initial': 'D.', 'count': 1},
 {'row_nr': 1, 'middle_name_or_initial': 'J.', 'count': 11},
 {'row_nr': 2, 'middle_name_or_initial': 'Micha', 'count': 1},
 {'row_nr': 2, 'middle_name_or_initial': 'M.', 'count': 1},
 {'row_nr': 2, 'middle_name_or_initial': 'J.', 'count': 1},
 {'row_nr': 3, 'middle_name_or_initial': 'E.', 'count': 1},
 {'row_nr': 3, 'middle_name_or_initial': 'S.', 'count': 1},
 {'row_nr': 4, 'middle_name_or_initial': 'M.', 'count': 1},
 {'row_nr': 4, 'middle_name_or_initial': 'P.', 'count': 1},
 {'row_nr': 5, 'middle_name_or_initial': 'Christopher', 'count': 15},
 {'row_nr': 5, 'middle_name_or_initial': 'Robert', 'count': 1},
 {'row_nr': 6, 'middle_name_or_initial': 'Lusi', 'count': 1},
 {'row_nr': 6, 'middle_name_or_initial': 'Luis', 'count': 1}])

    df = (
    tmp
    .groupby(['row_nr','middle_name_or_initial'])
    .agg(
        [
            pl.col("count").max(),
        ]
    )
    )

df

Output:

enter image description here

Answered By: Will

Here you go. It would have been nice if you had a clearer example, with fewer rows and more cases of count != 1, as well as the expected solution as code rather than screenshot so the responder can test their solution (using test_df.frame_equal(solution_df)). See below for a suggestion for how to create this.

I also think you could have phrased your question better: You said

on grouping row_nr and middle_name the highest count observation to be given as output per a group

but your suggested solution implies that you want to group on row_nr and select the rows containing the highest count. "grouping row_nr and middle_name" is what I believe set @Will on his track.

import polars as pl

pl.Config.set_tbl_rows(100) # set max print row-number high enough

tmp = pl.DataFrame(
    [
        {"row_nr": 0, "middle_name_or_initial": "R.", "count": 8},
        {"row_nr": 0, "middle_name_or_initial": "B.", "count": 1},
        {"row_nr": 1, "middle_name_or_initial": "D.", "count": 1},
        {"row_nr": 1, "middle_name_or_initial": "J.", "count": 11},
        {"row_nr": 2, "middle_name_or_initial": "Micha", "count": 1},
        {"row_nr": 2, "middle_name_or_initial": "M.", "count": 1},
        {"row_nr": 2, "middle_name_or_initial": "J.", "count": 1},
        {"row_nr": 3, "middle_name_or_initial": "E.", "count": 1},
        {"row_nr": 3, "middle_name_or_initial": "S.", "count": 1},
        {"row_nr": 4, "middle_name_or_initial": "M.", "count": 1},
        {"row_nr": 4, "middle_name_or_initial": "P.", "count": 1},
        {"row_nr": 5, "middle_name_or_initial": "Christopher", "count": 15},
        {"row_nr": 5, "middle_name_or_initial": "Robert", "count": 1},
        {"row_nr": 6, "middle_name_or_initial": "Lusi", "count": 1},
        {"row_nr": 6, "middle_name_or_initial": "Luis", "count": 1},
    ]
)

df = (
    tmp.with_column(pl.col("count").max().over("row_nr").alias("row_max"))
    .filter(pl.col("count") == pl.col("row_max"))
    .select(["count", "middle_name_or_initial", "row_nr"])
)
df

------

shape: (12, 3)
┌───────┬────────────────────────┬────────┐
│ count ┆ middle_name_or_initial ┆ row_nr │
│ ---   ┆ ---                    ┆ ---    │
│ i64   ┆ str                    ┆ i64    │
╞═══════╪════════════════════════╪════════╡
│ 8     ┆ R.                     ┆ 0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11    ┆ J.                     ┆ 1      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ Micha                  ┆ 2      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ M.                     ┆ 2      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ J.                     ┆ 2      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ E.                     ┆ 3      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ S.                     ┆ 3      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ M.                     ┆ 4      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ P.                     ┆ 4      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15    ┆ Christopher            ┆ 5      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ Lusi                   ┆ 6      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ Luis                   ┆ 6      │
└───────┴────────────────────────┴────────┘

Here is a way of creating a test dataframe code from the result above:

df.to_dict(as_series=False)
---
{'count': [8, 11, 1, 1, 1, 1, 1, 1, 1, 15, 1, 1],
 'middle_name_or_initial': ['R.',
  'J.',
  'Micha',
  'M.',
  'J.',
  'E.',
  'S.',
  'M.',
  'P.',
  'Christopher',
  'Lusi',
  'Luis'],
 'row_nr': [0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6]}

This can then be written as a test case in your question (in case you have future questions):

df_test = pl.DataFrame(
    {
        "count": [8, 11, 1, 1, 1, 1, 1, 1, 1, 15, 1, 1],
        "middle_name_or_initial": [
            "R.",
            "J.",
            "Micha",
            "M.",
            "J.",
            "E.",
            "S.",
            "M.",
            "P.",
            "Christopher",
            "Lusi",
            "Luis",
        ],
        "row_nr": [0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6],
    }
)
Answered By: TomNorway
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.