Difference between 2 Polars dataframes

Question:

What is the best way to find the differences between 2 Polars dataframes?
The frame_equal method tells me if there is a difference, I want to find where is the difference.

Example:

import polars as pl

df1 = pl.DataFrame([
    {'id': 1,'col1': ['a',None],'col2': ['x']},
    {'id': 2,'col1': ['b'],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘


df2 = pl.DataFrame([
    {'id': 1,'col1': ['a'],'col2': ['x']},
    {'id': 2,'col1': ['b', None],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘

The difference in the example is for id = 1 and id = 2.

I can join the dataframes:

df1.join(df2, on='id', suffix='_df2')
┌─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        ┆ col1_df2    ┆ col2_df2    │
│ --- ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┴─────────────┴─────────────┘

Expected result

I would like to either:

  • add a boolean columns that shows True in the rows with a difference
  • filter and only display rows with a difference.

The example has only 2 columns, but there are more columns in the dataframe.

Asked By: Luca

||

Answers:

Here’s the filter approach

(
    df1.join(df2, on='id', suffix='_df2') 
    .filter(pl.any_horizontal(
        pl.col(x).ne_missing(pl.col(f"{x}_df2"))
            for x in df1.columns if x!='id'
        ))
)

If you wanted the bool column then you just change the filter to with_columns and add an alias.

(
    df1.join(df2, on='id', suffix='_df2')
    .with_columns(
        has_diff = pl.any_horizontal(
            pl.col(x).ne_missing(pl.col(f"{x}_df2"))
            for x in df1.columns if x!='id'
            )
        )
)

This assumes that each df has all the same columns other than ‘id’.

Answered By: Dean MacGregor

You can also use == to compare frames:

>>> df1 == df2
shape: (3, 3)
┌──────┬───────┬──────┐
│ id   | col1  | col2 │
│ ---  | ---   | ---  │
│ bool | bool  | bool │
╞══════╪═══════╪══════╡
│ true | false | true │
│ true | false | true │
│ true | true  | true │
└──────┴───────┴──────┘
Answered By: jqurious
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.