How to compare repeated values ​of the same column of a dataframe in python

Question:

I’m using this dataset to solve a problem ->https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.

I used this code to see repeated values in the column "id": pd.concat(g for _, g in df.groupby("id") if len(g) > 1) and the output is this:

output

what i’m want to do is to compare repeated values, e.g: id 1000102 appear two times, that means it was sold more than once in the range. What I would like to do is take these two entries and compare the sale value in the "price" column to find the biggest variances, but with all the entries. For example sorting from highest to lowest variation.

Asked By: Ismael Pauchner

||

Answers:

In groups by 'id': from the LAST element subtract the FIRST one. This will be the diff value that corresponds to the 'id' of the group.

diffs = df_2.groupby('id')['price'].agg(lambda x: x.iloc[-1] - x.iloc[0] )
diffs.name = 'diff'

You can aggregate with any function here! For example: lambda x: x.max() - x.min().

If you’d like to join the diffs Series as a column to the df:

df.iloc[:, :3].merge(diffs, left_on='id', right_index=True)

Other

This question provides a better version of leaving only groups with more than 1 element:

df_2 = df.groupby('id').filter(lambda x: len(x) > 1)
Answered By: Vladimir Fokow
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.