Grouping dataframe by similar non-matching values

Question:

I have a pandas dataframe with the following columns: id, num, amount.

I want to group the dataframe such that all rows in each group have the same id and amount, and each row’s value of num is no more than 10 larger or smaller than the next row’s value of num.

For the same id, if one row to the next does not have the same amount or if the absolute difference between the two num values is more than 10 then it will start a new grouping. Having a row with a different id in the middle does not break a grouping.

How can I go about doing this?

I have not managed to make a grouping where I’m not looking for matching values (like here, where I need the values to be close but not matching). I’m assuming that this would need some custom grouping function, but I’ve been having trouble putting one together.

Example dataframe:

id amount num
aaa-aaa 130 12
aaa-aaa 130 39
bbb-bbb 270 41
ccc-ccc 130 19
bbb-bbb 270 37
aaa-aaa 130 42
aaa-aaa 380 39

Expected Groups:

Group 1:

id amount num
aaa-aaa 130 12

Group 2:

id amount num
aaa-aaa 130 39
aaa-aaa 130 42

Group 3:

id amount num
bbb-bbb 270 41
bbb-bbb 270 37

Group 4:

id amount num
ccc-ccc 130 19

Group 5:

id amount num
aaa-aaa 380 39
Asked By: yem


Answers:

The logic is not fully clear, but assuming you want to start a new group when there is a gap of more than 10:

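# label runs of "num" within each amount: a gap of more than 10 starts a new run
close = (df.sort_values(by=['amount', 'num'])
           .groupby('amount')
           ['num'].diff().abs().gt(10).cumsum()
         )

# group on amount plus the run label (aligned back to df by index)
for _, g in df.groupby(['amount', close]):
    print(g, end='\n\n')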

Output:

        id  amount  num
0  aaa-aaa     130   12
3  ddd-ddd     130   19

        id  amount  num
1  bbb-bbb     130   39

        id  amount  num
2  ccc-ccc     270   41
4  eee-eee     270   37

How it works:

# sort values by amount/num
df.sort_values(by=['amount', 'num'])

        id  amount  num
0  aaa-aaa     130   12
3  ccc-ccc     130   19
1  aaa-aaa     130   39
5  aaa-aaa     130   42
4  bbb-bbb     270   37
2  bbb-bbb     270   41
6  aaa-aaa     380   39

# get the absolute successive difference in "num"
(df.sort_values(by=['amount', 'num'])
   .groupby('amount')
   ['num'].diff().abs()
)

0     NaN
3     7.0
1    20.0
5     3.0
4     NaN
2     4.0
6     NaN
Name: num, dtype: float64

# check if it's greater than 10 and cumsum
# to create a grouper for groupby

[...].gt(10).cumsum()

0    0
3    0
1    1
5    1
4    1
2    1
6    1
Name: num, dtype: int64
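
The snippets above only key on amount. If the id should be part of the grouping as well, as the question's expected groups suggest, the same diff/cumsum idea can be sketched with id added to the sort and groupby keys. This is a minimal sketch rather than a definitive solution: the dataframe is rebuilt from the question's example, the names keys, run and groups are illustrative only, and it assumes that comparing num values in sorted order (instead of original row order) is acceptable.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "id": ["aaa-aaa", "aaa-aaa", "bbb-bbb", "ccc-ccc",
           "bbb-bbb", "aaa-aaa", "aaa-aaa"],
    "amount": [130, 130, 270, 130, 270, 130, 380],
    "num": [12, 39, 41, 19, 37, 42, 39],
})

keys = ["id", "amount"]  # assumption: both columns must match within a group

# Within each (id, amount) pair, a gap of more than 10 between
# successive sorted "num" values marks the start of a new run.
ordered = df.sort_values(keys + ["num"])
new_run = ordered.groupby(keys)["num"].diff().abs().gt(10)

# The cumulative sum turns the start-of-run flags into run labels;
# assign() aligns them back to df's original row order by index.
labelled = df.assign(run=new_run.cumsum())

groups = [g for _, g in labelled.groupby(keys + ["run"])]
for g in groups:
    print(g, end="\n\n")
```

On the example data this yields five groups. The run label is a global counter, so it is only meaningful in combination with id and amount.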
Answered By: mozway

Sort by amount and num, then add an auxiliary marker column that flags whether the absolute difference between consecutive num values is within the threshold, and group on amount and that marker:

groups = (df.sort_values(['amount', 'num'])
            .assign(diff_=lambda x: x['num'].diff().abs().fillna(0).le(10))
            .groupby(['amount', 'diff_']))
for _, g in groups:
    print(g)

         id  amount  num  diff_
1  bbb-bbb      130   39  False
         id  amount  num  diff_
0  aaa-aaa      130   12   True
3  ddd-ddd      130   19   True
         id  amount  num  diff_
4  eee-eee      270   37   True
2  ccc-ccc      270   41   True
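
One thing to watch with a True/False marker: it can only split each amount into two buckets, so separate runs of close values may end up in the same group. The sketch below compares the boolean marker with a cumulative-sum run label on a small hypothetical series (the values in num and the names close_flag and run_label are made up for illustration, not taken from the answer).

```python
import pandas as pd

# Hypothetical, already sorted values: three runs of "close" numbers.
num = pd.Series([12, 19, 39, 42, 70])

# Boolean marker: is this value within 10 of the previous one?
close_flag = num.diff().abs().fillna(0).le(10)

# Run label: add 1 every time the gap to the previous value exceeds 10.
run_label = num.diff().abs().gt(10).cumsum()

flag_groups = {k: g.tolist() for k, g in num.groupby(close_flag)}
run_groups = {k: g.tolist() for k, g in num.groupby(run_label)}

print(flag_groups)
print(run_groups)
```

With the boolean marker, 42 lands in the same bucket as 12 and 19 even though it is far from both, while the run label keeps the three runs apart.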
Answered By: RomanPerekhrest