Grouping dataframe by similar non matching values

Question

I have a pandas dataframe with the following columns: id, num, amount.

I want to group the dataframe so that all rows in each group have the same id and amount, and each row's num differs from the next row's num by no more than 10.

For the same id, a new group starts whenever the amount changes from one row to the next, or the absolute difference between the two num values is more than 10. A row with a different id in between does not break a group.

How can I go about doing this?

I have not managed to build a grouping that is based on closeness rather than on matching values (here the num values need to be close, not equal). I assume this needs some custom grouping function, but I have had trouble putting one together.

Example dataframe:

id	amount	num
aaa-aaa	130	12
aaa-aaa	130	39
bbb-bbb	270	41
ccc-ccc	130	19
bbb-bbb	270	37
aaa-aaa	130	42
aaa-aaa	380	39
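
For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: the variable name df and the integer dtypes are assumptions, not something fixed by the question.

```python
import pandas as pd

# Rebuild the example table; column order follows the table above.
df = pd.DataFrame({
    "id": ["aaa-aaa", "aaa-aaa", "bbb-bbb", "ccc-ccc",
           "bbb-bbb", "aaa-aaa", "aaa-aaa"],
    "amount": [130, 130, 270, 130, 270, 130, 380],
    "num": [12, 39, 41, 19, 37, 42, 39],
})

print(df)
```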

Expected Groups:

Group 1:

id	amount	num
aaa-aaa	130	12

Group 2:

id	amount	num
aaa-aaa	130	39
aaa-aaa	130	42

Group 3:

id	amount	num
bbb-bbb	270	41
bbb-bbb	270	37

Group 4:

id	amount	num
ccc-ccc	130	19

Group 5:

id	amount	num
aaa-aaa	380	39

Asked By: yem

Answer 1

The logic is not fully clear, but assuming you want to start a new group when there is a gap of more than 10:

close = (df.sort_values(by=['amount', 'num'])
           .groupby('amount')
           ['num'].diff().abs().gt(10).cumsum()
         )

for _, g in df.groupby(['amount', close]):
    print(g, end='\n\n')

Output:

        id  amount  num
0  aaa-aaa     130   12
3  ddd-ddd     130   19

        id  amount  num
1  bbb-bbb     130   39

        id  amount  num
2  ccc-ccc     270   41
4  eee-eee     270   37

How it works:

# sort values by amount/num
df.sort_values(by=['amount', 'num'])

        id  amount  num
0  aaa-aaa     130   12
3  ccc-ccc     130   19
1  aaa-aaa     130   39
5  aaa-aaa     130   42
4  bbb-bbb     270   37
2  bbb-bbb     270   41
6  aaa-aaa     380   39

# get the absolute successive difference in "num"
(df.sort_values(by=['amount', 'num'])
   .groupby('amount')
   ['num'].diff().abs()
)

0     NaN
3     7.0
1    20.0
5     3.0
4     NaN
2     4.0
6     NaN
Name: num, dtype: float64

# check if it's greater than 10 and cumsum
# to create a grouper for groupby

[...].gt(10).cumsum()

0    0
3    0
1    1
5    1
4    1
2    1
6    1
Name: num, dtype: int64
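
If id should also be part of the key, as the expected groups in the question suggest, the same diff/cumsum idea could be extended by adding id to both the sort and the groupby. This is a minimal sketch under that assumption; the names keys, ordered and groups are only illustrative, and sorting by num within each id/amount pair is carried over from the approach above rather than stated in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["aaa-aaa", "aaa-aaa", "bbb-bbb", "ccc-ccc",
           "bbb-bbb", "aaa-aaa", "aaa-aaa"],
    "amount": [130, 130, 270, 130, 270, 130, 380],
    "num": [12, 39, 41, 19, 37, 42, 39],
})

# Assumption: rows may only share a group if both id and amount match.
keys = ["id", "amount"]

# Sort so that rows with the same key are contiguous and ordered by num.
ordered = df.sort_values(by=keys + ["num"])

# Flag gaps larger than 10 within each key, then count the gaps so far.
close = (ordered.groupby(keys)["num"]
                .diff().abs().gt(10).cumsum()
                .rename("run"))

# The counter is aligned back to df by index when used as a grouper.
groups = [g for _, g in df.groupby(keys + [close])]

for g in groups:
    print(g, end="\n\n")
```

On the example data this should give five groups, with rows 1 and 5 together and rows 2 and 4 together.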

Answered By: mozway

Answer 2

Sort by amount and num, then add an auxiliary marker column that records whether the difference between consecutive num values is within the threshold:

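# parentheses are needed so the method chain can span several lines
groups = (df.sort_values(['amount', 'num'])
            # diff_ is True when num is within 10 of the previous row
            .assign(diff_=lambda x: x['num'].diff().abs().fillna(0).le(10))
            .groupby(['amount', 'diff_']))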
for _, g in groups:
    print(g)

         id  amount  num  diff_
1  bbb-bbb      130   39  False
         id  amount  num  diff_
0  aaa-aaa      130   12   True
3  ddd-ddd      130   19   True
         id  amount  num  diff_
4  eee-eee      270   37   True
2  ccc-ccc      270   41   True
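
The auxiliary column could also hold a run counter instead of a boolean, so that every run of close values gets its own label. The following is a rough sketch only, assuming that id belongs in the key and that rows are compared in their original order, as the question describes; the names jump, run and group are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["aaa-aaa", "aaa-aaa", "bbb-bbb", "ccc-ccc",
           "bbb-bbb", "aaa-aaa", "aaa-aaa"],
    "amount": [130, 130, 270, 130, 270, 130, 380],
    "num": [12, 39, 41, 19, 37, 42, 39],
})

keys = ["id", "amount"]  # assumption: id is part of the grouping key

# 1 where num is more than 10 away from the previous row that has the
# same id and amount (original row order, no sorting), else 0.
jump = df.groupby(keys)["num"].diff().abs().gt(10).astype(int)

# Count the jumps separately per id/amount pair. A global cumsum would
# be wrong here because rows of different keys are interleaved.
run = jump.groupby([df["id"], df["amount"]]).cumsum().rename("run")

# Attach one integer label per (id, amount, run) combination.
labelled = df.assign(group=df.groupby(keys + [run]).ngroup())

print(labelled)
```

Keeping the label as a column means the original row order stays untouched, and a later df.groupby('group') can pick up the groups again.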

Answered By: RomanPerekhrest
