Grouping dataframe by similar non matching values
Question:
If I have a pandas dataframe with the following columns: id, num, amount.
I want to group the dataframe such that all rows in each group have the same id and amount and where each row’s value of num has a value that is not more than 10 larger or smaller the next row’s value of num.
For the same id, if one row to the next does not have the same amount or if the absolute difference between the two num values is more than 10 then it will start a new grouping. Having a row with a different id in the middle does not break a grouping.
How can I go about doing this?
I have not managed to make a grouping where I’m not looking for matching values (like here where I need it to be close – but not matching). I’m assuming that this would need some custom grouping function but I’ve been having trouble putting one together
Example dataframe:
id
amount
num
aaa-aaa
130
12
aaa-aaa
130
39
bbb-bbb
270
41
ccc-ccc
130
19
bbb-bbb
270
37
aaa-aaa
130
42
aaa-aaa
380
39
Expected Groups:
Group 1:
id
amount
num
aaa-aaa
130
12
Group 2:
id
amount
num
aaa-aaa
130
39
aaa-aaa
130
42
Group 3:
id
amount
num
bbb-bbb
270
41
bbb-bbb
270
37
Group 4:
id
amount
num
ccc-ccc
130
19
Group 5:
id
amount
num
aaa-aaa
380
39
Answers:
The logic is not fully clear, but assuming you want to start a new group when there is a gap of more than 10:
close = (df.sort_values(by=['amount', 'num'])
.groupby('amount')
['num'].diff().abs().gt(10).cumsum()
)
for _, g in df.groupby(['amount', close]):
print(g, end='nn')
Output:
id amount num
0 aaa-aaa 130 12
3 ddd-ddd 130 19
id amount num
1 bbb-bbb 130 39
id amount num
2 ccc-ccc 270 41
4 eee-eee 270 37
how it works:
# sort values by amount/sum
df.sort_values(by=['amount', 'num'])
id amount num
0 aaa-aaa 130 12
3 ccc-ccc 130 19
1 aaa-aaa 130 39
5 aaa-aaa 130 42
4 bbb-bbb 270 37
2 bbb-bbb 270 41
6 aaa-aaa 380 39
# get the absolute successive difference in "num"
(df.sort_values(by=['amount', 'num'])
.groupby('amount')
['num'].diff().abs()
)
0 NaN
3 7.0
1 20.0
5 3.0
4 NaN
2 4.0
6 NaN
Name: num, dtype: float64
# check if it's greater than 10 and cumsum
# to create a grouper for groupby
[...].gt(10).cumsum()
0 0
3 0
1 1
5 1
4 1
2 1
6 1
Name: num, dtype: int64
With sorting by amount
and num
and adding an auxiliary marker column as difference (that fits a threshold) between consecutive values:
groups = df.sort_values(['amount', 'num'])
.assign(diff_=lambda x: x['num'].diff().abs().fillna(0).le(10))
.groupby(['amount', 'diff_'])
for _, g in groups:
print(g)
id amount num diff_
1 bbb-bbb 130 39 False
id amount num diff_
0 aaa-aaa 130 12 True
3 ddd-ddd 130 19 True
id amount num diff_
4 eee-eee 270 37 True
2 ccc-ccc 270 41 True
If I have a pandas dataframe with the following columns: id, num, amount.
I want to group the dataframe such that all rows in each group have the same id and amount and where each row’s value of num has a value that is not more than 10 larger or smaller the next row’s value of num.
For the same id, if one row to the next does not have the same amount or if the absolute difference between the two num values is more than 10 then it will start a new grouping. Having a row with a different id in the middle does not break a grouping.
How can I go about doing this?
I have not managed to make a grouping where I’m not looking for matching values (like here where I need it to be close – but not matching). I’m assuming that this would need some custom grouping function but I’ve been having trouble putting one together
Example dataframe:
id | amount | num |
---|---|---|
aaa-aaa | 130 | 12 |
aaa-aaa | 130 | 39 |
bbb-bbb | 270 | 41 |
ccc-ccc | 130 | 19 |
bbb-bbb | 270 | 37 |
aaa-aaa | 130 | 42 |
aaa-aaa | 380 | 39 |
Expected Groups:
Group 1:
id | amount | num |
---|---|---|
aaa-aaa | 130 | 12 |
Group 2:
id | amount | num |
---|---|---|
aaa-aaa | 130 | 39 |
aaa-aaa | 130 | 42 |
Group 3:
id | amount | num |
---|---|---|
bbb-bbb | 270 | 41 |
bbb-bbb | 270 | 37 |
Group 4:
id | amount | num |
---|---|---|
ccc-ccc | 130 | 19 |
Group 5:
id | amount | num |
---|---|---|
aaa-aaa | 380 | 39 |
The logic is not fully clear, but assuming you want to start a new group when there is a gap of more than 10:
close = (df.sort_values(by=['amount', 'num'])
.groupby('amount')
['num'].diff().abs().gt(10).cumsum()
)
for _, g in df.groupby(['amount', close]):
print(g, end='nn')
Output:
id amount num
0 aaa-aaa 130 12
3 ddd-ddd 130 19
id amount num
1 bbb-bbb 130 39
id amount num
2 ccc-ccc 270 41
4 eee-eee 270 37
how it works:
# sort values by amount/sum
df.sort_values(by=['amount', 'num'])
id amount num
0 aaa-aaa 130 12
3 ccc-ccc 130 19
1 aaa-aaa 130 39
5 aaa-aaa 130 42
4 bbb-bbb 270 37
2 bbb-bbb 270 41
6 aaa-aaa 380 39
# get the absolute successive difference in "num"
(df.sort_values(by=['amount', 'num'])
.groupby('amount')
['num'].diff().abs()
)
0 NaN
3 7.0
1 20.0
5 3.0
4 NaN
2 4.0
6 NaN
Name: num, dtype: float64
# check if it's greater than 10 and cumsum
# to create a grouper for groupby
[...].gt(10).cumsum()
0 0
3 0
1 1
5 1
4 1
2 1
6 1
Name: num, dtype: int64
With sorting by amount
and num
and adding an auxiliary marker column as difference (that fits a threshold) between consecutive values:
groups = df.sort_values(['amount', 'num'])
.assign(diff_=lambda x: x['num'].diff().abs().fillna(0).le(10))
.groupby(['amount', 'diff_'])
for _, g in groups:
print(g)
id amount num diff_
1 bbb-bbb 130 39 False
id amount num diff_
0 aaa-aaa 130 12 True
3 ddd-ddd 130 19 True
id amount num diff_
4 eee-eee 270 37 True
2 ccc-ccc 270 41 True