pandas – ranking with tolerance?
Question:
Is there a way to rank values in a dataframe but considering a tolerance?
Say I have the following values
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
and if I ran rank:
ex.rank(method='average')
0 2.0
1 3.0
2 1.0
3 6.0
4 5.0
5 4.0
dtype: float64
But what I’d like as a result would be (with a tolereance of 0.01):
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
Any way to define this tolerance?
Thanks
Answers:
You could do some sort of min-max scaling i.e.
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
# You scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))
# You put them on a scale from 1 to the length of your series
result = ex_scaled * len(ex) + 1
# result
0 1.335347
1 4.444109
2 1.000000
3 7.000000
4 4.969789
5 4.453172
That way you are still ranking, but values closer to each other have ranks close to each other
You can sort the values, merge the close ones and rank on that:
s = ex.drop_duplicates().sort_values()
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
.transform('mean')
.reindex_like(ex)
)
out = mapper.rank(method='average')
N.B. I used 0.011
as threshold as floating point arithmetics does not always enable enough precision to detect a value clode to the threshold
output:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
intermediate mapper
:
0 16.520
1 19.955
2 16.150
3 22.770
4 20.530
5 19.955
dtype: float64
This function may works:
def rank_with_tolerance(sr, tolerance=0.01+1e-10, method='average'):
vals = pd.Series(sr.unique()).sort_values()
vals.index = vals
vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))
return sr.map(vals).fillna(sr).rank(method=method)
It works for your given input:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
And with more complex sets it seems to work too:
ex = pd.Series([16.52,19.95,19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 1.0
1 3.0
2 3.0
3 3.0
4 5.5
5 5.5
6 7.0
dtype: float64
Is there a way to rank values in a dataframe but considering a tolerance?
Say I have the following values
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
and if I ran rank:
ex.rank(method='average')
0 2.0
1 3.0
2 1.0
3 6.0
4 5.0
5 4.0
dtype: float64
But what I’d like as a result would be (with a tolereance of 0.01):
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
Any way to define this tolerance?
Thanks
You could do some sort of min-max scaling i.e.
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
# You scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))
# You put them on a scale from 1 to the length of your series
result = ex_scaled * len(ex) + 1
# result
0 1.335347
1 4.444109
2 1.000000
3 7.000000
4 4.969789
5 4.453172
That way you are still ranking, but values closer to each other have ranks close to each other
You can sort the values, merge the close ones and rank on that:
s = ex.drop_duplicates().sort_values()
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
.transform('mean')
.reindex_like(ex)
)
out = mapper.rank(method='average')
N.B. I used 0.011
as threshold as floating point arithmetics does not always enable enough precision to detect a value clode to the threshold
output:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
intermediate mapper
:
0 16.520
1 19.955
2 16.150
3 22.770
4 20.530
5 19.955
dtype: float64
This function may works:
def rank_with_tolerance(sr, tolerance=0.01+1e-10, method='average'):
vals = pd.Series(sr.unique()).sort_values()
vals.index = vals
vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))
return sr.map(vals).fillna(sr).rank(method=method)
It works for your given input:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
And with more complex sets it seems to work too:
ex = pd.Series([16.52,19.95,19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 1.0
1 3.0
2 3.0
3 3.0
4 5.5
5 5.5
6 7.0
dtype: float64