Determine the range of a value using a lookup table

Question:

I have a df with numbers:

numbers = pd.DataFrame(columns=['number'], data=[
50,
65,
75,
85,
90
])

and a df with ranges (lookup table):

ranges = pd.DataFrame(
columns=['range','range_min','range_max'],
data=[
['A',90,100],
['B',85,95],
['C',70,80]
]
)

I want to determine which range (in the second table) each value (in the first table) falls in. Please note that the ranges overlap and the limits are inclusive.
Also note that the example dataframe above has 3 ranges, but this dataframe is generated dynamically. It could have anywhere from 2 to 7 ranges.

Desired result:

numbers = pd.DataFrame(columns=['number','detected_range'], data=[
[50,'out_of_range'],
[65, 'out_of_range'],
[75,'C'],
[85,'B'],
[90,'overlap']  # could be A or B
])

I solved this with a for loop, but it doesn't scale well to the big dataset I am using. The code is also long and inelegant. See below:

numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        # inclusive limits: range_min <= number <= range_max
        if row2.range_min <= row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = ranges.loc[j, 'range']
        # elif (other cases...):
        #     ...and so on...

How could I do this?

Asked By: Pab


Answers:

You can use vectorized NumPy operations to generate boolean masks, and use them to select your labels:

import numpy as np

a = numbers['number'].values   # numpy array of numbers
r = ranges.set_index('range')  # dataframe of min/max with labels as index

m1 = (a>=r['range_min'].values[:,None]).T  # is number at or above each min
m2 = (a<=r['range_max'].values[:,None]).T  # is number at or below each max (limits are inclusive)
m3 = (m1&m2)                               # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2

m4 = m3.sum(1)                             # how many matches?
                                           # 0 -> out_of_range
                                           # 2 or more -> overlap
                                           # 1 -> get range label

# now we select the label according to the conditions
numbers['detected_range'] = np.select([m4==0, m4>=2], # out_of_range and overlap
                                      ['out_of_range', 'overlap'],
                                      # otherwise get the label of the single matching range
                                      default=np.take(r.index, m3.argmax(1))
                                     )

output:

   number detected_range
0      50   out_of_range
1      65   out_of_range
2      75              C
3      85              B
4      90        overlap

Edit:

It works with any number of intervals in ranges.

Example output with an extra range ['D', 50, 51]:

   number detected_range
0      50              D
1      65   out_of_range
2      75              C
3      85              B
4      90        overlap
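
The masking idea above can also be wrapped in a small function. This is only a sketch under a few assumptions: the function name detect_range and the extra test value 100 are made up for illustration, the limits are treated as inclusive on both ends, and any value matching two or more ranges is labelled 'overlap'.

```python
import numpy as np
import pandas as pd


def detect_range(values, ranges):
    """Hypothetical helper: label each value with its range, 'overlap' or 'out_of_range'."""
    a = np.asarray(values)[:, None]                   # shape (n_values, 1)
    lo = ranges['range_min'].to_numpy()               # shape (n_ranges,)
    hi = ranges['range_max'].to_numpy()
    hits = (a >= lo) & (a <= hi)                      # broadcast to (n_values, n_ranges)
    n_hits = hits.sum(axis=1)                         # number of matching ranges per value
    labels = ranges['range'].to_numpy(dtype=object)
    single = labels[hits.argmax(axis=1)]              # label of the first match (used when n_hits == 1)
    return np.select([n_hits == 0, n_hits >= 2],
                     ['out_of_range', 'overlap'],
                     default=single)


numbers = pd.DataFrame({'number': [50, 65, 75, 85, 90, 100]})
ranges = pd.DataFrame({'range': ['A', 'B', 'C', 'D'],
                       'range_min': [90, 85, 70, 50],
                       'range_max': [100, 95, 80, 51]})
numbers['detected_range'] = detect_range(numbers['number'], ranges)
print(numbers)
```

Because the comparison broadcasts to an array of shape (n_values, n_ranges), memory grows with the product of both sizes. With only 2 to 7 ranges that should stay small even for a large numbers table.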

Answered By: mozway

Pandas' IntervalIndex fits in here; however, since your ranges overlap, a for loop is the approach I'll use here (for unique, non-overlapping intervals, IntervalIndex.get_indexer is a fast approach):

intervals = pd.IntervalIndex.from_arrays(ranges.range_min, 
                                         ranges.range_max, 
                                         closed='both')

box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum()==1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')

numbers.assign(detected_range = box)
 
   number detected_range
0      50   out_of_range
1      65   out_of_range
2      75              C
3      85              B
4      90        overlap
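
For the non-overlapping case mentioned at the start of this answer, a sketch with IntervalIndex.get_indexer might look like the following. The ranges here are hypothetical (adjusted so that they do not overlap, unlike the ones in the question); with overlapping intervals get_indexer raises an error instead.

```python
import numpy as np
import pandas as pd

numbers = pd.DataFrame({'number': [50, 65, 75, 85, 90]})

# Hypothetical, non-overlapping ranges (NOT the ones from the question).
ranges = pd.DataFrame({'range': ['C', 'B', 'A'],
                       'range_min': [70, 81, 91],
                       'range_max': [80, 90, 100]})

intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
                                         ranges.range_max,
                                         closed='both')

# Position of the interval containing each number, or -1 if there is none.
pos = intervals.get_indexer(numbers['number'].to_numpy())

labels = ranges['range'].to_numpy(dtype=object)
# labels[pos] picks an arbitrary label where pos == -1; np.where replaces it.
numbers['detected_range'] = np.where(pos == -1, 'out_of_range', labels[pos])
print(numbers)
```

This avoids the Python-level loop entirely, at the cost of not being able to report overlaps.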

Answered By: sammywemmy

First, explode the ranges (range_max + 1 is used because the limits are inclusive):

df1=ranges.assign(col1=ranges.apply(lambda ss:range(ss.range_min,ss.range_max+1),axis=1)).explode('col1')
df1

  range  range_min  range_max col1
0     A         90        100   90
0     A         90        100   91
0     A         90        100   92
0     A         90        100   93
0     A         90        100   94
0     A         90        100   95
0     A         90        100   96
0     A         90        100   97
0     A         90        100   98
0     A         90        100   99
0     A         90        100  100
1     B         85         95   85
1     B         85         95   86
1     B         85         95   87
1     B         85         95   88
1     B         85         95   89
1     B         85         95   90
...

Second, check whether each number in the first df appears in the exploded ranges:

def function1(x):
    df11=df1.loc[df1.col1==x]
    if len(df11)==0:
        return 'out_of_range'
    if len(df11)>1:
        return 'overlap'
    return df11.iloc[0,0]

numbers.assign(col2=numbers.number.map(function1))

   number          col2
0      50  out_of_range
1      65  out_of_range
2      75             C
3      85             B
4      90       overlap

The logic is simple and clear, but note that it only works for integer values, since range() enumerates integers.
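
The same expand-then-look-up idea can be sketched without calling a Python function per number, by counting how many ranges cover each integer once and then mapping. The names exploded, stats and lookup are made up for this illustration, and the sketch assumes integer values with inclusive limits.

```python
import pandas as pd

numbers = pd.DataFrame({'number': [50, 65, 75, 85, 90]})
ranges = pd.DataFrame({'range': ['A', 'B', 'C'],
                       'range_min': [90, 85, 70],
                       'range_max': [100, 95, 80]})

# One row per (label, integer covered by that range); hi + 1 makes the upper limit inclusive.
exploded = pd.DataFrame(
    [(label, value)
     for label, lo, hi in ranges.itertuples(index=False)
     for value in range(lo, hi + 1)],
    columns=['range', 'value'],
)

# For each covered integer: how many ranges contain it, and the first such label.
stats = exploded.groupby('value')['range'].agg(['count', 'first'])

# Keep the label where exactly one range matches, otherwise mark as overlap.
lookup = stats['first'].where(stats['count'] == 1, 'overlap')

# Integers not covered by any range are missing from the lookup and become NaN.
result = numbers.assign(
    detected_range=numbers['number'].map(lookup).fillna('out_of_range')
)
print(result)
```

The lookup table is built once, so the cost per number is a single index lookup rather than a filter over the whole exploded frame.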

Answered By: G.G