How to assign numeric labels to all elements in a list/series/array based on numbers from a different list?
Question:
I have two lists that contain two series of numbers, such as:
A = [1.0, 2.9, 3.4, 4.2, 5.5....100.3]
B = [1.1, 1.2, 1.3, 2.5, 3.0, 3.1, 5.2]
I would like to create another list of labels based on whether the elements in list B fall in an (any) interval from list A. Something like this:
C = [group_1, group_1, group_1, group_1, group_2, group_2, group_3]
i.e. 1.1, 1.2, 1.3, 2.5 all fall in the interval of 1.0 – 2.9 from list A, hence group_1; 3.0, 3.1 both fall in the interval of 2.9 – 3.4, hence group_2; and 5.2 falls in the interval of 4.2 – 5.5, hence group_3, etc.
It doesn’t matter which interval from list A a number from list B falls in; the point is to group/label all elements in list B in a consecutive manner.
The original data is large, so it would be impossible to manually assign labels/groups to elements in list B. Any help is appreciated.
Answers:
So, assuming A is sorted, you can use binary search, which already comes with the Python standard library in the (rather clunky) module bisect:
>>> import bisect
>>> A = [1.0, 2.9, 3.4, 4.2, 5.5]
>>> B = [1.1, 1.2, 1.3, 2.5, 3.0, 3.1, 5.2]
>>> [bisect.bisect_left(A, b) for b in B]
[1, 1, 1, 1, 2, 2, 4]
This takes O(len(B) * log(len(A))) time: one binary search over A for each element of B.
Note: take care to read the documentation on how bisect_left and bisect_right behave when a value in B is equal to a value in A, and on what they return for values that don’t fall in any interval.
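To make that note concrete, here is a small sketch comparing the two functions. The values in `probes` are picked purely for illustration (they are not from the question): one below the range of A, one exactly on a boundary, one inside an interval, and one above the range.

```python
import bisect

A = [1.0, 2.9, 3.4, 4.2, 5.5]

# Illustrative probe values (not from the question): below the range,
# exactly on a boundary, inside an interval, and above the range.
probes = [0.5, 2.9, 3.0, 6.0]

left = [bisect.bisect_left(A, x) for x in probes]
right = [bisect.bisect_right(A, x) for x in probes]

print(left)   # [0, 1, 2, 5]
print(right)  # [0, 2, 2, 5]
```

An index of 0 means the value is below A[0] and an index of len(A) means it is above A[-1]; neither corresponds to a real interval, so you will probably want to filter or flag those. For a value equal to a boundary, bisect_left puts it in the interval that ends at that boundary, while bisect_right puts it in the interval that starts there.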
You can try this for an O(n) solution (assuming both lists are sorted and every number in B falls in one of the intervals in A):
A = [1.0, 2.9, 3.4, 4.2, 5.5, 100.3]
B = [1.1, 1.2, 1.3, 2.5, 3.0, 3.1, 5.2]
grp = 0
i1, i2 = iter(A), iter(B)
a, b = next(i1), next(i2)
out = []
while True:
    try:
        if a < b:
            a = next(i1)
            grp += 1
        else:
            out.append(grp)
            b = next(i2)
    except StopIteration:
        break
print(out)
Prints:
[1, 1, 1, 1, 2, 2, 4]
I think itertools.groupby with a tiny mutable "key function" would fit nicely (especially if requirements may change, or if you need to use this pattern elsewhere):
import itertools

class ThresholdIndexer:
    """Callable that returns the index of the last threshold <= arg.

    Preconditions:
    - thresholds is sorted and not empty.
    - For all calls, `thresholds[0] <= call[i].arg <= thresholds[-1]`.
    - For all calls, `call[i - 1].arg <= call[i].arg`.
    """
    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.i = 0

    def __call__(self, arg):
        while not (self.thresholds[self.i] <= arg <= self.thresholds[self.i + 1]):
            self.i += 1
        return self.i

A = [1.0, 2.9, 3.4, 4.2, 5.5, 100.3]
B = [1.1, 1.2, 1.3, 2.5, 3.0, 3.1, 5.2]
for group_key, group_items in itertools.groupby(B, key=ThresholdIndexer(A)):
    print(f'{group_key}: {", ".join(str(i) for i in group_items)}')
"""Output:
0: 1.1, 1.2, 1.3, 2.5
1: 3.0, 3.1
3: 5.2
"""
This approach is O(NA + NB).
You can remove these preconditions by binary-searching for the correct index in __call__, rather than assuming some later index will "definitely" be correct. However, the complexity would bump up to O(NB × log NA).
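A minimal sketch of that binary-search variant, in case it helps. The function name `interval_index` is made up for this example, and I'm assuming bisect_right minus one is an acceptable definition of "index of the last threshold <= arg"; it returns -1 for values below the first threshold instead of looping past the end of the list.

```python
import bisect
import itertools

A = [1.0, 2.9, 3.4, 4.2, 5.5, 100.3]
B = [1.1, 1.2, 1.3, 2.5, 3.0, 3.1, 5.2]


def interval_index(arg, thresholds=A):
    """Return the index of the last threshold <= arg (-1 if there is none).

    Hypothetical stateless replacement for the key object: each call does
    its own binary search, so the calls need not arrive in sorted order.
    """
    return bisect.bisect_right(thresholds, arg) - 1


groups = [
    (key, list(items))
    for key, items in itertools.groupby(B, key=interval_index)
]
for key, items in groups:
    print(f"{key}: {items}")
```

Keep in mind that groupby only merges adjacent items with the same key, so if B is not sorted the same interval can show up as several separate groups.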
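You can do it in O(len(A) + len(B)) with a two-pointer scan (assuming both lists are sorted and every value in B lies between A[0] and A[-1]):
A = [1.0, 2.9, 3.4, 4.2, 5.5, 100.3]
B = [1.1, 1.2, 1.3, 2.5, 3.0, 3.1, 5.2]
C = [0] * len(B)
i, j = 0, 0
while i < len(B):
    # Inclusive bounds, so a value equal to a boundary in A is still matched
    # instead of pushing j past the end of A.
    if A[j] <= B[i] <= A[j + 1]:
        C[i] = j  # 0-based index of the interval A[j]..A[j + 1]
        i += 1
    else:
        j += 1
print(C)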
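Try this with NumPy:
import numpy as np
A = [1.0, 2.9, 3.4, 4.2, 5.5, 100.3]
B = [1.1, 1.2, 1.3, 2.5, 3.0, 3.1, 5.2]
A_arr = np.array(A)
B_arr = np.array(B)
# searchsorted accepts a whole array of values, so no Python-level loop is needed;
# tolist() converts the result to plain Python ints.
C = np.searchsorted(A_arr, B_arr).tolist()
print(C)
Prints:
[1, 1, 1, 1, 2, 2, 4]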
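Note that these snippets return interval indices, which skip a number whenever an interval in A has no elements from B (hence the jump from 2 to 4), while the question asks for consecutive labels group_1, group_2, group_3. A minimal sketch of one way to renumber them, reusing the bisect_left call; the `renumber` dict is just a name chosen for this example.

```python
import bisect

A = [1.0, 2.9, 3.4, 4.2, 5.5, 100.3]
B = [1.1, 1.2, 1.3, 2.5, 3.0, 3.1, 5.2]

# Interval index for each element of B (same call as in the bisect answer).
indices = [bisect.bisect_left(A, b) for b in B]

# Map each distinct index to 1, 2, 3, ... in order of first appearance.
# dict.setdefault only stores a new number the first time an index is seen.
renumber = {}
C = [f"group_{renumber.setdefault(i, len(renumber) + 1)}" for i in indices]

print(C)
```

For the sample data this gives four group_1 labels, two group_2 labels and one group_3 label, matching the list C in the question.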