Find the values in a second array based on a condition on the first array: what's the most efficient way?
Question:
Suppose we have two arrays with the same shape. Using the unique elements in arr_1, I want to find the corresponding values in arr_2 and build a dictionary as output.
(arr_1 is sorted. arr_2 is NOT sorted.)
Here is an example.
arr_1 = np.array([1,1,1,2,2,3]) # the index array, sorted
arr_2 = np.array([16,11,12,13,14,15]) # find the values, not sorted
target_dict = {1:[16,11,12], 2:[13,14], 3:[15]}
My solution, using a dict comprehension:
target_dict = {i: arr_2[np.where(arr_1 == i)].tolist() for i in np.unique(arr_1)}
However, both arr_1 and arr_2 have more than 4B elements in my case, so the code above can take more than 100 hours to finish.
Is there a more efficient way to accomplish this? Thank you so much in advance!
Answers:
Not sure why you're using a where clause. You can do a single pass over both lists simultaneously to map the elements. I'm using a defaultdict here for simplicity.
from collections import defaultdict
arr_1 = [1,1,1,2,2,3]
arr_2 = [16,11,12,13,14,15]
# Missing keys start out as empty lists
target = defaultdict(list)
# Walk both sequences in lockstep and group each value under its key
for key, value in zip(arr_1, arr_2):
    target[key].append(value)
print(dict(target))
# prints {1: [16, 11, 12], 2: [13, 14], 3: [15]}
This approach makes one linear pass over the data, whereas the dict comprehension rescans all of arr_1 once per unique value, so it should scale much better for your use case.
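If you would rather stay in NumPy, here is a minimal sketch that relies on the fact stated in the question that arr_1 is sorted, so equal keys sit in contiguous runs. The function name group_sorted is just a hypothetical helper for illustration. It uses np.unique with return_index=True to find where each run starts, then np.split to cut arr_2 at those positions.

```python
import numpy as np

def group_sorted(keys, values):
    """Group `values` by `keys`, assuming `keys` is sorted (hypothetical helper)."""
    # For a sorted array, the first-occurrence index of each unique key
    # is exactly the start of that key's contiguous run.
    uniq, starts = np.unique(keys, return_index=True)
    # Cut `values` at every run boundary except the leading 0.
    chunks = np.split(values, starts[1:])
    return {k: chunk.tolist() for k, chunk in zip(uniq.tolist(), chunks)}

arr_1 = np.array([1, 1, 1, 2, 2, 3])        # the index array, sorted
arr_2 = np.array([16, 11, 12, 13, 14, 15])  # the values, not sorted
target_dict = group_sorted(arr_1, arr_2)
print(target_dict)
# prints {1: [16, 11, 12], 2: [13, 14], 3: [15]}
```

This avoids the per-element Python loop, but converting every chunk with tolist() still creates one Python object per element. With billions of elements it may be worth keeping the chunks as NumPy arrays, or storing only uniq and starts and slicing arr_2 on demand. How much this helps at the 4B scale is untested here.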