Find indices of rows of numpy 2d array with float data in another 2D array
Question:
This post helped me achieve what I wanted, but that implementation takes too long for some of the large datasets I work on. I have two NumPy arrays (fairly large):
p[:24]=array([[ 0.18264738, -0.00326727, 0.01799096],
[ 0.18198644, -0.00051316, 0.01800063],
[ 0.18999948, 0. , 0.0226188 ],
[ 0.18215604, 0.00157497, 0.01799999],
[ 0.18286349, 0.0036474 , 0.01799824],
[ 0.18999948, 0. , 0.0226188 ],
[ 0.18399446, 0.00528562, 0.01799998],
[ 0.18573835, 0.0068323 , 0.01799908],
[ 0.18999948, 0. , 0.0226188 ],
[ 0.18573835, 0.0068323 , 0.01799908],
[ 0.18744153, 0.00758001, 0.018 ],
[ 0.18999948, 0. , 0.0226188 ],
[ 0.18744153, 0.00758001, 0.018 ],
[ 0.18956973, 0.00801727, 0.01800126],
[ 0.18999948, 0. , 0.0226188 ],
[ 0.19157426, 0.0078435 , 0.018 ],
[ 0.19366005, 0.00714792, 0.01800038],
[ 0.18999948, 0. , 0.0226188 ],
[ 0.18999948, 0. , 0.0226188 ],
[ 0.19584496, 0.0055142 , 0.01799665],
[ 0.19701494, 0.00384344, 0.01800058],
[ 0.19366005, 0.00714792, 0.01800038],
[ 0.19584496, 0.0055142 , 0.01799665],
[ 0.18999948, 0. , 0.0226188 ]])
v[:24]=array([[ 0.18264738, -0.00326727, 0.01799096],
[ 0.18198644, -0.00051316, 0.01800063],
[ 0.18999948, 0. , 0.0226188 ],
[ 0.18215604, 0.00157497, 0.01799999],
[ 0.18286349, 0.0036474 , 0.01799824],
[ 0.18399446, 0.00528562, 0.01799998],
[ 0.18573835, 0.0068323 , 0.01799908],
[ 0.18744153, 0.00758001, 0.018 ],
[ 0.18956973, 0.00801727, 0.01800126],
[ 0.19157426, 0.0078435 , 0.018 ],
[ 0.19366005, 0.00714792, 0.01800038],
[ 0.19584496, 0.0055142 , 0.01799665],
[ 0.19701494, 0.00384344, 0.01800058],
[ 0.19775054, 0.0019907 , 0.01800372],
[ 0.19800517, -0.00065405, 0.01800135],
[ 0.19731225, -0.00330035, 0.01799999],
[ 0.19596213, -0.00537427, 0.01800001],
[ 0.18937038, -0.00797523, 0.018 ],
[ 0.18739267, -0.00759293, 0.01799974],
[ 0.18565072, -0.00671446, 0.018 ],
[ 0.18411626, -0.00545196, 0.01800367],
[ 0.19136006, -0.00791202, 0.01799961],
[ 0.1938769 , -0.00702934, 0.01799973],
[ 0.1314003 , -0.06724723, 0.0645 ]])
The v array is generated from the p array using:
p_uniques, p_indices, p_inverse, p_counts = np.unique(
    p, return_index=True,
    return_inverse=True,
    return_counts=True,
    axis=0)
v = p[np.sort(p_indices, axis=None)]
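On a small toy array (hypothetical data, not the arrays above), the same recipe shows why sorting p_indices restores first-occurrence order:

```python
import numpy as np

# Toy analogue of p: four rows, one duplicate.
p = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.1, 0.2],
              [0.5, 0.6]])

p_uniques, p_indices, p_inverse, p_counts = np.unique(
    p, return_index=True,
    return_inverse=True,
    return_counts=True,
    axis=0)

# p_indices holds the first occurrence of each unique row; sorting it
# yields the unique rows in their original order of appearance in p.
v = p[np.sort(p_indices, axis=None)]
print(v)  # [[0.1 0.2] [0.3 0.4] [0.5 0.6]]
```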
Now, the goal is to generate an array that, for each row of the p array, gives the index of that row in the v array (so duplicate rows map to the same index). Therefore, the desired output would be:
indices[:24]=array([ 0, 1, 2, 3, 4, 2, 5, 6, 2, 6, 7, 2,
7, 8, 2, 9, 10, 2, 2, 11, 12, 10, 11, 2])
I only posted the first 24 entries of the indices array to save space.
I tried various approaches using np.where, np.isin, and others, but I could not produce the desired result with better performance than the solution shared in the linked post.
I'd greatly appreciate your help.
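For context, the brute-force route the question alludes to (comparing every row of p against every row of v) is O(len(p) · len(v)) in time and memory, which is why it slows down on large datasets. A minimal sketch of that baseline on toy data (not the linked post's actual code):

```python
import numpy as np

# Toy stand-ins for p and v (hypothetical data, not the arrays above).
p = np.array([[0.3, 0.4], [0.1, 0.2], [0.3, 0.4]])
v = np.array([[0.3, 0.4], [0.1, 0.2]])

# Broadcast-compare every p row against every v row: shape (len(p), len(v)).
matches = (p[:, None, :] == v[None, :, :]).all(axis=2)
# For each p row, take the index of the first matching v row.
indices = matches.argmax(axis=1)
print(indices)  # [0 1 0]
```

The intermediate boolean array is what makes this approach infeasible at scale, motivating the permutation-based answer below.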
Answers:
The key insight here is that v is a permutation of p_uniques, and np.argsort(p_indices) provides this permutation. Inverting this permutation gives us the mapping that we have to apply to p_inverse to get what we want.
To invert the permutation, we use the code from How to invert a permutation array in numpy:
import numpy as np

# p_indices: len(v), range(0, len(p)). Maps v indices to p indices
# p_inverse: len(p), range(0, len(v)). Maps p indices to p_unique indices
p_uniques, p_indices, p_inverse = np.unique(
    p, return_index=True, return_inverse=True, axis=0)
# len(v), range(0, len(v)). Maps v indices to p_unique indices
sort_permut = np.argsort(p_indices)
v = p_uniques[sort_permut]
# len(v), range(0, len(v)). Maps p_unique indices to v indices
inv_sort = np.empty_like(sort_permut)
inv_sort[sort_permut] = np.arange(len(inv_sort))
# len(p), range(0, len(v)). Maps p indices to v indices
indices = inv_sort[p_inverse]
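Running the whole pipeline on a small toy array (hypothetical data) confirms the property we want: indexing v with the resulting indices reconstructs p exactly, duplicates included.

```python
import numpy as np

# Toy p with a duplicate row (row 0 repeats at row 2).
p = np.array([[0.3, 0.4],
              [0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])

p_uniques, p_indices, p_inverse = np.unique(
    p, return_index=True, return_inverse=True, axis=0)

# Permutation taking p_uniques (lexicographic order) to v (appearance order).
sort_permut = np.argsort(p_indices)
v = p_uniques[sort_permut]

# Invert the permutation: inv_sort[sort_permut[i]] = i.
inv_sort = np.empty_like(sort_permut)
inv_sort[sort_permut] = np.arange(len(inv_sort))

# Map each p row through p_inverse, then into v's ordering.
indices = inv_sort[p_inverse]
print(indices)  # [0 1 0 2]

# Sanity check: indexing v with `indices` reconstructs p exactly.
assert np.array_equal(v[indices], p)
```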