Select rows of numpy array based on column values

Question:

I’m trying to obtain the rows of a numpy array based on the values of the columns. Basically, if the value of the column is within a predefined list, I want to obtain that row. I’ll leave an example below.

This is my array:

myArray = np.array([[1,55,4],
                     [2,2,3],
                     [3,90,2],
                     [4,65,1]])

These are the desired values:

desiredValues = [2,3,4]

I want to obtain all the rows of the array (myArray) for which the value in the first column is in the list (desiredValues). The result should be the following array:

desiredArray = np.array([[2,2,3],
                         [3,90,2],
                         [4,65,1]])

I’ve done some research, and I know this is easily done for specific values (Numpy select rows based on condition) or for values within a range (Numpy array: How to extract whole rows based on values in a column), but I couldn’t work out how to do it for values within a list.

Asked By: Paulo Nascimento


Answers:

You can use np.isin() to check which elements in the first column are in the desired values list. You can then use the output as a mask.

import numpy as np

my_array = np.array([[1,55,4],
                     [2,2,3],
                     [3,90,2],
                     [4,65,1]])
desired_values = np.array([2,3,4])

# Boolean mask: True where the first-column value is in desired_values
mask = np.isin(element=my_array[:, 0], test_elements=desired_values)

# Keep only the rows where the mask is True
desired_array = my_array[mask]
print(desired_array)

Output:

[[ 2  2  3]
 [ 3 90  2]
 [ 4 65  1]]
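
To make the masking step more concrete, here is a minimal sketch that prints the intermediate boolean mask and also uses the invert=True argument of np.isin() to select the complementary rows. The variable name other_rows is only illustrative, and the desired values are passed as a plain list here on the assumption that this matches the question's input.

```python
import numpy as np

my_array = np.array([[1, 55, 4],
                     [2, 2, 3],
                     [3, 90, 2],
                     [4, 65, 1]])
desired_values = [2, 3, 4]  # a plain list is accepted as well as an array

# One boolean per row: True where the first column is in desired_values
mask = np.isin(my_array[:, 0], desired_values)
print(mask)  # [False  True  True  True]

# invert=True flips the test, keeping rows whose first column is NOT in the list
other_rows = my_array[np.isin(my_array[:, 0], desired_values, invert=True)]
print(other_rows)  # [[ 1 55  4]]
```

Indexing with the mask keeps the rows in their original order, so no sorting is involved.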

Edit: numpy.isin vs for-loop

@Furas suggested a for-loop solution. That approach works and is perhaps more intuitive, at least in that one does not have to research the many esoteric functions of Numpy.

My first reaction is to emphasize that Numpy operations tend to be faster than for-loops. However, the details are a little more nuanced. For-loops seem to be slightly faster for small arrays, but their time costs increase significantly faster as the size of the array increases.

Comparisons for relatively small test arrays

The first graph compares the computation times for np.isin() to those of the for-loop. The number of rows in my_array is shown along the x-axis, taking the values 2, 4, 6, … 32.

[Graph: computation time of np.isin() and the for-loop for 2 to 32 rows]

The above graph shows that the computation time for the for-loop seems to increase linearly with the growth of the tested array. The for-loop is faster until the number of rows is approximately 10.

Comparisons for larger test arrays

The second graph shows a comparison similar to the first graph, but examines computation times when the number of rows in my_array is 2, 4, 8, 16, … 1024.

[Graph: computation time of np.isin() and the for-loop for 2 to 1024 rows]

The above graph shows that np.isin() is significantly faster and more appropriate for larger problems.

Code for reproduction

The data to recreate the above graphs may be generated with the following code.

import numpy as np

# Note: %timeit is an IPython magic, so run this in IPython or Jupyter.
count_list = [2**x for x in range(1, 11)]  # 2, 4, 8, ... 1024
isin_time_means = []
loop_time_means = []

for count in count_list:
    my_array = np.random.randint(low=-10, high=10, size=(count, 5))
    desired_values = np.random.randint(low=-10, high=10, size=(10,))
    # -o makes %timeit return a result object whose .timings holds the measured times
    a = %timeit -o np.isin(my_array[:, 0], desired_values)
    b = %timeit -o [x in desired_values for x in my_array[:, 0]]
    isin_time_means.append(np.mean(a.timings))
    loop_time_means.append(np.mean(b.timings))
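
Since %timeit only exists inside IPython or Jupyter, a rough equivalent for a plain Python script can be sketched with the standard timeit module. The function name time_both, the fixed seed, and the repeat and number settings are assumptions made for illustration, and this sketch reports the best of several repeats rather than the mean, so absolute numbers will not match the graphs above.

```python
import timeit
import numpy as np


def time_both(count, repeats=3, number=20):
    """Return (isin_seconds, loop_seconds) per call for an array with `count` rows."""
    rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for repeatability
    my_array = rng.integers(low=-10, high=10, size=(count, 5))
    desired_values = rng.integers(low=-10, high=10, size=10)
    first_col = my_array[:, 0]

    # Best of `repeats` runs, divided by `number` to get seconds per call
    isin_t = min(timeit.repeat(lambda: np.isin(first_col, desired_values),
                               repeat=repeats, number=number)) / number
    loop_t = min(timeit.repeat(lambda: [x in desired_values for x in first_col],
                               repeat=repeats, number=number)) / number
    return isin_t, loop_t


for count in [2**x for x in range(1, 11)]:  # 2, 4, 8, ... 1024 rows
    isin_t, loop_t = time_both(count)
    print(f"{count:5d} rows: isin {isin_t:.2e} s, loop {loop_t:.2e} s")
```

The printed timings depend on the machine and the NumPy version, so they are best read as a trend rather than as exact figures.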
Answered By: Juancheeto

You can always use a for-loop (here, a list comprehension) to check every value separately:

mask = [x in desiredValues for x in myArray[:,0]]

desired_array = myArray[mask]

Full code:

import numpy as np

myArray = np.array([[1,55,4],
                     [2,2,3],
                     [3,90,2],
                     [4,65,1]])

desiredValues = [2,3,4]

mask = [x in desiredValues for x in myArray[:,0]]

desired_array = myArray[mask]

print(desired_array)
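
A possible variation on the loop, sketched under the assumption that the desired values are hashable (as the integers here are): converting the list to a set first means each membership test no longer scans the whole list. The name desiredSet is hypothetical and does not appear in the code above.

```python
import numpy as np

myArray = np.array([[1, 55, 4],
                    [2, 2, 3],
                    [3, 90, 2],
                    [4, 65, 1]])
desiredValues = [2, 3, 4]

# Hypothetical helper: set lookups take constant time on average
desiredSet = set(desiredValues)

# tolist() converts the column to plain Python ints before the membership test
mask = [x in desiredSet for x in myArray[:, 0].tolist()]

# A list of booleans works as a row mask, just like a boolean array
desired_array = myArray[mask]
print(desired_array)
```

This only changes how each membership test is done; the loop over the rows still runs in Python.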
Answered By: furas