How to select inverse of indexes of a numpy array?
Question:
I have a large set of data in which I need to compare the distances of a set of samples from this array with all the other elements of the array. Below is a very simple example of my data set.
import numpy as np
import scipy.spatial.distance as sd
data = np.array(
[[ 0.93825827, 0.26701143],
[ 0.99121108, 0.35582816],
[ 0.90154837, 0.86254049],
[ 0.83149103, 0.42222948],
[ 0.27309625, 0.38925281],
[ 0.06510739, 0.58445673],
[ 0.61469637, 0.05420098],
[ 0.92685408, 0.62715114],
[ 0.22587817, 0.56819403],
[ 0.28400409, 0.21112043]]
)
sample_indexes = [1,2,3]
# I'd rather not make this
other_indexes = list(set(range(len(data))) - set(sample_indexes))
sample_data = data[sample_indexes]
other_data = data[other_indexes]
# compare them
dists = sd.cdist(sample_data, other_data)
Is there a way to index a numpy array by the indexes that are NOT the sample indexes? In the example above I build a list called other_indexes. I’d rather not do this, for various reasons (large data set, threading, and very little memory on the system this runs on). Is there a way to do something like:
other_data = data[ indexes not in sample_indexes]
I read that numpy masks can do this, so I tried:
other_data = data[~sample_indexes]
And this gives me an error. Do I have to create a mask?
Answers:
# Use the builtin bool: the np.bool alias is unavailable in some NumPy releases
mask = np.ones(len(data), dtype=bool)
mask[sample_indexes] = False
other_data = data[mask]
This is not the most elegant solution for what perhaps should be a single-line statement, but it’s fairly efficient, and the memory overhead is minimal too.
If memory is your prime concern, np.delete(data, sample_indexes, axis=0) spares you from building the mask yourself, and fancy indexing creates a copy anyway.
On second thought: np.delete does not modify the existing array, so it’s pretty much exactly the single-line statement you are looking for.
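A minimal sketch of the np.delete variant described above. The small array is a made-up stand-in for the question's data, not the original values; passing axis=0 is what makes np.delete drop whole rows.

```python
import numpy as np

# Illustrative stand-in for the question's data: 10 rows, 2 columns.
data = np.arange(20, dtype=float).reshape(10, 2)
sample_indexes = [1, 2, 3]

sample_data = data[sample_indexes]
# axis=0 removes whole rows; without axis the array is flattened first.
other_data = np.delete(data, sample_indexes, axis=0)

print(other_data.shape)  # (7, 2)
print(data.shape)        # (10, 2), the original array is unchanged
```

The result is a new array, so data itself can still be indexed with sample_indexes afterwards.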
I’m not familiar with the specifics of numpy, but here’s a general solution. Suppose you have the following list:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
You create another list of the indices you don’t want:
inds = [1, 3, 6]
Now filter by position (enumerate supplies the index of each element):
good_data = [x for i, x in enumerate(a) if i not in inds]
This results in good_data = [0, 2, 4, 5, 7, 8, 9].
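A small sketch of position-based filtering on a plain list whose values differ from their positions, so it is visible that indices, not values, are being dropped. Storing the unwanted indices in a set is an assumption added here for faster membership tests; it is not part of the answer above.

```python
a = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
inds = {1, 3, 6}  # positions to leave out

# enumerate yields (position, value); keep values whose position is not in inds.
good_data = [x for i, x in enumerate(a) if i not in inds]

print(good_data)  # [10, 12, 14, 15, 17, 18, 19]
```

This works on any Python sequence, but for a large numpy array it loops in Python and will likely be much slower than a mask.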
You may want to try in1d (spelled np.isin in newer NumPy releases):
In [5]: select = np.in1d(np.arange(data.shape[0]), sample_indexes)
In [6]: print(data[select])
[[ 0.99121108 0.35582816]
[ 0.90154837 0.86254049]
[ 0.83149103 0.42222948]]
In [7]: print(data[~select])
[[ 0.93825827 0.26701143]
[ 0.27309625 0.38925281]
[ 0.06510739 0.58445673]
[ 0.61469637 0.05420098]
[ 0.92685408 0.62715114]
[ 0.22587817 0.56819403]
[ 0.28400409 0.21112043]]
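The sketch below illustrates why ~ works on the boolean array above but not on the index list from the question. On a Python list, ~ raises a TypeError. On an integer array it is a bitwise NOT, which yields negative indices rather than a complement. The array here is an illustrative stand-in, and np.isin is used as the newer spelling of in1d.

```python
import numpy as np

data = np.arange(20, dtype=float).reshape(10, 2)
sample_indexes = [1, 2, 3]

# On an integer array, ~ is bitwise NOT: ~1 == -2, ~2 == -3, ~3 == -4.
flipped = ~np.array(sample_indexes)
print(flipped)  # [-2 -3 -4]

# Those are valid negative indices, so this silently picks rows 8, 7, 6.
wrong = data[flipped]

# On a boolean mask, ~ is logical NOT, which gives the complement.
select = np.isin(np.arange(len(data)), sample_indexes)
other_data = data[~select]
print(other_data.shape)  # (7, 2)
```

So a mask of some kind is needed for the ~ syntax; an index list alone cannot be inverted this way.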
You may also use setdiff1d:
In [11]: data[np.setdiff1d(np.arange(data.shape[0]), sample_indexes)]
Out[11]:
array([[ 0.93825827, 0.26701143],
[ 0.27309625, 0.38925281],
[ 0.06510739, 0.58445673],
[ 0.61469637, 0.05420098],
[ 0.92685408, 0.62715114],
[ 0.22587817, 0.56819403],
[ 0.28400409, 0.21112043]])
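As a rough cross-check, the sketch below runs the mask, np.delete and setdiff1d approaches from this thread side by side on a made-up array and confirms they select the same rows. The variable names via_mask, via_delete and via_setdiff are hypothetical, chosen only for this comparison.

```python
import numpy as np

data = np.arange(20, dtype=float).reshape(10, 2)
sample_indexes = [3, 1, 2]  # deliberately unsorted

# Boolean mask: True for every row to keep.
mask = np.ones(len(data), dtype=bool)
mask[sample_indexes] = False
via_mask = data[mask]

# np.delete: returns a copy without the listed rows.
via_delete = np.delete(data, sample_indexes, axis=0)

# setdiff1d: sorted, unique row numbers that are not in sample_indexes.
other_indexes = np.setdiff1d(np.arange(len(data)), sample_indexes)
via_setdiff = data[other_indexes]

print(other_indexes)                          # [0 4 5 6 7 8 9]
print(np.array_equal(via_mask, via_delete))   # True
print(np.array_equal(via_mask, via_setdiff))  # True
```

On memory: a boolean mask costs one byte per row, while an integer index array such as the setdiff1d result typically costs eight bytes per kept row, so the mask is usually the lighter temporary when most rows are kept.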