What is the most efficient way to remove a group of indices from a list of numbers in Python 2.7?
Question:
Using Python 2.7, what is the most efficient way to take a list of values representing indices like this (but with up to 250,000+ entries):
indices = [2, 4, 5]
and remove those indices from a larger list like this (3,000,000+ items):
numbers = [2, 6, 12, 20, 24, 40, 42, 51]
to get a result like this:
[2, 6, 20, 42, 51]
I’m looking for an efficient solution more than anything else. I know there are many ways to do this, but that’s not my problem; efficiency is. This operation will also have to be done many times, and both lists will get exponentially smaller. I do not have an equation to represent how much smaller they will get over time.
edit:
Numbers must remain sorted in a list the entire time or return to being sorted after the indices have been removed. The list called indices can either be sorted or not sorted. It doesn’t even have to be in a list.
Answers:
Another option (this assumes indices is sorted in ascending order):
>>> numbers = [2, 6, 12, 20, 24, 40, 42, 51]
>>> indices = [2, 4, 5]
>>> offset = 0
>>> for i in indices:
...     del numbers[i - offset]
...     offset += 1
...
>>> numbers
[2, 6, 20, 42, 51]
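The loop above only lands on the right elements when the indices are visited in ascending order, and the question allows an unsorted list. A minimal sketch of an in-place variant that does not need the offset bookkeeping, assuming it is acceptable to sort the indices first; delete_in_place is a hypothetical name, and the code runs under both Python 2.7 and 3:

```python
def delete_in_place(numbers, indices):
    """Delete the given positions from `numbers` in place and return it."""
    # Work from the highest index down, so deleting one element never
    # shifts the position of an element that is still waiting to be deleted.
    # set() drops repeated indices so nothing is deleted twice.
    for i in sorted(set(indices), reverse=True):
        del numbers[i]
    return numbers


numbers = [2, 6, 12, 20, 24, 40, 42, 51]
delete_in_place(numbers, [5, 2, 4])  # unsorted indices are fine here
print(numbers)  # [2, 6, 20, 42, 51]
```

Each del still shifts the tail of the list, so this is about correctness with unsorted input rather than speed.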
Edit:
So after being hopelessly wrong on this answer, I benchmarked each of the different approaches:
[Benchmark plot: horizontal axis is number of items, vertical axis is time in seconds.]
The fastest option is using slicing to build a new list (from @gnibbler):
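def using_slices(numbers, indices):
    # indices must already be sorted in ascending order
    result = []
    i = 0
    for j in indices:
        result += numbers[i:j]  # copy the run between two removed positions
        i = j + 1
    result += numbers[i:]  # copy everything after the last removed position
    return result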
Surprisingly, it and “sets” (@Eric) beat numpy.delete (@Jon Clements).
Here’s the script I used; perhaps I’ve missed something.
Here’s my first approach.
def remove_indices(numbers, indices):
    indices = set(indices)
    return [x for i, x in enumerate(numbers) if i not in indices]
Here’s a test module to test it under the conditions you specified. (3 million elements with 250k to remove)
import random

def create_test_set():
    # 3 million numbers, 250k distinct random positions to remove
    numbers = range(3000000)
    indices = random.sample(range(3000000), 250000)
    return numbers, indices

def remove_indices(numbers, indices):
    indices = set(indices)
    return [x for i, x in enumerate(numbers) if i not in indices]

if __name__ == '__main__':
    import time
    numbers, indices = create_test_set()
    a = time.time()
    numbers = remove_indices(numbers, indices)
    b = time.time()
    # Python 2 print statement: elapsed seconds, then remaining length
    print b - a, len(numbers)
It takes around 0.6 seconds on my laptop. You might consider converting indices to a set beforehand if you’ll be using it multiple times.
(FWIW, bradley.ayers’s solution took longer than I was willing to wait.)
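The timing above comes from a single time.time() measurement. As a rough sketch of a more repeatable comparison, the snippet below uses timeit to time the set-based filter against the slice-based approach from this thread. The function names, the input sizes, and the repeat count are all assumptions chosen for illustration (the sizes are scaled down so it finishes quickly), and it is not the benchmark script referred to earlier. It runs under both Python 2.7 and 3:

```python
import random
import timeit


def remove_with_set(numbers, indices):
    """Keep every element whose position is not in the set of indices."""
    drop = set(indices)
    return [x for i, x in enumerate(numbers) if i not in drop]


def remove_with_slices(numbers, indices):
    """Copy the runs of elements that lie between the removed positions."""
    result = []
    start = 0
    for j in sorted(indices):
        result += numbers[start:j]
        start = j + 1
    result += numbers[start:]
    return result


def time_approaches(n=300000, k=25000, repeat=3):
    """Return the best-of-`repeat` time in seconds for each approach."""
    numbers = list(range(n))
    indices = random.sample(range(n), k)  # k distinct positions
    timings = {}
    for func in (remove_with_set, remove_with_slices):
        runs = timeit.repeat(lambda: func(numbers, indices),
                             number=1, repeat=repeat)
        timings[func.__name__] = min(runs)
    return timings


if __name__ == "__main__":
    for name, seconds in sorted(time_approaches().items()):
        print("%s: %.4f s" % (name, seconds))
```

Taking the minimum over several runs reduces noise from other processes; absolute numbers will differ by machine and Python version.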
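Edit: This is slightly faster (0.55 seconds), provided indices is already a set:
def remove_indices(numbers, indices):
    # indices should be a set here; membership tests on a list would be far slower
    return [numbers[i] for i in xrange(len(numbers)) if i not in indices]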
You may want to consider using the numpy library for efficiency (which, if you’re dealing with lists of integers, may not be a bad idea anyway):
>>> import numpy as np
>>> a = np.array([2, 6, 12, 20, 24, 40, 42, 51])
>>> np.delete(a, [2,4,5])
array([ 2,  6, 20, 42, 51])
Notes on np.delete: http://docs.scipy.org/doc/numpy/reference/generated/numpy.delete.html
It might also be worth looking at keeping the main array as is, but maintaining a masked array (I haven’t done any speed tests on that either, though…)
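One way to read the masking idea is a plain boolean keep-mask over the array, sketched below. This is an assumption about what was meant (it uses ordinary boolean indexing, not the numpy.ma module), remove_with_mask is a hypothetical name, and it has not been timed against np.delete:

```python
import numpy as np


def remove_with_mask(a, indices):
    """Return a new array without the elements at the given positions."""
    keep = np.ones(len(a), dtype=bool)             # start by keeping everything
    idx = np.asarray(list(indices), dtype=np.intp)
    keep[idx] = False                              # mark the positions to drop
    return a[keep]                                 # boolean indexing copies the kept items


a = np.array([2, 6, 12, 20, 24, 40, 42, 51])
print(remove_with_mask(a, [2, 4, 5]).tolist())  # [2, 6, 20, 42, 51]
```

The mask could also be kept alongside the original array and updated between rounds, so the data is only copied when a compacted array is actually needed.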
Not that efficient, but a different approach:
indices = set([2, 4, 5])
result = [x for i,x in enumerate(numbers) if i not in indices]
I have a suspicion that taking whole slices between the indices might be faster than the list comprehension:
def remove_indices(numbers, indices):
    result = []
    i = 0
    for j in sorted(indices):
        result += numbers[i:j]
        i = j + 1
    result += numbers[i:]
    return result
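Another approach (note that numbers.index(item) rescans the list for every item, so this is quadratic in the length of numbers, and it only gives the right answer when numbers contains no duplicate values):
>>> numbers = [2, 6, 12, 20, 24, 40, 42, 51]
>>> indices = [2, 4, 5]
>>> [item for item in numbers if numbers.index(item) not in indices]
[2, 6, 20, 42, 51]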
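Because list.index returns the position of the first matching value, the comprehension above filters by value rather than by position. A small sketch of what happens when numbers contains a repeated value, compared with filtering on the position from enumerate; the sample data and variable names are made up for illustration:

```python
numbers = [2, 6, 6, 20]
indices = [2]  # intent: remove the second 6

# numbers.index(6) is 1 for both 6s, so neither is ever seen as position 2.
by_value = [item for item in numbers if numbers.index(item) not in indices]

# enumerate supplies each element's real position.
drop = set(indices)
by_position = [x for i, x in enumerate(numbers) if i not in drop]

print(by_value)     # [2, 6, 6, 20]
print(by_position)  # [2, 6, 20]
```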