Fast replacement of values in a numpy array

Question:

I have a very large numpy array (containing up to a million elements) like the one below:

[0,1,6,5,1,2,7,6,2,3,8,7,3,4,9,8,5,6,11,10,6,7,12,11,7,
8,13,12,8,9,14,13,10,11,16,15,11,12,17,16,12,13,18,17,13,
14,19,18,15,16,21,20,16,17,22,21,17,18,23,22,18,19,24,23]

and a small dictionary map for replacing some of the elements in the above array

{4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

I would like to replace some of the elements according to the map above. The numpy array is really large, and only a small subset of the elements (occurring as keys in the dictionary) will be replaced with the corresponding values. What is the fastest way to do this?

Asked By: D R

||

Answers:

Well, you need to make one pass through theArray, and for each element replace it if it is in the dictionary.

for i in xrange( len( theArray ) ):
    if foo[ i ] in dict:
        foo[ i ] = dict[ foo[ i ] ]
Answered By: Katriel

I believe there’s even more efficient method, but for now, try

from numpy import copy

newArray = copy(theArray)
for k, v in d.iteritems(): newArray[theArray==k] = v

Microbenchmark and test for correctness:

#!/usr/bin/env python2.7

from numpy import copy, random, arange

random.seed(0)
data = random.randint(30, size=10**5)

d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
dk = d.keys()
dv = d.values()

def f1(a, d):
    b = copy(a)
    for k, v in d.iteritems():
        b[a==k] = v
    return b

def f2(a, d):
    for i in xrange(len(a)):
        a[i] = d.get(a[i], a[i])
    return a

def f3(a, dk, dv):
    mp = arange(0, max(a)+1)
    mp[dk] = dv
    return mp[a]


a = copy(data)
res = f2(a, d)

assert (f1(data, d) == res).all()
assert (f3(data, dk, dv) == res).all()

Result:

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
100 loops, best of 3: 6.15 msec per loop

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
100 loops, best of 3: 19.6 msec per loop
Answered By: kennytm
for i in xrange(len(the_array)):
    the_array[i] = the_dict.get(the_array[i], the_array[i])
Answered By: John La Rooy

Assuming the values are between 0 and some maximum integer, one could implement a fast replace by using the numpy-array as int->int dict, like below

mp = numpy.arange(0,max(data)+1)
mp[replace.keys()] = replace.values()
data = mp[data]

where first

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and replacing with

replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

we obtain

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  0  5  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
  1  0 16 17  2  1 17 18  3  2 18 15  0  3]
Answered By: D R

Another more general way to achieve this is function vectorization:

import numpy as np

data = np.array([0, 1, 6, 5, 1, 2, 7, 6, 2, 3, 8, 7, 3, 4, 9, 8, 5, 6, 11, 10, 6, 7, 12, 11, 7, 8, 13, 12, 8, 9, 14, 13, 10, 11, 16, 15, 11, 12, 17, 16, 12, 13, 18, 17, 13, 14, 19, 18, 15, 16, 21, 20, 16, 17, 22, 21, 17, 18, 23, 22, 18, 19, 24, 23])
mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

def mp(entry):
    return mapper_dict[entry] if entry in mapper_dict else entry
mp = np.vectorize(mp)

print mp(data)
Answered By: Speckinius Flecksis

No solution was posted still without a python loop on the array (except Celil’s one, which however assume numbers are “small”), so here is an alternative:

def replace(arr, rep_dict):
    """Assumes all elements of "arr" are keys of rep_dict"""

    # Removing the explicit "list" breaks python3
    rep_keys, rep_vals = array(list(zip(*sorted(rep_dict.items()))))

    idces = digitize(arr, rep_keys, right=True)
    # Notice rep_keys[digitize(arr, rep_keys, right=True)] == arr

    return rep_vals[idces]

the way “idces” is created comes from here.

Answered By: Pietro Battiston

I benchmarked some solutions, and the result is without appeal :

import timeit
import numpy as np

array = 2 * np.round(np.random.uniform(0,10000,300000)).astype(int)
from_values = np.unique(array) # pair values from 0 to 2000
to_values = np.arange(from_values.size) # all values from 0 to 1000
d = dict(zip(from_values, to_values))

def method_for_loop():
    out = array.copy()
    for from_value, to_value in zip(from_values, to_values) :
        out[out == from_value] = to_value
    print('Check method_for_loop :', np.all(out == array/2)) # Just checking
print('Time method_for_loop :', timeit.timeit(method_for_loop, number = 1))

def method_list_comprehension():
    out = [d[i] for i in array]
    print('Check method_list_comprehension :', np.all(out == array/2)) # Just checking
print('Time method_list_comprehension :', timeit.timeit(method_list_comprehension, number = 1))

def method_bruteforce():
    idx = np.nonzero(from_values == array[:,None])[1]
    out = to_values[idx]
    print('Check method_bruteforce :', np.all(out == array/2)) # Just checking
print('Time method_bruteforce :', timeit.timeit(method_bruteforce, number = 1))

def method_searchsort():
    sort_idx = np.argsort(from_values)
    idx = np.searchsorted(from_values,array,sorter = sort_idx)
    out = to_values[sort_idx][idx]
    print('Check method_searchsort :', np.all(out == array/2)) # Just checking
print('Time method_searchsort :', timeit.timeit(method_searchsort, number = 1))

And I got the following results :

Check method_for_loop : True
Time method_for_loop : 2.6411612760275602

Check method_list_comprehension : True
Time method_list_comprehension : 0.07994363596662879

Check method_bruteforce : True
Time method_bruteforce : 11.960559037979692

Check method_searchsort : True
Time method_searchsort : 0.03770717792212963

The “searchsort” method is almost a hundred times faster than the “for” loop, and about 3600 times faster than the numpy bruteforce method.
The list comprehension method is also a very good trade-off between code simplicity and speed.

Answered By: Jean Lescut

Pythonic way without the need for data to be integer, can be even strings:

from scipy.stats import rankdata
import numpy as np

data = np.random.rand(100000)
replace = {data[0]: 1, data[5]: 8, data[8]: 10}

arr = np.vstack((replace.keys(), replace.values())).transpose()
arr = arr[arr[:,1].argsort()]

unique = np.unique(data)
mp = np.vstack((unique, unique)).transpose()
mp[np.in1d(mp[:,0], arr),1] = arr[:,1]
data = mp[rankdata(data, 'dense')-1][:,1]
Answered By: caiohamamura

The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

import numpy_indexed as npi
remapped_array = npi.remap(theArray, list(dict.keys()), list(dict.values()))

The method implemented is similar to the searchsorted based approach mentioned by Jean Lescut, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves; yet it should achieve the same kind of performance.

Answered By: Eelco Hoogendoorn

A fully vectorized solution using np.in1d and np.searchsorted:

replace = numpy.array([list(replace.keys()), list(replace.values())])    # Create 2D replacement matrix
mask = numpy.in1d(data, replace[0, :])                                   # Find elements that need replacement
data[mask] = replace[1, numpy.searchsorted(replace[0, :], data[mask])]   # Replace elements
Answered By: Nils Werner
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.