numpy – how to add a value to every element in the first column of an array?
Question:
I have an array like this:
array([('6506', 4.6725971801473496e-25, 0.99999999995088695),
('6601', 2.2452745388799898e-27, 0.99999999995270605),
('21801', 1.9849650921836601e-31, 0.99999999997999001), ...,
('45164194', 1.0413482803123399e-24, 0.99999999997453404),
('45164198', 1.09470356446595e-24, 0.99999999997635303),
('45164519', 3.7521365799080699e-24, 0.99999999997453404)],
dtype=[('pos', '|S100'), ('par1', '<f8'), ('par2', '<f8')])
And I want to turn it into this (adding the prefix ‘2R:’ to each value in the first column):
array([('2R:6506', 4.6725971801473496e-25, 0.99999999995088695),
('2R:6601', 2.2452745388799898e-27, 0.99999999995270605),
('2R:21801', 1.9849650921836601e-31, 0.99999999997999001), ...,
('2R:45164194', 1.0413482803123399e-24, 0.99999999997453404),
('2R:45164198', 1.09470356446595e-24, 0.99999999997635303),
('2R:45164519', 3.7521365799080699e-24, 0.99999999997453404)],
dtype=[('pos', '|S100'), ('par1', '<f8'), ('par2', '<f8')])
I looked into nditer, but I want to support earlier versions of numpy. I have also read that one should avoid iteration.
Answers:
A simple (albeit perhaps not optimal) solution is just:
a = np.array([('6506', 4.6725971801473496e-25, 0.99999999995088695),
('6601', 2.2452745388799898e-27, 0.99999999995270605),
('21801', 1.9849650921836601e-31, 0.99999999997999001),
('45164194', 1.0413482803123399e-24, 0.99999999997453404),
('45164198', 1.09470356446595e-24, 0.99999999997635303),
('45164519', 3.7521365799080699e-24, 0.99999999997453404)],
dtype=[('pos', '|S100'), ('par1', '<f8'), ('par2', '<f8')])
a['pos'] = [''.join(('2R:',x)) for x in a['pos']]
In [11]: a
Out[11]:
array([('2R:6506', 4.67259718014735e-25, 0.999999999950887),
('2R:6601', 2.24527453887999e-27, 0.999999999952706),
('2R:21801', 1.98496509218366e-31, 0.99999999997999),
('2R:45164194', 1.04134828031234e-24, 0.999999999974534),
('2R:45164198', 1.09470356446595e-24, 0.999999999976353),
('2R:45164519', 3.75213657990807e-24, 0.999999999974534)],
dtype=[('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')])
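The session above appears to come from Python 2, where elements of an 'S' field are ordinary strings. On Python 3 they are bytes, so joining them with the str prefix '2R:' raises a TypeError. A minimal sketch of the same idea for Python 3, assuming a shortened three-row version of the array with bytes literals:

```python
import numpy as np

dtype = [('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')]
a = np.array([(b'6506', 4.6725971801473496e-25, 0.99999999995088695),
              (b'6601', 2.2452745388799898e-27, 0.99999999995270605),
              (b'21801', 1.9849650921836601e-31, 0.99999999997999001)],
             dtype=dtype)

# Elements of an 'S' field are bytes on Python 3, so the prefix is a bytes literal.
a['pos'] = [b'2R:' + x for x in a['pos']]

print(a['pos'])
```

Only the 'pos' field is rewritten; 'par1' and 'par2' are left untouched.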
While I like @falsetru’s answer for using core numpy routines, surprisingly, list comprehension seems a bit faster:
In [19]: a = np.empty(20000, dtype=[('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')])
In [20]: %timeit a['pos'] = [''.join(('2R:',x)) for x in a['pos']]
100 loops, best of 3: 11.1 ms per loop
In [21]: %timeit a['pos'] = add('2R:', a['pos'])
100 loops, best of 3: 15.7 ms per loop
That said, benchmark your own use case and hardware to see which makes more sense for your application. In some situations, basic Python constructs can outperform numpy built-ins, depending on the task at hand.
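For benchmarking outside IPython, a sketch along these lines with the standard timeit module may help. It assumes Python 3 (hence the bytes prefix) and fills the array with made-up position strings rather than np.empty, so the input is well defined. The helper names with_comprehension and with_char_add are illustrative only, and each run resets the column first so the prefix is not stacked repeatedly.

```python
import timeit
import numpy as np

dtype = [('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')]
a = np.zeros(20000, dtype=dtype)
# Made-up positions b'0', b'1', ... so every run starts from known data.
a['pos'] = [str(i).encode() for i in range(a.size)]
original = a['pos'].copy()

def with_comprehension():
    a['pos'] = original  # reset, so the prefix is added exactly once
    a['pos'] = [b'2R:' + x for x in a['pos']]

def with_char_add():
    a['pos'] = original  # reset, so the prefix is added exactly once
    a['pos'] = np.char.add(b'2R:', a['pos'])

for fn in (with_comprehension, with_char_add):
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

The reset adds the same small cost to both variants, so the comparison stays fair even though the absolute numbers are slightly inflated.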
Using numpy.core.defchararray.add:
>>> from numpy import array
>>> from numpy.core.defchararray import add
>>>
>>> xs = array([('6506', 4.6725971801473496e-25, 0.99999999995088695),
... ('6601', 2.2452745388799898e-27, 0.99999999995270605),
... ('21801', 1.9849650921836601e-31, 0.99999999997999001),
... ('45164194', 1.0413482803123399e-24, 0.99999999997453404),
... ('45164198', 1.09470356446595e-24, 0.99999999997635303),
... ('45164519', 3.7521365799080699e-24, 0.99999999997453404)],
... dtype=[('pos', '|S100'), ('par1', '<f8'), ('par2', '<f8')])
>>> xs['pos'] = add('2R:', xs['pos'])
>>> xs
array([('2R:6506', 4.67259718014735e-25, 0.999999999950887),
('2R:6601', 2.24527453887999e-27, 0.999999999952706),
('2R:21801', 1.98496509218366e-31, 0.99999999997999),
('2R:45164194', 1.04134828031234e-24, 0.999999999974534),
('2R:45164198', 1.09470356446595e-24, 0.999999999976353),
('2R:45164519', 3.75213657990807e-24, 0.999999999974534)],
dtype=[('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')])
UPDATE: You can use numpy.char.add instead of numpy.core.defchararray.add (as pointed out in a comment by @joel-buursma):
>>> import numpy
>>> numpy.char == numpy.core.defchararray
True
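A short sketch of the numpy.char.add spelling, assuming Python 3 and a reasonably recent NumPy. Importing through the public numpy.char namespace avoids depending on numpy.core, which newer NumPy releases treat as private. The prefix has to match the field type: bytes for an 'S' field, str for a 'U' array. The sample rows are shortened from the question.

```python
import numpy as np

dtype = [('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')]
xs = np.array([(b'6506', 4.6725971801473496e-25, 0.99999999995088695),
               (b'6601', 2.2452745388799898e-27, 0.99999999995270605)],
              dtype=dtype)

# Bytes prefix for a bytes ('S') field; the result is cast back to the
# field's 'S100' type on assignment.
xs['pos'] = np.char.add(b'2R:', xs['pos'])

# The same call on a unicode array takes a str prefix instead.
labels = np.char.add('2R:', np.array(['6506', '6601']))

print(xs['pos'])
print(labels)
```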
Another slightly faster solution is to use a list comprehension with the + operator. I do not understand why it is faster, but it is simple and elegant.
a['pos'] = ["2R:" + x for x in a['pos']]
Timings:
%timeit a['pos'] = ["2R:" + x for x in a['pos']]
8.07 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit a['pos'] = [''.join(('2R:',x)) for x in a['pos']]
9.53 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit a['pos'] = add('2R:', a['pos'])
14.2 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
PS: I created the array a using a slightly different definition:
a = np.empty(20000, dtype=[('pos', 'U5'), ('par1', '<f8'), ('par2', '<f8')])
because if I use type Sxxx for pos, concatenation produces a type error for me.
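One caveat with a fixed-width unicode field such as 'U5': values longer than the declared width are silently truncated on assignment, so the field has to be wide enough for the prefix plus the longest position. The sketch below illustrates this with two sample values from the question. The helper make_array and the width 20 are assumptions chosen for the example.

```python
import numpy as np

def make_array(width):
    # Hypothetical helper: builds a small array whose 'pos' field holds
    # at most `width` unicode characters.
    dtype = [('pos', f'U{width}'), ('par1', '<f8'), ('par2', '<f8')]
    return np.array([('6506', 4.6725971801473496e-25, 0.99999999995088695),
                     ('21801', 1.9849650921836601e-31, 0.99999999997999001)],
                    dtype=dtype)

narrow = make_array(5)
narrow['pos'] = ["2R:" + x for x in narrow['pos']]
print(narrow['pos'])  # only the first five characters of each value survive

wide = make_array(20)
wide['pos'] = ["2R:" + x for x in wide['pos']]
print(wide['pos'])    # full prefixed values fit
```

No warning is raised in the narrow case, so it is worth checking the field width before prefixing real data.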