Fastest way to generate delimited string from 1d numpy array

Question:

I have a program which needs to turn many large one-dimensional numpy arrays of floats into delimited strings. I am finding this operation quite slow relative to the mathematical operations in my program and am wondering if there is a way to speed it up. For example, consider the following loop, which takes 100,000 random numbers in a numpy array and joins the array into a comma-delimited string 100 times.

import numpy as np
x = np.random.randn(100000)
for i in range(100):
    ",".join(map(str, x))

This loop takes about 20 seconds to complete (total, not per cycle). In contrast, consider that 100 cycles of something like elementwise multiplication (x*x) would take less than 1/10 of a second to complete. Clearly the string join operation creates a large performance bottleneck; in my actual application it will dominate total runtime. This makes me wonder: is there a faster way than ",".join(map(str, x))? Since map() is where almost all the processing time occurs, this comes down to the question of whether there is a faster way to convert a very large number of numbers to strings.
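For reference, one way to time the two operations side by side (a sketch using the standard timeit module; absolute numbers will of course vary by machine):

import timeit

setup = "import numpy as np; x = np.random.randn(100000)"
# 100 cycles of the string join versus 100 cycles of elementwise multiply
print(timeit.timeit('",".join(map(str, x))', setup=setup, number=100))
print(timeit.timeit("x*x", setup=setup, number=100))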

Asked By: Abiel


Answers:

Very good writeup on the performance of various string concatenation techniques in Python: http://www.skymind.com/~ocrow/python_string/

I’m a little surprised that some of the later approaches perform as well as they do, but it looks like you can certainly find something there that will work better for you than what you’re doing now.

Fastest method mentioned on the site

Method 6: List comprehensions

def method6():
    # Python 2: backticks are shorthand for repr(); loop_count is defined
    # earlier in the linked article
    return ''.join([`num` for num in xrange(loop_count)])

This method is the shortest. I’ll spoil the surprise and tell you it’s
also the fastest. It’s extremely compact, and also pretty
understandable. Create a list of numbers using a list comprehension
and then join them all together. Couldn’t be simpler than that. This
is really just an abbreviated version of Method 4, and it consumes
pretty much the same amount of memory. It’s faster though because we
don’t have to call the list.append() function each time round the
loop.
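Applied to the question's array, that approach would look something like this (a sketch; str() stands in for the Python 2 backtick repr):

import numpy as np

x = np.random.randn(100000)
# Build the full list of string pieces up front, then join once
s = ",".join([str(num) for num in x])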

Answered By: sblom

I think you could experiment with numpy.savetxt passing a cStringIO.StringIO object as a fake file…
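A minimal sketch of that idea (assuming Python 2's cStringIO, as named above; on Python 3 the equivalent would be io.StringIO):

import numpy as np
import cStringIO

x = np.random.randn(100000)
buf = cStringIO.StringIO()
# Write the array as one comma-delimited row into the in-memory "file"
np.savetxt(buf, x.reshape(1, -1), fmt='%f', delimiter=',')
s = buf.getvalue().rstrip('\n')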

Or maybe using str(x) and replacing the whitespace with commas (edit: this won’t work well, because str() elides large arrays with an ellipsis :-s).

As the purpose of this was to send the array over the network, maybe there are better alternatives (more efficient in both CPU and bandwidth). The one I pointed out in a comment to another answer was to encode the binary representation of the array as a Base64 text block. The main inconvenience for this to be optimal is that the client reading the chunk of data should be able to do nasty things like reinterpret a byte array as a float array, and that’s not usually allowed in type-safe languages; but it can be done quickly with a C library call (and most languages provide means to do this).

In case you cannot mess with bits, there’s always the possibility of processing the numbers one by one to convert the decoded bytes to floats.
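A sketch of that per-number approach on the receiving side, using the standard struct module (the three-value payload here is just a stand-in for illustration):

import base64
import struct

# Demo payload: three doubles packed in network (big-endian) order
payload = base64.b64encode(struct.pack('!3d', 1.0, 2.5, -3.75))

raw = base64.b64decode(payload)
# Unpack one network-order double at a time from the decoded bytes
floats = [struct.unpack('!d', raw[i:i + 8])[0]
          for i in range(0, len(raw), 8)]
# floats == [1.0, 2.5, -3.75]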

Oh, and watch out for the endianness of the machines when sending data over the network: convert to network order -> base64encode -> send | receive -> base64decode -> convert to host order
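That whole pipeline, sketched with numpy handling the byte order (tobytes() assumes a reasonably recent numpy; older versions spell it tostring()):

import base64
import numpy as np

x = np.random.randn(100000)

# Sender: force network (big-endian) byte order, then Base64-encode
payload = base64.b64encode(x.astype('>f8').tobytes())

# Receiver: decode, reinterpret the bytes as big-endian float64,
# then convert back to the host's native order
y = np.frombuffer(base64.b64decode(payload), dtype='>f8').astype(np.float64)

assert np.array_equal(x, y)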

Answered By: fortran

numpy.savetxt is even slower than string.join. ndarray.tofile() doesn’t seem to work with StringIO.

But I did find a faster method (at least for the OP’s example, on Python 2.5 with an older version of numpy):

import numpy as np
x = np.random.randn(100000)
for i in range(100):
    # Build one big format string (",%f" repeated 100000 times), drop the
    # leading comma, then fill all the slots in a single formatting operation
    (",%f" * 100000)[1:] % tuple(x)

It looks like string formatting is faster than string join if you have a well-defined format, as in this particular case. But I wonder why the OP needs such a long string of floating-point numbers in memory.

Newer versions of numpy show no speed improvement.

Answered By: Dingle

Using imap from itertools instead of map in the OP’s code gives me about a 2-3% improvement, which isn’t much, but it might combine with other ideas for a larger gain.
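The change is a one-liner (Python 2; in Python 3, map() is already lazy, so there is nothing to swap in):

import numpy as np
from itertools import imap

x = np.random.randn(100000)
# imap yields each converted string lazily instead of building a full list first
s = ",".join(imap(str, x))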

Personally, I think that if you want much better than this, you will have to use something like Cython.

Answered By: Justin Peel

Convert the numpy array into a list first. The map operation seems to run faster on a list than on a numpy array.

e.g.

import numpy as np
x = np.random.randn(100000).tolist()
for i in range(100):
    ",".join(map(str, x))

In timing tests I found a consistent 15% speedup for this example.

I’ll leave it to others to explain why this might be faster, as I have no idea!

Answered By: Pete W

A little late, but this is faster for me:

import numpy as np

x = np.random.randn(100000)
# Generate an array of strings, one formatted float per element
x_arrstr = np.char.mod('%f', x)
# Combine them into a single comma-delimited string
x_str = ",".join(x_arrstr)

The speedup on my machine is about 1.5x.

Answered By: Markus R

','.join(x.astype(str))

is about 10% slower than

x_arrstr = np.char.mod('%f', x)
x_str = ",".join(x_arrstr)

but is more readable.

Answered By: Andrey Portnoy