Speeding up pairing of strings into objects in Python
Question:
I’m trying to find an efficient way to pair together rows of data containing integer points, and store them as Python objects. The data is made up of X and Y coordinate points, represented as comma-separated strings. The points have to be paired, as in (x_1, y_1), (x_2, y_2), ... etc., and then stored as a list of objects, where each point is an object. The function get_data below generates this example data:
def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 10)) for x in range(M)],
                [str(random.randint(1, 10)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data
The parsing code I have now is:
class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test():
    import time
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1])
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
Currently, this takes nearly 7 seconds for 100,000 points, which seems very inefficient. Part of the inefficiency seems to stem from the calculation of first_points, second_points and paired_points, and the conversion of these into objects.
Another part of the inefficiency seems to be the building up of all_point_sets. Taking out the all_point_sets.append(...) line seems to make the code go from ~7 seconds to ~2 seconds!
How can this be sped up?
FOLLOWUP: Thanks for everyone’s great suggestions – they were all helpful. But even with all the improvements, it’s still about 3 seconds to process 100,000 entries. I’m not sure why it’s not just instant in this case, or whether there’s an alternative representation that would make it instant. Would coding this in Cython change things? Could someone offer an example of that? Thanks again.
Answers:
I don’t know if there’s much you can do.
You can use generators to avoid the extra memory allocations. This gives me about a 5% speedup.
first_points = (int(p) for p in first_points.split(","))
second_points = (int(p) for p in second_points.split(","))
paired_points = itertools.izip(first_points, second_points)
curr_points = [Point(x, y) for x,y in paired_points]
Even collapsing the entire loop into one massive list comprehension doesn’t do much.
all_point_sets = [
[Point(int(x), int(y)) for x, y in itertools.izip(xs.split(','), ys.split(','))]
for xs, ys in data
]
If you go on to iterate over this big list then you could turn it into a generator. That would spread out the cost of parsing the CSV data so you don’t get a big upfront hit.
all_point_sets = (
[Point(int(x), int(y)) for x, y in itertools.izip(xs.split(','), ys.split(','))]
for xs, ys in data
)
- make Point a namedtuple (~10% speedup):

  from collections import namedtuple
  Point = namedtuple('Point', 'a b')

- unpack during iteration (~2-4% speedup):

  for xs, ys in data:

- use the n-argument form of map to avoid zip (~10% speedup):

  curr_points = map(Point,
                    map(int, xs.split(',')),
                    map(int, ys.split(',')),
                    )
Given that the point sets are short, generators are probably overkill as they have a higher fixed overhead.
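For what it’s worth, that fixed overhead is easy to measure directly. A minimal sketch (Python 3 syntax; exact timings will vary by machine):

```python
import timeit

xs = [str(i) for i in range(1, 11)]  # one short row, like the question's M=10

# Both forms produce the same list; the difference is pure overhead.
via_listcomp = [int(x) for x in xs]
via_genexpr = list(int(x) for x in xs)
assert via_listcomp == via_genexpr

# On input this short, the generator's setup cost usually dominates.
t_list = timeit.timeit("[int(x) for x in xs]", globals={"xs": xs}, number=50000)
t_gen = timeit.timeit("list(int(x) for x in xs)", globals={"xs": xs}, number=50000)
print("listcomp: %.3fs  genexpr: %.3fs" % (t_list, t_gen))
```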
You can shave a few seconds off:
class Point2(object):
    __slots__ = ['a', 'b']
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test_new(data):
    all_point_sets = []
    for row in data:
        first_points, second_points = row
        r0 = map(int, first_points.split(","))
        r1 = map(int, second_points.split(","))
        cp = map(Point2, r0, r1)
        all_point_sets.append(cp)
which gave me
In [24]: %timeit test(d)
1 loops, best of 3: 5.07 s per loop
In [25]: %timeit test_new(d)
1 loops, best of 3: 3.29 s per loop
I was intermittently able to shave another 0.3s off by preallocating space in all_point_sets
but that could be just noise. And of course there’s the old-fashioned way of making things faster:
localhost-2:coding $ pypy pointexam.py
1.58351397514
When creating large numbers of objects, often the single biggest performance enhancement available is to turn the garbage collector off. Every “generation” of objects, the garbage collector traverses all the live objects in memory, looking for objects that are part of cycles but are not pointed at by live objects, and thus are eligible for memory reclamation. See Doug Hellmann’s PyMOTW GC article for some information (more can perhaps be found with Google and some determination). By default, the collector runs every 700 or so objects created-but-not-reclaimed, with older generations collected somewhat less often (I forget the exact details).
Using a standard tuple instead of a Point class can save you some time (using a namedtuple is somewhere in between), and clever unpacking can save you some time, but the largest gain can be had by turning the gc off before your creation of lots of objects that you know don’t need to be gc’d, and then turning it back on afterwards.
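For reference, the collector’s thresholds and the disable/enable pattern look roughly like this (a minimal sketch; (700, 10, 10) is the long-standing CPython default but can be tuned):

```python
import gc

# CPython triggers a generation-0 collection once (allocations -
# deallocations) exceeds the first threshold; tunable via gc.set_threshold().
print(gc.get_threshold())

# The pattern suggested above: suspend collection around bulk creation
# of objects that we know form no reference cycles, then re-enable.
gc.disable()
try:
    bulk = [(i, i + 1) for i in range(10000)]  # stand-in for Point creation
finally:
    gc.enable()
```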
Some code:
def orig_test_gc_off():
    import time
    data = get_data()
    all_point_sets = []
    import gc
    gc.disable()
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1])
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "gc off total time: ", (time_end - time_start)

def test1():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    gc.disable()
    for index, row in enumerate(data):
        first_points, second_points = row
        curr_points = map(
            Point,
            [int(i) for i in first_points.split(",")],
            [int(i) for i in second_points.split(",")])
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 1 total time: ", (time_end - time_start)

def test2():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    gc.disable()
    time_start = time.time()
    for index, row in enumerate(data):
        first_points, second_points = row
        first_points = [int(i) for i in first_points.split(",")]
        second_points = [int(i) for i in second_points.split(",")]
        curr_points = [(x, y) for x, y in zip(first_points, second_points)]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 2 total time: ", (time_end - time_start)

orig_test()
orig_test_gc_off()
test1()
test2()
Some results:
>>> %run /tmp/flup.py
total time: 6.90738511086
gc off total time: 4.94075202942
variant 1 total time: 4.41632509232
variant 2 total time: 3.23905301094
I got a 50% improvement by using arrays, and a holder object that lazily constructs Point objects when accessed. I also “slotted” the Point object for better storage efficiency. However, a tuple would probably be better.
Changing the data structure may also help, if that’s possible. But this will never be instantaneous.
from array import array

class Point(object):
    __slots__ = ["a", "b"]
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return "Point(%d, %d)" % (self.a, self.b)

class Points(object):
    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys
    def __getitem__(self, i):
        return Point(self.xs[i], self.ys[i])

def test3():
    xs = array("i")
    ys = array("i")
    time_start = time.time()
    for row in data:
        xs.extend([int(val) for val in row[0].split(",")])
        ys.extend([int(val) for val in row[1].split(",")])
    print ("total time: ", (time.time() - time_start))
    return Points(xs, ys)
But when dealing with large amounts of data I would usually use numpy N-dimensional arrays (ndarray). If the original data structure could be altered, that would likely be fastest of all: structure it so the x,y pairs can be read in linearly, then reshape the ndarray.
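If that restructuring is possible, the read-linearly-then-reshape idea is only a couple of lines. A hedged sketch with a hypothetical flat stream of three points:

```python
import numpy as np

# Hypothetical flat stream of x,y pairs read linearly: (1,10), (2,4), (3,10).
flat = np.array([1, 10, 2, 4, 3, 10], dtype=np.int32)

# Reshape to (n_points, 2): column 0 is x, column 1 is y. No Python-level
# loop and no per-point objects are involved.
pts = flat.reshape(-1, 2)
print(pts[:, 0], pts[:, 1])
```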
I would

- use numpy arrays for this problem (Cython would be an option, if this is still not fast enough)
- store the points as a vector, not as single Point instances
- rely on existing parsers
- (if possible) parse the data once, then store it in a binary format like hdf5 for further calculations, which will be the fastest option (see below)
Numpy has built-in functions to read text files, for instance loadtxt.
If you have the data stored in a structured array, you do not necessarily need to convert it to another data type.
I will use Pandas, which is a library built on top of numpy. It is a bit more convenient for handling and processing structured data. Pandas has its own file parser, read_csv.
To time it, I wrote the data to a file, like in your original problem (it is based on your get_data):
import numpy as np
import pandas as pd
def create_example_file(n=100000, m=20):
    ex1 = pd.DataFrame(np.random.randint(1, 10, size=(n, m)),
                       columns=(['x_%d' % x for x in range(10)] +
                                ['y_%d' % y for y in range(10)]))
    ex1.to_csv('example.csv', index=False, header=False)
    return
This is the code I used to read the data into a pandas.DataFrame:
def with_read_csv(csv_file):
    df = pd.read_csv(csv_file, header=None,
                     names=(['x_%d' % x for x in range(10)] +
                            ['y_%d' % y for y in range(10)]))
    return df
(Note that I assumed there is no header in your file, so I had to create the column names.)
Reading the data is fast, it should be more memory efficient (see this question) and the data is stored in a data structure you can further work with in a fast, vectorized way:
In [18]: %timeit string_to_object.with_read_csv('example.csv')
1 loops, best of 3: 553 ms per loop
There is a new C-based parser in a development branch which takes 414 ms on my system.
Your test takes 2.29 s on my system, but it is not really comparable, as the data is not read from a file and you created the Point instances.
Once you have read in the data, you can store it in an hdf5 file:
In [19]: store = pd.HDFStore('example.h5')
In [20]: store['data'] = df
In [21]: store.close()
Next time you need the data you can read it from this file, which is really fast:
In [1]: store = pd.HDFStore('example.h5')
In [2]: %timeit df = store['data']
100 loops, best of 3: 16.5 ms per loop
However, this is only applicable if you need the same data more than once.
Using numpy-based arrays with large data sets will have advantages when you are doing further calculations. Cython wouldn’t necessarily be faster if you can use vectorized numpy functions and indexing; it will be faster if you really need iteration (see also this answer).
Simply running with pypy makes a big difference
$ python pairing_strings.py
total time: 2.09194397926
$ pypy pairing_strings.py
total time: 0.764246940613
Disabling gc didn’t help for pypy
$ pypy pairing_strings.py
total time: 0.763386964798
namedtuple for Point makes it worse
$ pypy pairing_strings.py
total time: 0.888827085495
Using itertools.imap and itertools.izip
$ pypy pairing_strings.py
total time: 0.615751981735
Using a memoized version of int and an iterator to avoid the zip
$ pypy pairing_strings.py
total time: 0.423738002777
Here is the code I finished with.
def test():
    import time

    def m_int(s, memo={}):
        if s in memo:
            return memo[s]
        else:
            retval = memo[s] = int(s)
            return retval

    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for xs, ys in data:
        point_set = []
        # Convert points from strings to integers
        y_iter = iter(ys.split(","))
        curr_points = [Point(m_int(i), m_int(next(y_iter))) for i in xs.split(",")]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
The data is a tab-separated file, which consists of lists of comma-separated integers. Using the sample get_data() I made a .csv file like this:
1,6,2,8,2,3,5,9,6,6 10,4,10,5,7,9,6,1,9,5
6,2,2,5,2,2,1,7,7,9 7,6,7,1,3,7,6,2,10,5
8,8,9,2,6,10,10,7,8,9 4,2,10,3,4,4,1,2,2,9
...
Then I abused C-optimized parsing via JSON:
def test2():
    import json
    import time
    time_start = time.time()
    with open('data.csv', 'rb') as f:
        data = f.read()
    data = '[[[' + ']],[['.join(data.splitlines()).replace('\t', '],[') + ']]]'
    all_point_sets = [Point(*xy) for row in json.loads(data) for xy in zip(*row)]
    time_end = time.time()
    print "total time: ", (time_end - time_start)
Results on my box: your original test() ~8s, with gc disabled ~6s, while my version (I/O included) gives ~6s and ~4s respectively, i.e. about a 50% speedup. But looking at profiler data, it’s obvious that the biggest bottleneck is object instantiation itself, so Matt Anderson’s answer would net you the most gain on CPython.
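The string rewrite is easier to follow on an in-memory string than through file I/O. A self-contained sketch of the same transform on two toy tab-separated rows (note the literal tab):

```python
import json

# Two tab-separated rows of comma-separated ints, as in the sample file.
raw = "1,6,2\t10,4,10\n6,2,2\t7,6,7"

# Rewrite the text into a JSON array literal: each row becomes [[xs],[ys]].
text = '[[[' + ']],[['.join(raw.splitlines()).replace('\t', '],[') + ']]]'
rows = json.loads(text)  # C-optimized parsing does all the int conversion

# Pair x with y per row, exactly as zip(*row) does in the answer above.
pairs = [list(zip(*row)) for row in rows]
print(pairs[0])  # [(1, 10), (6, 4), (2, 10)]
```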
Faster method, using Numpy (speedup of about 7x):
import numpy as np
txt = ','.join(','.join(row) for row in data)
arr = np.fromstring(txt, dtype=int, sep=',')
return arr.reshape(100000, 2, 10).transpose((0,2,1))
Performance comparison:
def load_1(data):
    all_point_sets = []
    gc.disable()
    for xs, ys in data:
        all_point_sets.append(zip(map(int, xs.split(',')), map(int, ys.split(','))))
    gc.enable()
    return all_point_sets

def load_2(data):
    txt = ','.join(','.join(row) for row in data)
    arr = np.fromstring(txt, dtype=int, sep=',')
    return arr.reshape(100000, 2, 10).transpose((0,2,1))
load_1 runs in 1.52 seconds on my machine; load_2 runs in 0.20 seconds, a 7-fold improvement. The big caveat here is that it requires that you (1) know the lengths of everything in advance, and (2) that every row contains exactly the same number of points. This is true for your get_data output, but may not be true for your real dataset.
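When the equal-length assumption is in doubt, it is cheap to verify before reshaping. A hedged sketch on toy data (using np.array over a split instead of the deprecated text mode of np.fromstring):

```python
import numpy as np

# Toy rows in the question's format: (xs_string, ys_string) per row.
data = [("1,2,3", "7,8,9"), ("4,5,6", "1,1,1")]

# The reshape trick assumes a fixed point count per row; check it up
# front rather than letting reshape fail with a cryptic error.
n_points = data[0][0].count(',') + 1
assert all(xs.count(',') + 1 == n_points == ys.count(',') + 1
           for xs, ys in data), "rows have unequal point counts"

txt = ','.join(','.join(row) for row in data)
arr = np.array(txt.split(','), dtype=int)
arr = arr.reshape(len(data), 2, n_points).transpose((0, 2, 1))
print(arr[0])  # row 0 as (point, xy) pairs
```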
Cython is able to speed things up by a factor of 5.5:
$ python split.py
total time: 2.16252303123
total time: 0.393486022949
Here is the code I used
split.py
import time
import pyximport; pyximport.install()
from split_ import test_

def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 100)) for x in range(M)],
                [str(random.randint(1, 100)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data

class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test(data):
    all_point_sets = []
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1])
                       for p in paired_points]
        all_point_sets.append(curr_points)
    return all_point_sets

data = get_data()
for func in test, test_:
    time_start = time.time()
    res = func(data)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
split_.pyx
from libc.string cimport strsep
from libc.stdlib cimport atoi

cdef class Point:
    cdef public int a, b
    def __cinit__(self, a, b):
        self.a = a
        self.b = b

def test_(data):
    cdef char *xc, *yc, *xt, *yt
    cdef char **xcp, **ycp
    all_point_sets = []
    for xs, ys in data:
        xc = xs
        xcp = &xc
        yc = ys
        ycp = &yc
        point_set = []
        while True:
            xt = strsep(xcp, ",")
            if xt is NULL:
                break
            yt = strsep(ycp, ",")
            point_set.append(Point(atoi(xt), atoi(yt)))
        all_point_sets.append(point_set)
    return all_point_sets
Poking around further, I can approximately break down some of the CPU resources:
5% strsep()
9% atoi()
23% creating Point instances
35% all_point_sets.append(point_set)
I would expect there may be an improvement if Cython were able to read from a csv (or whatever) file directly, instead of having to trawl through a Python object.
As the time taken for built-in functions such as zip(a,b) or map(int, string.split(",")) on arrays of length 2,000,000 is negligible, I have to presume that the most time-consuming operation is append.
Thus the proper way to address the problem is to recursively concatenate the strings:
10 strings of 10 elements to a bigger string
10 strings of 100 elements
10 strings of 1000 elements
and finally to zip(map(int, huge_string_a.split(",")), map(int, huge_string_b.split(",")))
It’s then just fine-tuning to find the optimal base N for the append-and-conquer method.
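The simplest (non-recursive) version of that idea joins everything into one big string per column, converts once, and slices the result back into rows. A hedged sketch on toy data (M = 3 points per row; the question's real data has M = 10):

```python
# Toy rows in the question's format: (xs_string, ys_string) per row.
data = [("1,2,3", "4,5,6"), ("7,8,9", "1,2,3")]
M = 3

# One big split and one int-conversion pass per column instead of per row.
xs = [int(s) for s in ",".join(row[0] for row in data).split(",")]
ys = [int(s) for s in ",".join(row[1] for row in data).split(",")]
pairs = list(zip(xs, ys))

# Slice the flat pair list back into per-row point sets.
all_point_sets = [pairs[i:i + M] for i in range(0, len(pairs), M)]
print(all_point_sets[0])  # [(1, 4), (2, 5), (3, 6)]
```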
There are many good answers here. One side of this issue not addressed so far, however, is the list-to-string time cost difference between the various iterator implementations in Python.
There is an essay testing the efficiency of different iterators with respect to list-to-string conversion in the Python.org essays: list2str.
Bear in mind that when I ran into similar optimization problems, but with different data structures and sizes, the results presented in the essay did not all scale up at an equal rate, so it’s worthwhile testing the different iterator implementations for your particular use case.
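The essay’s central comparison can be shown in a few lines: repeated concatenation and "".join produce the same string, but join is linear in total length while naive += can degrade quadratically on implementations without CPython’s in-place string optimization. A minimal sketch:

```python
items = [str(i) for i in range(100)]

# Quadratic in the worst case: each += may copy the accumulated string.
via_concat = ""
for s in items:
    via_concat += s + ","
via_concat = via_concat.rstrip(",")

# Linear: one allocation sized from the summed element lengths.
via_join = ",".join(items)
assert via_concat == via_join
```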
How attached are you to having your coordinates accessible as .x and .y attributes? To my surprise, my tests show that the biggest single time sink was not the calls to list.append(), but the construction of the Point objects. They take four times as long to construct as a tuple, and there are a lot of them. Simply replacing Point(int(x), int(y)) with the tuple (int(x), int(y)) in your code shaved over 50% off the total execution time (Python 2.6 on Win XP). Perhaps your current code still has room to optimize this?
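A quick way to reproduce that comparison (Python 3 syntax; the toy class mirrors the question’s Point, and the exact ratio varies by interpreter version and machine):

```python
import timeit

class Point(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

env = {"Point": Point, "a": 1, "b": 2}

# Instantiating even a trivial class runs type.__call__ plus __init__;
# building a tuple is a single bytecode-level operation.
t_tuple = timeit.timeit("(a, b)", globals=env, number=200000)
t_point = timeit.timeit("Point(a, b)", globals=env, number=200000)
print("tuple: %.4fs  Point: %.4fs" % (t_tuple, t_point))
```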
If you are really set on accessing the coordinates with .x and .y, you can try using collections.namedtuple. It’s not as fast as plain tuples, but seems to be much faster than the Point class in your code (I’m hedging because a separate timing benchmark gave me weird results).
Pair = namedtuple("Pair", "x y") # instead of the Point class
...
curr_points = [ Pair(x, y) for x, y in paired_points ]
If you need to go this route, it also pays off to derive a class from tuple (minimal cost over plain tuple). I can provide details if requested.
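Since details were offered on request, here is one way the derive-from-tuple idea might look (a hypothetical sketch, not the answerer’s actual code):

```python
class Pair(tuple):
    """Tuple subclass with .x/.y access; keeps tuple's storage layout."""
    __slots__ = ()  # no per-instance __dict__, so no extra memory cost

    def __new__(cls, x, y):
        return tuple.__new__(cls, (x, y))

    @property
    def x(self):
        return self[0]

    @property
    def y(self):
        return self[1]

p = Pair(3, 4)
print(p.x, p.y)      # attribute access, like the Point class
print(p == (3, 4))   # still compares (and unpacks) like a tuple
```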
PS I see @MattAnderson mentioned the object-tuple issue long ago. But it’s a major effect (on my box at least), even before disabling garbage collection.
Original code: total time: 15.79
tuple instead of Point: total time: 7.328
namedtuple instead of Point: total time: 9.140
I’m trying to find an efficient way to pair together rows of data containing integer points, and storing them as Python objects. The data is made up of X
and Y
coordinate points, represented as a comma separated strings. The points have to be paired, as in (x_1, y_1), (x_2, y_2), ...
etc. and then stored as a list of objects, where each point is an object. The function below get_data
generates this example data:
def get_data(N=100000, M=10):
import random
data = []
for n in range(N):
pair = [[str(random.randint(1, 10)) for x in range(M)],
[str(random.randint(1, 10)) for x in range(M)]]
row = [",".join(pair[0]),
",".join(pair[1])]
data.append(row)
return data
The parsing code I have now is:
class Point:
def __init__(self, a, b):
self.a = a
self.b = b
def test():
import time
data = get_data()
all_point_sets = []
time_start = time.time()
for row in data:
point_set = []
first_points, second_points = row
# Convert points from strings to integers
first_points = map(int, first_points.split(","))
second_points = map(int, second_points.split(","))
paired_points = zip(first_points, second_points)
curr_points = [Point(p[0], p[1])
for p in paired_points]
all_point_sets.append(curr_points)
time_end = time.time()
print "total time: ", (time_end - time_start)
Currently, this takes nearly 7 seconds for 100,000 points, which seems very inefficient. Part of the inefficiency seems to stem from the calculation of first_points
, second_points
and paired_points
– and the conversion of these into objects.
Another part of the inefficiency seems to be the building up of all_point_sets
. Taking out the all_point_sets.append(...)
line seems to make the code go from ~7 seconds to 2 seconds!
How can this be sped up?
FOLLOWUP Thanks for everyone’s great suggestions – they were all helpful. but even with all the improvements, it’s still about 3 seconds to process 100,000 entries. I’m not sure why in this case it’s not just instant, and whether there’s an alternative representation that would make it instant. Would coding this in Cython change things? Could someone offer an example of that? thanks again.
I don’t know if there’s much you can do.
You can use generator to avoid the extra memory allocations. This gives me about a 5% speedup.
first_points = (int(p) for p in first_points .split(","))
second_points = (int(p) for p in second_points.split(","))
paired_points = itertools.izip(first_points, second_points)
curr_points = [Point(x, y) for x,y in paired_points]
Even collapsing the entire loop into one massive list comprehension doesn’t do much.
all_point_sets = [
[Point(int(x), int(y)) for x, y in itertools.izip(xs.split(','), ys.split(','))]
for xs, ys in data
]
If you go on to iterate over this big list then you could turn it into a generator. That would spread out the cost of parsing the CSV data so you don’t get a big upfront hit.
all_point_sets = (
[Point(int(x), int(y)) for x, y in itertools.izip(xs.split(','), ys.split(','))]
for xs, ys in data
)
-
make
Point
anamedtuple
(~10% speedup):from collections import namedtuple Point = namedtuple('Point', 'a b')
-
unpack during iteration (~2-4% speedup):
for xs, ys in data:
-
use
n
-argument form ofmap
to avoid zip (~10% speedup):curr_points = map(Point, map(int, xs.split(',')), map(int, ys.split(',')), )
Given that the point sets are short, generators are probably overkill as they have a higher fixed overhead.
You can shave a few seconds off:
class Point2(object):
__slots__ = ['a','b']
def __init__(self, a, b):
self.a = a
self.b = b
def test_new(data):
all_point_sets = []
for row in data:
first_points, second_points = row
r0 = map(int, first_points.split(","))
r1 = map(int, second_points.split(","))
cp = map(Point2, r0, r1)
all_point_sets.append(cp)
which gave me
In [24]: %timeit test(d)
1 loops, best of 3: 5.07 s per loop
In [25]: %timeit test_new(d)
1 loops, best of 3: 3.29 s per loop
I was intermittently able to shave another 0.3s off by preallocating space in all_point_sets
but that could be just noise. And of course there’s the old-fashioned way of making things faster:
localhost-2:coding $ pypy pointexam.py
1.58351397514
When dealing with the creating of large numbers of objects, often the single biggest performance enhancement you can use is to turn the garbage collector off. Every “generation” of objects, the garbage collector traverses all the live objects in memory, looking for objects that are a part of cycles but are not pointed at by live objects, thus are eligible for memory reclamation. See Doug Helmann’s PyMOTW GC article for some information (more can perhaps be found with google and some determination). The garbage collector is run by default every 700 or so objects created-but-not-reclaimed, with subsequent generations running somewhat less often (I forget the exact details).
Using a standard tuple instead of a Point class can save you some time (using a namedtuple is somewhere in between), and clever unpacking can save you some time, but the largest gain can be had by turning the gc off before your creation of lots of objects that you know don’t need to be gc’d, and then turning it back on afterwards.
Some code:
def orig_test_gc_off():
import time
data = get_data()
all_point_sets = []
import gc
gc.disable()
time_start = time.time()
for row in data:
point_set = []
first_points, second_points = row
# Convert points from strings to integers
first_points = map(int, first_points.split(","))
second_points = map(int, second_points.split(","))
paired_points = zip(first_points, second_points)
curr_points = [Point(p[0], p[1])
for p in paired_points]
all_point_sets.append(curr_points)
time_end = time.time()
gc.enable()
print "gc off total time: ", (time_end - time_start)
def test1():
import time
import gc
data = get_data()
all_point_sets = []
time_start = time.time()
gc.disable()
for index, row in enumerate(data):
first_points, second_points = row
curr_points = map(
Point,
[int(i) for i in first_points.split(",")],
[int(i) for i in second_points.split(",")])
all_point_sets.append(curr_points)
time_end = time.time()
gc.enable()
print "variant 1 total time: ", (time_end - time_start)
def test2():
import time
import gc
data = get_data()
all_point_sets = []
gc.disable()
time_start = time.time()
for index, row in enumerate(data):
first_points, second_points = row
first_points = [int(i) for i in first_points.split(",")]
second_points = [int(i) for i in second_points.split(",")]
curr_points = [(x, y) for x, y in zip(first_points, second_points)]
all_point_sets.append(curr_points)
time_end = time.time()
gc.enable()
print "variant 2 total time: ", (time_end - time_start)
orig_test()
orig_test_gc_off()
test1()
test2()
Some results:
>>> %run /tmp/flup.py
total time: 6.90738511086
gc off total time: 4.94075202942
variant 1 total time: 4.41632509232
variant 2 total time: 3.23905301094
I got a 50% improvement by using arrays, and a holder object that lazily constructs Point objects when accessed. I also “slotted” the Point object for better storage efficiency. However, a tuple would probably be better.
Changing the data structure may also help, if that’s possible. But this will never be instantaneous.
from array import array
class Point(object):
__slots__ = ["a", "b"]
def __init__(self, a, b):
self.a = a
self.b = b
def __repr__(self):
return "Point(%d, %d)" % (self.a, self.b)
class Points(object):
def __init__(self, xs, ys):
self.xs = xs
self.ys = ys
def __getitem__(self, i):
return Point(self.xs[i], self.ys[i])
def test3():
xs = array("i")
ys = array("i")
time_start = time.time()
for row in data:
xs.extend([int(val) for val in row[0].split(",")])
ys.extend([int(val) for val in row[1].split(",")])
print ("total time: ", (time.time() - time_start))
return Points(xs, ys)
But when dealing with large amounts of data I would usually use numpy N dimensional arrays (ndarray). If the original data structure could be altered then that would likely be fastest of all. If it could be structured to read x,y pairs in linearly and then reshape the ndarray.
I would
- use
numpy
arrays for this problem (Cython
would be an option, if this is still not fast enough). - store the points as a vector not as single
Point
instances. - rely on existing parsers
- (if possible) parse the data once and than store it in a binary format like hdf5 for further calculations, which will be the fastest option (see below)
Numpy has built in functions to read text files, for instance loadtxt
.
If you have the data stored in a structured array, you do not necessarily need to convert it to another data type.
I will use Pandas which is a library build on top of numpy
. It is a bit more convenient for handling and processing structured data. Pandas
has its own file parser read_csv
.
To time it, I wrote the data to a file, like in your original problem (it is based on your get_data
):
import numpy as np
import pandas as pd
def create_example_file(n=100000, m=20):
ex1 = pd.DataFrame(np.random.randint(1, 10, size=(10e4, m)),
columns=(['x_%d' % x for x in range(10)] +
['y_%d' % y for y in range(10)]))
ex1.to_csv('example.csv', index=False, header=False)
return
This is the code I used to read the data in a pandas.DataFrame
:
def with_read_csv(csv_file):
df = pd.read_csv(csv_file, header=None,
names=(['x_%d' % x for x in range(10)] +
['y_%d' % y for y in range(10)]))
return df
(Note that I assumed, that there is no header in your file and so I had to create the column names.)
Reading the data is fast, it should be more memory efficient (see this question) and the data is stored in a data structure you can further work with in a fast, vectorized way:
In [18]: %timeit string_to_object.with_read_csv('example.csv')
1 loops, best of 3: 553 ms per loop
There is a new C based parser in an development branch which takes 414 ms on my system.
Your test takes 2.29 s on my system, but it is not really comparable, as the data is not read from a file and you created the Point
instances.
If you have once read in the data you can store it in a hdf5
file:
In [19]: store = pd.HDFStore('example.h5')
In [20]: store['data'] = df
In [21]: store.close()
Next time you need the data you can read it from this file, which is really fast:
In [1]: store = pd.HDFStore('example.h5')
In [2]: %timeit df = store['data']
100 loops, best of 3: 16.5 ms per loop
However it will only be applicable, if you need the same data more than one time.
Using numpy
based arrays with large data sets will have advantages when you are doing further calculations. Cython
wouldn’t necessarily be faster if you can use vectorized numpy
functions and indexing, it will be faster if you really need iteration (see also this answer).
Simply running with pypy makes a big difference
$ python pairing_strings.py
total time: 2.09194397926
$ pypy pairing_strings.py
total time: 0.764246940613
disable gc didn’t help for pypy
$ pypy pairing_strings.py
total time: 0.763386964798
namedtuple for Point makes it worse
$ pypy pairing_strings.py
total time: 0.888827085495
using itertools.imap, and itertools.izip
$ pypy pairing_strings.py
total time: 0.615751981735
Using a memoized version of int and an iterator to avoid the zip
$ pypy pairing_strings.py
total time: 0.423738002777
Here is the code I finished with.
def test():
import time
def m_int(s, memo={}):
if s in memo:
return memo[s]
else:
retval = memo[s] = int(s)
return retval
data = get_data()
all_point_sets = []
time_start = time.time()
for xs, ys in data:
point_set = []
# Convert points from strings to integers
y_iter = iter(ys.split(","))
curr_points = [Point(m_int(i), m_int(next(y_iter))) for i in xs.split(",")]
all_point_sets.append(curr_points)
time_end = time.time()
print "total time: ", (time_end - time_start)
The data is a tab separated file, which consists of lists of comma
separated integers.
Using sample get_data()
I made a .csv
file like this:
1,6,2,8,2,3,5,9,6,6 10,4,10,5,7,9,6,1,9,5
6,2,2,5,2,2,1,7,7,9 7,6,7,1,3,7,6,2,10,5
8,8,9,2,6,10,10,7,8,9 4,2,10,3,4,4,1,2,2,9
...
Then I abused C-optimized parsing via JSON:
def test2():
import json
import time
time_start = time.time()
with open('data.csv', 'rb') as f:
data = f.read()
data = '[[[' + ']],[['.join(data.splitlines()).replace('t', '],[') + ']]]'
all_point_sets = [Point(*xy) for row in json.loads(data) for xy in zip(*row)]
time_end = time.time()
print "total time: ", (time_end - time_start)
Results on my box: your original test()
~8s, with gc disabled ~6s, while my version (I/O included) gives ~6s and ~4s respectively. Ie about ~50% speed up. But looking at profiler data it’s obvious that biggest bottleneck is in object instantiation itself, so Matt Anderson‘s answer would net you the most gain on CPython.
Faster method, using Numpy (speedup of about 7x):
import numpy as np
txt = ','.join(','.join(row) for row in data)
arr = np.fromstring(txt, dtype=int, sep=',')
return arr.reshape(100000, 2, 10).transpose((0,2,1))
Performance comparison:
def load_1(data):
all_point_sets = []
gc.disable()
for xs, ys in data:
all_point_sets.append(zip(map(int, xs.split(',')), map(int, ys.split(','))))
gc.enable()
return all_point_sets
def load_2(data):
txt = ','.join(','.join(row) for row in data)
arr = np.fromstring(txt, dtype=int, sep=',')
return arr.reshape(100000, 2, 10).transpose((0,2,1))
load_1
runs in 1.52 seconds on my machine; load_2
runs in 0.20 seconds, a 7-fold improvement. The big caveat here is that it requires that you (1) know the lengths of everything in advance, and (2) that every row contains the exact same number of points. This is true for your get_data
output, but may not be true for your real dataset.
cython is able to speed things up by a factor of 5.5
$ python split.py
total time: 2.16252303123
total time: 0.393486022949
Here is the code I used
split.py
import time
import pyximport; pyximport.install()
from split_ import test_

def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 100)) for x in range(M)],
                [str(random.randint(1, 100)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data

class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test(data):
    all_point_sets = []
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1])
                       for p in paired_points]
        all_point_sets.append(curr_points)
    return all_point_sets

data = get_data()
for func in test, test_:
    time_start = time.time()
    res = func(data)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
split_.pyx
from libc.string cimport strsep
from libc.stdlib cimport atoi

cdef class Point:
    cdef public int a, b
    def __cinit__(self, a, b):
        self.a = a
        self.b = b

def test_(data):
    cdef char *xc, *yc, *xt, *yt
    cdef char **xcp, **ycp
    all_point_sets = []
    for xs, ys in data:
        xc = xs
        xcp = &xc
        yc = ys
        ycp = &yc
        point_set = []
        while True:
            xt = strsep(xcp, ",")
            if xt is NULL:
                break
            yt = strsep(ycp, ",")
            point_set.append(Point(atoi(xt), atoi(yt)))
        all_point_sets.append(point_set)
    return all_point_sets
Poking around further, I can approximately break down some of the CPU resources:
5% strsep()
9% atoi()
23% creating Point instances
35% all_point_sets.append(point_set)
I would expect there would be a further improvement if the Cython code were able to read from a CSV (or whatever) file directly, instead of having to trawl through a Python object.
As the time taken for built-in functions such as zip(a, b) or map(int, string.split(",")) on arrays of length 2,000,000 is negligible, I have to presume that the most time-consuming operation is append. Thus the proper way to address the problem is to recursively concatenate the strings:
10 strings of 10 elements to a bigger string
10 strings of 100 elements
10 strings of 1000 elements
and finally to zip(map(int,huge_string_a.split(",")),map(int,huge_string_b.split(",")));
It's then just fine-tuning to find the optimal base N for the append-and-conquer method.
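A minimal sketch of the end state described above, with the chunked concatenation collapsed into a single join per column (CPython's str.join is linear-time, so one join per column is a reasonable base case; the function name is mine). Note that this flattens away the per-row grouping of the original output:

```python
def pair_all(data):
    # Concatenate every x-string into one huge string, ditto for y,
    # then do a single split/int/zip pass over each.
    huge_a = ",".join(row[0] for row in data)
    huge_b = ",".join(row[1] for row in data)
    return list(zip(list(map(int, huge_a.split(","))),
                    list(map(int, huge_b.split(",")))))
```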
There are many good answers here. One side of this issue not addressed so far, however, is the list-to-string time cost differences between the various iterator implementations in Python. There is an essay testing the efficiency of different iterators with respect to list-to-string conversion in the Python.org essays: list2str. Bear in mind that when I ran into similar optimization problems, but with a different data structure and sizes, the results presented in the essay did not all scale up at an equal rate, so it's worthwhile testing the different iterator implementations for your particular use case.
How attached are you to having your coordinates accessible as .x and .y attributes? To my surprise, my tests show that the biggest single time sink was not the calls to list.append(), but the construction of the Point objects. They take four times as long to construct as a tuple, and there are a lot of them. Simply replacing Point(int(x), int(y)) with a tuple (int(x), int(y)) in your code shaved over 50% off the total execution time (Python 2.6 on Win XP). Perhaps your current code still has room to optimize this?
If you are really set on accessing the coordinates with .x and .y, you can try using collections.namedtuple. It's not as fast as plain tuples, but seems to be much faster than the Point class in your code (I'm hedging because a separate timing benchmark gave me weird results).
from collections import namedtuple

Pair = namedtuple("Pair", "x y")  # instead of the Point class
...
curr_points = [Pair(x, y) for x, y in paired_points]
If you need to go this route, it also pays off to derive a class from tuple (minimal cost over plain tuple). I can provide details if requested.
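The tuple subclass mentioned above might look like the sketch below (my own illustration; the `Pair` name here shadows the namedtuple version). Construction stays close to plain-tuple speed while still exposing .x and .y:

```python
class Pair(tuple):
    # An empty __slots__ prevents per-instance __dict__ creation,
    # which is where most of the Point class's overhead comes from.
    __slots__ = ()

    def __new__(cls, x, y):
        return tuple.__new__(cls, (x, y))

    @property
    def x(self):
        return self[0]

    @property
    def y(self):
        return self[1]
```

Since it is still a tuple, it also unpacks and compares like one.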
P.S. I see @MattAnderson mentioned the object-vs-tuple issue long ago. But it's a major effect (on my box, at least), even before disabling garbage collection.
Original code: total time: 15.79
tuple instead of Point: total time: 7.328
namedtuple instead of Point: total time: 9.140