Split a string "aabbcc" -> ["aa", "bb", "cc"] without re.split
Question:
I would like to split a string as shown in the title, in a single call. I’m looking for a simple syntax using a list comprehension, but I haven’t found it yet:
s = "123456"
And the result would be:
["12", "34", "56"]
What I don’t want:
re.split('(?i)([0-9a-f]{2})', s)
s[0:2], s[2:4], s[4:6]
[s[i*2:i*2+2] for i in range(len(s) // 2)]
Edit:
OK, I wanted to parse a hex RGB[A] color (and possibly other color/component formats) to extract all the components.
It seems that the fastest approach would be the last one from sven-marnach:
- sven-marnach xrange: 0.883 usec per loop
python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
- pair/iter: 1.38 usec per loop
python -m timeit -s 's="aabbcc"' '["%c%c" % pair for pair in zip(* 2 * [iter(s)])]'
- Regex: 2.55 usec per loop
python -m timeit -s 'import re; s="aabbcc"; c=re.compile("(?i)([0-9a-f]{2})"); split=re.split' '[int(x, 16) / 255. for x in split(c, s) if x != ""]'
Answers:
In [4]: ["".join(pair) for pair in zip(* 2 * [iter(s)])]
Out[4]: ['aa', 'bb', 'cc']
See: How does zip(*[iter(s)]*n) work in Python? for explanations as to that strange “2-iter over the same str” syntax.
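As a rough illustration of why the idiom groups characters in pairs, here is a minimal sketch written for Python 3. The variable names `it`, `pairs` and `chunks` are only illustrative and are not part of the original answer.

```python
s = "aabbcc"

# One iterator object, referenced twice: both zip arguments consume
# characters from the same underlying stream.
it = iter(s)
pairs = list(zip(it, it))  # equivalent to zip(*2 * [iter(s)])

# Each zip step pulls one character for the first slot and then the
# next character for the second slot, so characters are grouped in twos.
chunks = ["".join(pair) for pair in pairs]
print(chunks)  # ['aa', 'bb', 'cc']
```

One consequence worth keeping in mind: if the string has an odd length, the trailing character is silently dropped, because zip stops as soon as the second slot cannot be filled.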
You say in the comments that you want to “have the fastest execution”. I can’t promise you that with this implementation, but you can measure the execution time using timeit. Remember what Donald Knuth said about premature optimisation, of course. For the problem at hand (now that you’ve revealed it), I think you’d find r, g, b = s[0:2], s[2:4], s[4:6] hard to beat.
$ python3.2 -m timeit -c '
s = "aabbcc"
["".join(pair) for pair in zip(* 2 * [iter(s)])]
'
100000 loops, best of 3: 4.49 usec per loop
Cf.
$ python3.2 -m timeit -c '
s = "aabbcc"
r, g, b = s[0:2], s[2:4], s[4:6]
'
1000000 loops, best of 3: 1.2 usec per loop
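The same comparison can also be run from inside Python with the `timeit` module instead of the command line. The following is a sketch only; the helper names `split_zip` and `split_slices` are made up for this example, and the numbers it prints will depend on the machine and Python version.

```python
import timeit

s = "aabbcc"

def split_zip(s):
    # Pair up characters by drawing twice from one shared iterator.
    return ["".join(pair) for pair in zip(*2 * [iter(s)])]

def split_slices(s):
    # Fixed slices; only meaningful for a six-character string.
    return [s[0:2], s[2:4], s[4:6]]

# Both approaches must agree before their timings are worth comparing.
assert split_zip(s) == split_slices(s)

number = 100000
for func in (split_zip, split_slices):
    # timeit.timeit returns the total time in seconds for `number` calls.
    total = timeit.timeit(lambda: func(s), number=number)
    print("%s: %.3f usec per loop" % (func.__name__, total / number * 1e6))
```

Wrapping each call in a lambda adds a small constant overhead to both measurements, so the absolute figures will be slightly higher than those from the command line, but the relative ordering should still be visible.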
Reading through the comments, it turns out the actual question is: What is the fastest way to parse a color definition string in hexadecimal RRGGBBAA format. Here are some options:
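import struct  # Python 2 code: relies on str.decode("hex") and xrange
def rgba1(s, unpack=struct.unpack):
    # Decode the hex string to 4 raw bytes, then unpack them as unsigned chars.
    return unpack("BBBB", s.decode("hex"))
def rgba2(s, int=int, xrange=xrange):
    # Parse each two-character slice as a base-16 integer.
    return [int(s[i:i+2], 16) for i in xrange(0, 8, 2)]
def rgba3(s, int=int, xrange=xrange):
    # Parse once, then mask out each byte. Shifts start at the low byte,
    # so the components come out in A, B, G, R order.
    x = int(s, 16)
    return [(x >> i) & 255 for i in xrange(0, 32, 8)]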
As I expected, the first version turns out to be fastest:
In [6]: timeit rgba1("aabbccdd")
1000000 loops, best of 3: 1.44 us per loop
In [7]: timeit rgba2("aabbccdd")
100000 loops, best of 3: 2.43 us per loop
In [8]: timeit rgba3("aabbccdd")
100000 loops, best of 3: 2.44 us per loop
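The three options above are Python 2 code (`s.decode("hex")` and `xrange` do not exist in Python 3). A possible Python 3 rendering is sketched below, assuming `bytes.fromhex` as the replacement for the hex decode. Note one deliberate difference: this `rgba3` shifts from the high byte down so that it returns R, G, B, A order, whereas the original shifts from the low byte up. No timings are claimed for these versions.

```python
import struct

def rgba1(s):
    # bytes.fromhex replaces the Python 2-only s.decode("hex").
    return struct.unpack("BBBB", bytes.fromhex(s))

def rgba2(s):
    # Parse each two-character slice as a base-16 integer.
    return [int(s[i:i + 2], 16) for i in range(0, 8, 2)]

def rgba3(s):
    # Parse once, then shift from the most significant byte down so the
    # result comes out in R, G, B, A order.
    x = int(s, 16)
    return [(x >> shift) & 255 for shift in range(24, -8, -8)]

print(rgba1("aabbccdd"))  # (170, 187, 204, 221)
print(rgba2("aabbccdd"))  # [170, 187, 204, 221]
print(rgba3("aabbccdd"))  # [170, 187, 204, 221]
```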
NumPy is slower than your preferred solution for a single lookup:
$ python -m timeit -s 'import numpy as np; s="aabbccdd"' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; list(a)'
100000 loops, best of 3: 5.14 usec per loop
$ python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
100000 loops, best of 3: 2.41 usec per loop
But if you do several conversions at once, numpy is much faster:
$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.tolist()'
10000 loops, best of 3: 59.6 usec per loop
$ python -m timeit -s 's="aabbccdd" * 100;' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
1000 loops, best of 3: 240 usec per loop
NumPy is faster for batches larger than 2, on my computer. You can easily group the values by setting a.shape to (number_of_colors, 4), though it makes the tolist method 50% slower.
In fact, most of the time is spent converting the array to a list. Depending on what you wish to do with the results, you may be able to skip this intermediate step, and reap some benefits:
$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.shape = (100,4)'
100000 loops, best of 3: 6.76 usec per loop
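For readers on Python 3 with a recent NumPy, the grouping step might look like the sketch below. It assumes `bytes.fromhex` for the hex decode and `np.frombuffer` in place of `np.fromstring` (which is deprecated for binary input), and it reads the bytes directly as `uint8` rather than reinterpreting a `uint32` array. The name `colors` is illustrative.

```python
import numpy as np

s = "aabbccdd" * 100  # 100 RRGGBBAA colors back to back

# bytes.fromhex yields the raw bytes; np.frombuffer views them as uint8
# without copying (the resulting array is read-only).
a = np.frombuffer(bytes.fromhex(s), dtype=np.uint8)

# One row per color, one column per component (R, G, B, A).
colors = a.reshape(-1, 4)
print(colors.shape)  # (100, 4)
print(colors[0])     # [170 187 204 221]
```

Keeping the result as an array, rather than calling `tolist`, avoids the conversion cost mentioned above when later steps can work on arrays directly.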