Split a string "aabbcc" -> ["aa", "bb", "cc"] without re.split

Question:

I would like to split a string as shown in the title, in a single call. I’m looking for a simple syntax using a list comprehension, but I haven’t found one yet:

s = "123456"

And the result would be:

["12", "34", "56"]

What I don’t want:

re.split('(?i)([0-9a-f]{2})', s)
s[0:2], s[2:4], s[4:6]
[s[i*2:i*2+2] for i in range(len(s) // 2)]

Edit:

OK, I wanted to parse a hex RGB[A] color (and possibly other color/component formats) to extract all the components.
It seems that the fastest approach is the last one from sven-marnach:

  1. sven-marnach xrange: 0.883 usec per loop

    python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
    
  2. pair/iter: 1.38 usec per loop

    python -m timeit -s 's="aabbcc"' '["%c%c" % pair for pair in zip(* 2 * [iter(s)])]'
    
  3. Regex: 2.55 usec per loop

    python -m timeit -s 'import re; s="aabbcc"; c=re.compile("(?i)([0-9a-f]{2})"); split=re.split' '[int(x, 16) / 255. for x in split(c, s) if x != ""]'
    
Asked By: tito


Answers:

In [4]: ["".join(pair) for pair in zip(* 2 * [iter(s)])]
Out[4]: ['aa', 'bb', 'cc']

See “How does zip(*[iter(s)]*n) work in Python?” for an explanation of that strange “2-iter over the same str” syntax.
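In case the linked explanation is not at hand, here is a minimal sketch of what the idiom does, written for Python 3. The helper name pairs and its parameter n are made up for illustration and are not part of the answer above.

```python
def pairs(s, n=2):
    """Split s into consecutive chunks of n characters (hypothetical helper)."""
    it = iter(s)  # one single iterator over the string
    # [it] * n is a list holding the SAME iterator n times, so zip() pulls
    # n successive characters from it to build each tuple.
    return ["".join(chunk) for chunk in zip(*[it] * n)]

result = pairs("aabbcc")
# zip() stops as soon as the shared iterator runs dry, so a trailing
# incomplete chunk is silently dropped.
odd = pairs("aabbc")
```

The silent truncation is worth keeping in mind if the input length is not guaranteed to be a multiple of n.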


You say in the comments that you want to “have the fastest execution”. I can’t promise you that with this implementation, but you can measure the execution time using timeit. Remember what Donald Knuth said about premature optimisation, of course. For the problem at hand (now that you’ve revealed it) I think you’d find r, g, b = s[0:2], s[2:4], s[4:6] hard to beat.

$ python3.2 -m timeit -c '
s = "aabbcc"
["".join(pair) for pair in zip(* 2 * [iter(s)])]
'
100000 loops, best of 3: 4.49 usec per loop

Cf.

python3.2 -m timeit -c '
s = "aabbcc"
r, g, b = s[0:2], s[2:4], s[4:6]
'
1000000 loops, best of 3: 1.2 usec per loop
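
The same comparison can also be run from inside Python with the timeit module rather than from the shell. This is only a sketch: the labels in the candidates dict and the loop count are arbitrary choices, and the absolute numbers will differ from the ones quoted above depending on machine and Python version.

```python
import timeit

setup = 's = "aabbcc"'
# The two statements timed in this answer, keyed by an arbitrary label.
candidates = {
    "zip/iter": '["".join(pair) for pair in zip(* 2 * [iter(s)])]',
    "slices": "r, g, b = s[0:2], s[2:4], s[4:6]",
}

loops = 10000
timings = {}
for name, stmt in candidates.items():
    # repeat() returns one total time per run; the minimum is the least noisy.
    best = min(timeit.repeat(stmt, setup=setup, number=loops, repeat=3))
    timings[name] = best / loops * 1e6  # microseconds per loop

for name, usec in timings.items():
    print("%s: %.3f usec per loop" % (name, usec))
```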
Answered By: johnsyweb

Reading through the comments, it turns out the actual question is: What is the fastest way to parse a color definition string in hexadecimal RRGGBBAA format? Here are some options:

import struct

# Python 2 only: str.decode("hex") does not exist in Python 3.
def rgba1(s, unpack=struct.unpack):
    return unpack("BBBB", s.decode("hex"))

def rgba2(s, int=int, xrange=xrange):
    return [int(s[i:i+2], 16) for i in xrange(0, 8, 2)]

def rgba3(s, int=int, xrange=xrange):
    x = int(s, 16)
    return [(x >> i) & 255 for i in xrange(0, 32, 8)]

As I expected, the first version turns out to be fastest:

In [6]: timeit rgba1("aabbccdd")
1000000 loops, best of 3: 1.44 us per loop

In [7]: timeit rgba2("aabbccdd")
100000 loops, best of 3: 2.43 us per loop

In [8]: timeit rgba3("aabbccdd")
100000 loops, best of 3: 2.44 us per loop
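
The functions above are Python 2 code (s.decode("hex"), xrange). A rough Python 3 rendering might look like the sketch below; the function names are made up, and the timings quoted above do not carry over to it. Note that shifting from 0 upward, as rgba3 does, yields the components low byte first (A, B, G, R), so the shift-based variant here walks from the high byte down instead.

```python
import struct

def rgba_bytes(s):
    """Decode 'RRGGBBAA' with bytes.fromhex, the Python 3 counterpart of s.decode("hex")."""
    return struct.unpack("BBBB", bytes.fromhex(s))

def rgba_slices(s):
    """Convert each two-character slice separately."""
    return [int(s[i:i + 2], 16) for i in range(0, 8, 2)]

def rgba_shifts(s):
    """Parse the whole string once, then extract the bytes by shifting."""
    x = int(s, 16)
    # Shifts of 24, 16, 8, 0 give the components in R, G, B, A order.
    return [(x >> i) & 255 for i in range(24, -8, -8)]

print(rgba_bytes("aabbccdd"))   # a tuple of four ints
print(rgba_slices("aabbccdd"))  # a list of four ints
print(rgba_shifts("aabbccdd"))  # a list of four ints
```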
Answered By: Sven Marnach

NumPy is slower than your preferred solution for a single lookup:

$ python -m timeit -s 'import numpy as np; s="aabbccdd"' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; list(a)'
100000 loops, best of 3: 5.14 usec per loop
$ python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
100000 loops, best of 3: 2.41 usec per loop

But if you do several conversions at once, numpy is much faster:

$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.tolist()'
10000 loops, best of 3: 59.6 usec per loop
$ python -m timeit -s 's="aabbccdd" * 100;' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
1000 loops, best of 3: 240 usec per loop

NumPy is faster for batches larger than 2 on my computer. You can easily group the values by setting a.shape to (number_of_colors, 4), though it makes the tolist method 50% slower.

In fact, most of the time is spent converting the array to a list. Depending on what you wish to do with the results, you may be able to skip this intermediate step and reap some benefits:

$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.shape = (100,4)'
100000 loops, best of 3: 6.76 usec per loop
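
For comparison, the grouping into (number_of_colors, 4) can also be sketched without NumPy, using only the Python 3 standard library. This is a different technique from the array reshaping above (it builds plain tuples, not an array), the name parse_colors is hypothetical, and it has not been benchmarked against the NumPy version.

```python
def parse_colors(s):
    """Split a concatenation of 'RRGGBBAA' strings into (r, g, b, a) tuples."""
    raw = bytes.fromhex(s)  # decode the whole batch in one call
    it = iter(raw)          # iterating over bytes yields ints in 0..255
    # Passing the same iterator four times makes zip() emit groups of 4.
    return list(zip(it, it, it, it))

colors = parse_colors("aabbccdd" * 3 + "00ff8040")
print(len(colors))
print(colors[-1])
```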
Answered By: Lauritz V. Thaulow