Iterate over a Python sequence in multiples of n?
Question:
How do I process the elements of a sequence in batches, idiomatically?
For example, with the sequence “abcdef” and a batch size of 2, I would like to do something like the following:
for x, y in "abcdef":
    print "%s%s\n" % (x, y)
ab
cd
ef
Of course, this doesn’t work: iterating over a string yields one character at a time, which can’t be unpacked into the two names the loop expects.
What is a nice, short, clean, pythonic way to process the next n elements of a list in a batch, or sub-strings of length n from a larger string (two similar problems)?
Answers:
One solution, although I challenge someone to do better 😉
a = 'abcdef'
b = [[a[i-1], a[i]] for i in range(1, len(a), 2)]
for x, y in b:
    print "%s%s\n" % (x, y)
I am sure someone is going to come up with something more “Pythonic”, but how about:
for y in range(0, len(x), 2):
    print "%s%s" % (x[y], x[y+1])
Note that this only works if you know that len(x) % 2 == 0.
Alternatively, you can create the following generator:
def chunks(seq, size):
    a = range(0, len(seq), size)
    b = range(size, len(seq) + 1, size)
    for i, j in zip(a, b):
        yield seq[i:j]
and use it like this:
for i in chunks('abcdef', 2):
    print(i)
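One caveat worth noting (my observation, not part of the original answer): because the stop indices come from range(size, len(seq) + 1, size), a trailing partial chunk is silently dropped when len(seq) is not a multiple of size:

```python
def chunks(seq, size):
    # same generator as above
    a = range(0, len(seq), size)
    b = range(size, len(seq) + 1, size)
    for i, j in zip(a, b):
        yield seq[i:j]

# 'g' never appears: the stop indices end at 6, so seq[6:7] is never yielded
print(list(chunks('abcdefg', 2)))  # ['ab', 'cd', 'ef']
```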
A generator function would be neat:
def batch_gen(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i+batch_size]
Example use:
a = "abcdef"
for i in batch_gen(a, 2): print i
prints:
ab
cd
ef
Don’t forget about the zip() function:
a = 'abcdef'
for x, y in zip(a[::2], a[1::2]):
    print '%s%s' % (x, y)
but the more general way would be (inspired by this answer):
for i in zip(*(seq[i::size] for i in range(size))):
    print(i)  # tuple of individual values
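To make the general form concrete (seq and size are left undefined above, so they are filled in here), and to show that this approach, like the zip-of-slices version, drops trailing items that don’t fill a complete group:

```python
seq, size = 'abcdefg', 2

# zip stops at the shortest slice, so the trailing 'g' is dropped
groups = list(zip(*(seq[i::size] for i in range(size))))
print(groups)  # [('a', 'b'), ('c', 'd'), ('e', 'f')]
```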
I’ve got an alternative approach, that works for iterables that don’t have a known length.
def groupsgen(seq, size):
    it = iter(seq)
    while True:
        values = ()
        for n in xrange(size):
            values += (it.next(),)
        yield values
It works by iterating over the sequence (or other iterator) in groups of size, collecting the values in a tuple; at the end of each group, it yields the tuple.
When the iterator runs out of values, the resulting StopIteration propagates up, indicating that groupsgen is out of values. (Note this is Python 2 code; since PEP 479, a StopIteration escaping a generator body becomes a RuntimeError in Python 3.7+, so the adaptation further down is needed there.)
It assumes that the values come in sets of size (sets of 2, 3, etc). If not, any values left over are just discarded.
How about itertools?
from itertools import islice, groupby

def chunks_islice(seq, size):
    it = iter(seq)  # islice must consume a single iterator; slicing a plain string restarts each time and loops forever
    while True:
        aux = list(islice(it, size))
        if not aux:
            break
        yield "".join(aux)

def chunks_groupby(seq, size):
    # floor division keeps the key an integer group index (plain / is float division in Python 3)
    for k, chunk in groupby(enumerate(seq), lambda x: x[0] // size):
        yield "".join([i[1] for i in chunk])
>>> a = "abcdef"
>>> size = 2
>>> [a[x:x+size] for x in range(0, len(a), size)]
['ab', 'cd', 'ef']
..or, not as a list comprehension:
a = "abcdef"
size = 2
output = []
for x in range(0, len(a), size):
    output.append(a[x:x+size])
Or, as a generator, which would be best if used multiple times (for a one-use thing, the list comprehension is probably “best”):
def chunker(thelist, segsize):
    for x in range(0, len(thelist), segsize):
        yield thelist[x:x+segsize]
..and its usage:
>>> for seg in chunker(a, 2):
... print seg
...
ab
cd
ef
And then there’s always the documentation.
from itertools import tee, izip, chain, repeat

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    try:
        b.next()
    except StopIteration:
        pass
    return izip(a, b)

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)
Note: these produce tuples instead of substrings, when given a string sequence as input.
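A sketch of how the grouper recipe can be combined with str.join to get substrings back, ported to Python 3 (zip replaces izip; this port is mine, not part of the answer). The fill character 'x' shows up in the last group, as in the docstring:

```python
from itertools import chain, repeat

def grouper(n, iterable, padvalue=None):
    # Python 3 port of the recipe above: one chained iterator, repeated n times,
    # so zip pulls n consecutive items per output tuple
    return zip(*[chain(iterable, repeat(padvalue, n - 1))] * n)

print(["".join(g) for g in grouper(3, 'abcdefg', 'x')])  # ['abc', 'def', 'gxx']
```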
s = 'abcdefgh'
for e in (s[i:i+2] for i in range(0, len(s), 2)):
    print(e)
The itertools doc has a recipe for this:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
Usage:
>>> l = [1,2,3,4,5,6,7,8,9]
>>> [z for z in grouper(l, 3)]
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]
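On Python 3 the same recipe works with zip_longest in place of izip_longest (this port is mine, not part of the answer):

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

print(list(grouper([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)))  # [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
```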
Except for two answers, I saw a lot of premature materialization of the batches, and a lot of subscripting (which does not work for all iterators). Hence I came up with this alternative:
def iter_x_and_n(iterable, x, n):
    yield x
    try:
        for _ in range(n):
            yield next(iterable)
    except StopIteration:
        pass

def batched(iterable, n):
    if n < 1:
        raise ValueError("Cannot create batches of size %d, number must be strictly positive" % n)
    iterable = iter(iterable)
    try:
        for x in iterable:
            yield iter_x_and_n(iterable, x, n - 1)
    except StopIteration:
        pass
It beats me that there is no one-liner or few-liner solution for this (to the best of my knowledge). The key issue is that both the outer and the inner generator need to handle StopIteration correctly: the outer generator should only yield something if there is at least one element left, and the intuitive way to check that is to execute a next(…) and catch the StopIteration.
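A quick usage sketch (the two functions above reproduced verbatim). Note that each inner generator must be consumed before moving on to the next batch, since they all share the same underlying iterator:

```python
def iter_x_and_n(iterable, x, n):
    yield x
    try:
        for _ in range(n):
            yield next(iterable)
    except StopIteration:
        pass

def batched(iterable, n):
    if n < 1:
        raise ValueError("Cannot create batches of size %d, number must be strictly positive" % n)
    iterable = iter(iterable)
    try:
        for x in iterable:
            yield iter_x_and_n(iterable, x, n - 1)
    except StopIteration:
        pass

# each batch is a lazy generator; listing them in order works, and the
# final short batch is kept rather than discarded
print([list(b) for b in batched(range(7), 3)])  # [[0, 1, 2], [3, 4, 5], [6]]
```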
From the docs of more_itertools: more_itertools.chunked()
more_itertools.chunked(iterable, n)
Break an iterable into lists of a given length:
>>> list(chunked([1, 2, 3, 4, 5, 6, 7], 3))
[[1, 2, 3], [4, 5, 6], [7]]
If the length of iterable is not evenly divisible by n, the last returned list will be shorter.
Given
from __future__ import print_function # python 2.x
seq = "abcdef"
n = 2
Code
while seq:
    print("{}".format(seq[:n]), end="\n")
    seq = seq[n:]
Output
ab
cd
ef
Here is a solution that yields a series of iterators, each of which iterates over n items.
def groupiter(thing, n):
    def countiter(nextthing, thingiter, n):
        yield nextthing
        for _ in range(n - 1):
            try:
                nextitem = next(thingiter)
            except StopIteration:
                return
            yield nextitem
    thingiter = iter(thing)
    while True:
        try:
            nextthing = next(thingiter)
        except StopIteration:
            return
        yield countiter(nextthing, thingiter, n)
I use it as follows:
table = list(range(250))
for group in groupiter(table, 16):
    print(' '.join('0x{:02X},'.format(x) for x in group))
Note that it can handle the length of the object not being a multiple of n.
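One caveat (my observation, not stated in the answer): all the inner iterators share thingiter, so a group that is not fully consumed leaks its remaining items into the next group:

```python
def groupiter(thing, n):
    # same generator as above
    def countiter(nextthing, thingiter, n):
        yield nextthing
        for _ in range(n - 1):
            try:
                nextitem = next(thingiter)
            except StopIteration:
                return
            yield nextitem
    thingiter = iter(thing)
    while True:
        try:
            nextthing = next(thingiter)
        except StopIteration:
            return
        yield countiter(nextthing, thingiter, n)

groups = groupiter('abcdef', 3)
print(next(next(groups)))   # 'a' -- only the first item of the first group is taken
print(list(next(groups)))   # ['b', 'c', 'd'] -- the unconsumed 'b', 'c' spill over
```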
Adapted from this answer for Python 3:
def groupsgen(seq, size):
    it = iter(seq)
    iterating = True
    while iterating:
        values = ()
        try:
            for n in range(size):
                values += (next(it),)
        except StopIteration:
            iterating = False
        if not len(values):
            return
        yield values
It will safely terminate and won’t discard values if their number is not divisible by size.
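For example, a quick check of that claim using the Python 3 version above:

```python
def groupsgen(seq, size):
    # Python 3 adaptation from above: StopIteration is caught, not propagated
    it = iter(seq)
    iterating = True
    while iterating:
        values = ()
        try:
            for n in range(size):
                values += (next(it),)
        except StopIteration:
            iterating = False
        if not len(values):
            return
        yield values

# the trailing 'g' survives as a short final tuple
print(list(groupsgen('abcdefg', 3)))  # [('a', 'b', 'c'), ('d', 'e', 'f'), ('g',)]
```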