Splitting a string into an iterator

Question:

Does Python have a built-in (meaning in the standard library) way to split a string that produces an iterator rather than a list? I have in mind working on very long strings and not needing to consume most of the string.

Asked By: pythonic metaphor

Answers:

Not directly splitting strings as such, but the re module has re.finditer() (and a corresponding finditer() method on any compiled regular expression), which returns an iterator over the matches.

@Zero asked for an example:

>>> import re
>>> s = "The quick    brown\nfox"
>>> for m in re.finditer(r'\S+', s):
...     print(m.span(), m.group(0))
... 
(0, 3) The
(4, 9) quick
(13, 18) brown
(19, 22) fox
Answered By: Duncan
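
A minimal sketch of how the re.finditer() approach above could be wrapped into a lazy split, assuming you only want whitespace-separated tokens. The names iter_words and long_text are hypothetical, chosen for illustration. Because finditer scans on demand, taking the first few tokens with itertools.islice does not tokenize the rest of the string.

```python
import re
from itertools import islice

def iter_words(text):
    """Lazily yield whitespace-separated tokens from text (hypothetical helper)."""
    # finditer returns an iterator, so matching stops as soon as
    # the caller stops asking for tokens.
    for match in re.finditer(r'\S+', text):
        yield match.group(0)

# A long string of which we only need the first three tokens.
long_text = "alpha beta gamma " * 100000
first_three = list(islice(iter_words(long_text), 3))
print(first_three)
```

Only the first three matches are computed here; the remainder of long_text is never scanned.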

Like S.Lott, I don’t quite know what you want. Here is code that may help:

s = "This is a string."
for character in s:
    print(character)
for word in s.split(' '):
    print(word)

There are also s.index() and s.find() for finding the next character.


Later: Okay, something like this.

>>> def tokenizer(s, c):
...     i = 0
...     while True:
...         try:
...             j = s.index(c, i)
...         except ValueError:
...             yield s[i:]
...             return
...         yield s[i:j]
...         i = j + len(c)  # skip the whole separator, not just one character
... 
>>> for w in tokenizer(s, ' '):
...     print(w)
... 
This
is
a
string.
Answered By: hughdbrown

You could use something like SPARK (which has been absorbed into the Python distribution itself, though not importable from the standard library), but ultimately it uses regular expressions as well so Duncan’s answer would possibly serve you just as well if it was as easy as just "splitting on whitespace".

The other, far more arduous option would be to write your own Python module in C to do it if you really wanted speed, but that’s a far larger time investment of course.

Answered By: Daniel DiPaolo

If you don’t need to consume the whole string, that’s because you are looking for something specific, right? Then just look for that, with re or .find() instead of splitting. That way you can find the part of the string you are interested in, and split that.

Answered By: Lennart Regebro
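
One way to sketch the "find first, then split" idea above, assuming the interesting part is the single line that contains a known substring. The function name split_line_containing and the sample data are hypothetical. Only that one line is sliced out and split; the rest of the string is searched but never copied or tokenized.

```python
def split_line_containing(text, needle):
    """Return the whitespace-split fields of the first line containing needle,
    or None if needle does not occur (hypothetical helper)."""
    pos = text.find(needle)
    if pos == -1:
        return None
    # rfind returns -1 when there is no earlier newline, so +1 gives index 0.
    start = text.rfind('\n', 0, pos) + 1
    end = text.find('\n', pos)
    if end == -1:
        end = len(text)
    # Split only the small slice of interest.
    return text[start:end].split()

records = "id=1 name=ann\nid=2 name=bob\nid=3 name=cy"
fields = split_line_containing(records, "name=bob")
print(fields)
```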

Look at itertools. It contains tools such as takewhile, islice and groupby that let you slice an iterable (and a string is iterable) into another iterable, based on either indexes or a boolean condition.

Answered By: izak
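
As a rough illustration of the itertools suggestion, here is a sketch that uses groupby to split on runs of whitespace lazily. The name isplit_chars is hypothetical. Note that this walks the string one character at a time in Python, so it is likely to be slower than a regex-based approach; it is meant to show the technique, not to be the fastest option.

```python
from itertools import groupby, islice

def isplit_chars(text):
    """Lazily yield whitespace-separated words using groupby (hypothetical helper)."""
    # groupby clusters consecutive characters by whether they are whitespace.
    for is_space, chars in groupby(text, key=str.isspace):
        if not is_space:
            yield ''.join(chars)

all_words = list(isplit_chars(" a  b c "))
first_two = list(islice(isplit_chars("x y z w"), 2))
print(all_words, first_two)
```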

There is no built-in iterator-based analog of str.split. Depending on your needs you could make a list iterator (note that str.split still builds the whole list first):

iterator = iter("abcdcba".split("b"))
iterator
# <list_iterator at 0x49159b0>
next(iterator)
# 'a'

However, the third-party library more_itertools likely offers what you want: more_itertools.split_at. See also this post for an example.

Answered By: pylang

Here’s an isplit function, which behaves much like split. You can turn off the regex syntax with the regex argument. It uses re.finditer and yields the strings in between the matches.

import re

def isplit(s, splitter=r'\s+', regex=True):
    if not regex:
        splitter = re.escape(splitter)

    start = 0

    for m in re.finditer(splitter, s):
        begin, end = m.span()
        if begin != start:
            yield s[start:begin]
        start = end

    if s[start:]:
        yield s[start:]


_examples = ['', 'a', 'a b', ' a  b c ', '\na\tb ']

def test_isplit():
    for example in _examples:
        assert list(isplit(example)) == example.split(), 'Wrong for {!r}: {} != {}'.format(
            example, list(isplit(example)), example.split()
        )
Answered By: Tomasz Gandor