Splitting a string into an iterator

Question

Does Python have a built-in (meaning in the standard library) way to split a string that produces an iterator rather than a list? I have in mind working on very long strings without needing to consume most of the string.

Asked By: pythonic metaphor

Answer 1

There is nothing that splits strings directly, but the re module has re.finditer() (and a corresponding finditer() method on any compiled regular expression), which yields its matches lazily.

@Zero asked for an example:

>>> import re
>>> s = "The quick    brown\nfox"
>>> for m in re.finditer(r'\S+', s):
...     print(m.span(), m.group(0))
... 
(0, 3) The
(4, 9) quick
(13, 18) brown
(19, 22) fox
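
To connect this back to the question, here is a minimal sketch of wrapping re.finditer() in a generator so that only the tokens you actually request are scanned. The names iter_words and long_text are hypothetical, chosen for illustration, and the sketch assumes that splitting on runs of whitespace is what you want.

```python
import re
from itertools import islice

def iter_words(text):
    """Lazily yield whitespace-separated tokens (hypothetical helper)."""
    # re.finditer scans the string on demand, one match per iteration.
    for match in re.finditer(r'\S+', text):
        yield match.group(0)

# A long string of which we only want the first few tokens.
long_text = "alpha beta gamma " * 100000

# islice stops the generator after three tokens; the rest is never scanned.
first_three = list(islice(iter_words(long_text), 3))
print(first_three)  # ['alpha', 'beta', 'gamma']
```

Because the generator is consumed through islice, the regular expression engine stops as soon as the third token has been produced.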

Answered By: Duncan

Answer 2

Like S.Lott, I don’t quite know what you want. Here is code that may help:

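s = "This is a string."
# Iterating over a string yields one character at a time.
for character in s:
    print(character)
# str.split builds the full list of words before the loop starts.
for word in s.split(' '):
    print(word)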

There are also s.index() and s.find() for finding the next character.

Later: Okay, something like this.

>>> def tokenizer(s, c):
...     i = 0
...     while True:
...         try:
...             j = s.index(c, i)
...         except ValueError:
...             yield s[i:]
...             return
...         yield s[i:j]
...         i = j + 1
... 
>>> for w in tokenizer(s, ' '):
...     print(w)
... 
This
is
a
string.

Answered By: hughdbrown

Answer 3

You could use something like SPARK (which has been absorbed into the Python distribution itself, though it is not importable from the standard library), but it ultimately uses regular expressions as well, so Duncan’s answer would probably serve you just as well if the task is as simple as "splitting on whitespace".

The other, far more arduous option would be to write your own Python module in C to do it if you really wanted speed, but that’s a far larger time investment of course.

Answered By: Daniel DiPaolo

Answer 4

If you don’t need to consume the whole string, that’s because you are looking for something specific, right? Then just look for that, with re or .find() instead of splitting. That way you can find the part of the string you are interested in, and split that.
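
As a rough sketch of that idea, the code below uses str.find() to locate a marker and then slices out only the piece that follows it, without splitting the whole string. The helper field_after and the sample log_line are hypothetical, made up for illustration; they assume the interesting part is delimited by a known marker and a separator.

```python
def field_after(text, marker, sep=' '):
    """Return the token that follows marker, or None if marker is absent.

    Hypothetical helper: it scans only as far as needed, never splitting
    the whole string.
    """
    start = text.find(marker)
    if start == -1:
        return None
    start += len(marker)
    # Look for the separator that ends the token; fall back to end of string.
    end = text.find(sep, start)
    if end == -1:
        end = len(text)
    return text[start:end]

log_line = "ts=1700000000 level=warn user=alice msg=disk_full"
print(field_after(log_line, "user="))  # alice
```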

Answered By: Lennart Regebro

Answer 5

Look at itertools. It contains things like takewhile, islice and groupby that allow you to slice an iterable (and a string is iterable) into another iterable based on either indexes or a boolean condition of sorts.
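
One possible way to apply this, sketched under the assumption that you want to split on runs of whitespace: groupby can group consecutive characters by whether they are whitespace, and islice can cut the resulting stream short. The name iter_tokens is hypothetical. Note that this walks the string character by character in Python, so it is likely slower than a regex-based approach.

```python
from itertools import groupby, islice

def iter_tokens(text):
    """Lazily yield whitespace-separated tokens using itertools.groupby."""
    # groupby clusters consecutive characters that share the same key,
    # here whether the character is whitespace.
    for is_space, chars in groupby(text, key=str.isspace):
        if not is_space:
            yield ''.join(chars)

# islice takes only the first two tokens; the rest of the string is not visited.
first_two = list(islice(iter_tokens("one two   three four"), 2))
print(first_two)  # ['one', 'two']
```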

Answered By: izak

Answer 6

There is no built-in iterator-based analog of str.split. Depending on your needs, you could wrap the result in a list iterator, though str.split still builds the entire list first:

iterator = iter("abcdcba".split("b"))
iterator
# <list_iterator at 0x49159b0>
next(iterator)
# 'a'

However, the third-party library more_itertools likely offers what you want in more_itertools.split_at. See also this post for an example.

Answered By: pylang

Answer 7

Here’s an isplit function, which behaves much like split; you can turn off the regex syntax with the regex argument. It uses the re.finditer function and returns the strings in between the matches.

import re

def isplit(s, splitter=r'\s+', regex=True):
    if not regex:
        splitter = re.escape(splitter)

    start = 0

    for m in re.finditer(splitter, s):
        begin, end = m.span()
        if begin != start:
            yield s[start:begin]
        start = end

    if s[start:]:
        yield s[start:]


_examples = ['', 'a', 'a b', ' a  b c ', '\na\tb ']

def test_isplit():
    for example in _examples:
        assert list(isplit(example)) == example.split(), 'Wrong for {!r}: {} != {}'.format(
            example, list(isplit(example)), example.split()
        )

Answered By: Tomasz Gandor
