Splitting a string into an iterator
Question:
Does Python have a built-in (meaning in the standard library) way to split a string that produces an iterator rather than a list? I have in mind working on very long strings and not needing to consume most of the string.
Answers:
Not directly splitting strings as such, but the re module has re.finditer() (and a corresponding finditer() method on any compiled regular expression).
@Zero asked for an example:
>>> import re
>>> s = "The quick    brown\nfox"
>>> for m in re.finditer(r'\S+', s):
...     print(m.span(), m.group(0))
...
(0, 3) The
(4, 9) quick
(13, 18) brown
(19, 22) fox
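Since finditer() is lazy, pairing it with itertools.islice means the scan stops as soon as you have enough tokens. A rough sketch of that idea; iter_tokens is just a name made up for this illustration, not something from the standard library:

```python
import re
from itertools import islice

# Compile once; \S+ matches a run of non-whitespace characters.
_NON_SPACE = re.compile(r"\S+")

def iter_tokens(text):
    """Lazily yield whitespace-separated tokens from text."""
    for match in _NON_SPACE.finditer(text):
        yield match.group(0)

# A long string of which only the very beginning is ever scanned.
long_text = "alpha beta gamma " * 100_000
first_three = list(islice(iter_tokens(long_text), 3))
print(first_three)  # ['alpha', 'beta', 'gamma']
```

Only the tokens actually requested are matched; the rest of long_text is never examined.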
Like s.Lott, I don’t quite know what you want. Here is code that may help:
s = "This is a string."
for character in s:
    print(character)
for word in s.split(' '):
    print(word)
There are also s.index() and s.find() for finding the next character.
Later: Okay, something like this.
>>> def tokenizer(s, c):
...     i = 0
...     while True:
...         try:
...             j = s.index(c, i)
...         except ValueError:
...             # No more separators: yield the remainder and stop.
...             yield s[i:]
...             return
...         yield s[i:j]
...         i = j + 1  # assumes a single-character separator
...
>>> for w in tokenizer(s, ' '):
...     print(w)
...
This
is
a
string.
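If the separator can be longer than one character, the same idea works with str.find() and a step of len(sep). This is only a sketch under that assumption; isplit_literal is a made-up name, and it is meant to mirror str.split(sep), including the empty strings between adjacent separators:

```python
def isplit_literal(text, sep):
    """Lazily yield the pieces of text between occurrences of sep."""
    if not sep:
        # As this is a generator, the error is raised when iteration starts.
        raise ValueError("empty separator")
    start = 0
    while True:
        end = text.find(sep, start)
        if end == -1:
            # No separator left: the remainder is the last piece.
            yield text[start:]
            return
        yield text[start:end]
        start = end + len(sep)

pieces = isplit_literal("a--b----c", "--")
print(next(pieces))   # a
print(list(pieces))   # ['b', '', 'c']
```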
You could use something like SPARK (which has been absorbed into the Python distribution itself, though it is not importable from the standard library), but ultimately it uses regular expressions as well, so Duncan’s answer would probably serve you just as well if the task is as simple as splitting on whitespace.
The other, far more arduous option would be to write your own Python module in C to do it if you really wanted speed, but that’s a far larger time investment of course.
If you don’t need to consume the whole string, that’s because you are looking for something specific, right? Then just look for that, with re or .find(), instead of splitting. That way you can find the part of the string you are interested in, and split that.
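To make that concrete, here is a small sketch: locate a marker with .find(), cut out just the line that follows it, and split only that piece. The function name, the "coords=" marker and the sample data are all invented for the example:

```python
def fields_after(text, marker, sep=","):
    """Return the sep-separated fields that follow marker on its line, or None."""
    start = text.find(marker)
    if start == -1:
        return None
    start += len(marker)
    # The part of interest ends at the next newline (or at the end of the text).
    end = text.find("\n", start)
    if end == -1:
        end = len(text)
    return text[start:end].split(sep)

log = "noise\n" * 3 + "coords=1,2,3\n" + "more noise\n" * 3
print(fields_after(log, "coords="))  # ['1', '2', '3']
```

Only the short slice between the marker and the end of its line is ever split.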
Look at itertools. It contains things like takewhile, islice and groupby that allow you to slice an iterable (a string is iterable) into another iterable based on either indexes or a boolean condition of sorts.
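For instance, groupby can chop a string into runs of whitespace and non-whitespace characters, and islice can then take just the first few words. This is one possible sketch (iter_words is a made-up name); it walks the string one character at a time in Python, so expect it to be slower than a regex-based approach:

```python
from itertools import groupby, islice

def iter_words(text):
    """Lazily yield runs of non-whitespace characters from text."""
    # groupby keys each character by str.isspace, producing alternating runs.
    for is_space, run in groupby(text, key=str.isspace):
        if not is_space:
            yield "".join(run)

first_two = list(islice(iter_words("  The quick\tbrown\nfox  "), 2))
print(first_two)  # ['The', 'quick']
```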
There is no built-in iterator-based analog of str.split. Depending on your needs you could make a list iterator (note that this still builds the whole list first):
iterator = iter("abcdcba".split("b"))
iterator
# <list_iterator at 0x49159b0>
next(iterator)
# 'a'
However, a tool from the third-party library more_itertools likely offers what you want: more_itertools.split_at. See also this post for an example.
Here’s an isplit function, which behaves much like split; you can turn off the regex syntax with the regex argument. It uses the re.finditer function and returns the strings in between the matches.
import re

def isplit(s, splitter=r'\s+', regex=True):
    if not regex:
        # Treat the splitter as a literal string rather than a pattern.
        splitter = re.escape(splitter)
    start = 0
    for m in re.finditer(splitter, s):
        begin, end = m.span()
        # Skip the empty piece when a match starts where the previous one ended.
        if begin != start:
            yield s[start:begin]
        start = end
    # Yield whatever is left after the last match, if anything.
    if s[start:]:
        yield s[start:]

_examples = ['', 'a', 'a b', ' a b c ', '\na\tb ']

def test_isplit():
    for example in _examples:
        assert list(isplit(example)) == example.split(), 'Wrong for {!r}: {} != {}'.format(
            example, list(isplit(example)), example.split()
        )