Fast way to split alpha and numeric chars in a python string
Question:
I am trying to work out a simple function to capture typos, e.g:
"Westminister15"
"Westminister15London"
"23Westminister15London"
after fixating:
["Westminister", "15"]
["Westminister", "15", "London"]
["23", "Westminister", "15", "London"]
First attempt:
import re

def fixate(query):
    digit_pattern = re.compile(r'\D')  # split on non-digits -> digit runs remain
    alpha_pattern = re.compile(r'\d')  # split on digits -> alpha runs remain
    digits = list(filter(None, digit_pattern.split(query)))
    alphas = list(filter(None, alpha_pattern.split(query)))
    print(digits)
    print(alphas)
result:
fixate("Westminister15London")
> ['15']
> ['Westminister', 'London']
However, I think this could be done more effectively, and I still get bad results when I try something like:
fixate("Westminister15London England")
> ['15']
> ['Westminister', 'London England']
Obviously it should list London and England separately, but I feel my function will end up overly patched, and there's probably a simpler approach.
This question is somewhat equivalent to this PHP question.
Answers:
You can get the desired result with re.findall()
:
>>> re.findall(r"[^\W\d_]+|\d+", "23Westminister15London")
['23', 'Westminister', '15', 'London']
>>> re.findall(r"[^\W\d_]+|\d+", "Westminister15London England")
['Westminister', '15', 'London', 'England']
\d+ matches a run of digits, and [^\W\d_]+ matches a run of letters (a word character that is neither a digit nor an underscore).
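One reason for the [^\W\d_]+ idiom over a plain [A-Za-z]+ is that \w is Unicode-aware by default in Python 3, so the character class also matches non-ASCII letters. A small sketch with a hypothetical input:

```python
import re

# [^\W\d_] means: a word character (\w) that is neither a digit nor an
# underscore -- i.e. a letter, including non-ASCII letters like "ü".
print(re.findall(r"[^\W\d_]+|\d+", "Zürich42"))
# ['Zürich', '42']
```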
re.split() would also be possible in current Python versions, since splits on zero-length matches are now supported, but the resulting regex is much more complicated, so I still recommend the findall() approach.
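For illustration, such a zero-width split can be sketched with lookarounds that match the boundary between a digit and a non-digit (zero-length splits in re.split() require Python 3.7+). Note this sketch does not break on whitespace, which is part of why it gets complicated:

```python
import re

# Split at every boundary between a digit and a non-digit. The lookarounds
# are zero-width, so no characters are consumed by the split.
boundary = re.compile(r'(?<=\d)(?=\D)|(?<=\D)(?=\d)')

print(boundary.split("23Westminister15London"))
# ['23', 'Westminister', '15', 'London']
```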
You can use this regex instead of yours:
>>> import re
>>> regex = re.compile(r'(\d+|\s+)')
>>> regex.split('Westminister15')
['Westminister', '15', '']
>>> regex.split('Westminister15London England')
['Westminister', '15', 'London', ' ', 'England']
>>>
Then you have to filter the list removing empty strings/white-space only strings.
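For instance, a list comprehension can drop the empty strings and whitespace-only separators. A sketch (reusing the question's fixate name for the wrapper):

```python
import re

# Capture digit runs and whitespace runs as split delimiters.
regex = re.compile(r'(\d+|\s+)')

def fixate(query):
    # Drop empty strings and whitespace-only pieces from the split result.
    return [t for t in regex.split(query) if t and not t.isspace()]

print(fixate('Westminister15London England'))
# ['Westminister', '15', 'London', 'England']
```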
Here’s another approach in case you prefer to avoid regex, which can be unwieldy to write and modify if you aren’t familiar with it:
from itertools import groupby

def split_text(s):
    # Group consecutive characters by whether they are alphabetic.
    for k, g in groupby(s, str.isalpha):
        yield ''.join(g)

print(list(split_text("Westminister15")))
print(list(split_text("Westminister15London")))
print(list(split_text("23Westminister15London")))
print(list(split_text("Westminister15London England")))
returns:
['Westminister', '15']
['Westminister', '15', 'London']
['23', 'Westminister', '15', 'London']
['Westminister', '15', 'London', ' ', 'England']
The generator can be easily modified, too, to never yield whitespace strings if desired.
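For example, one such modification (a sketch) checks each joined token before yielding it:

```python
from itertools import groupby

def split_text(s):
    # Group consecutive characters by whether they are alphabetic,
    # then skip groups that consist purely of whitespace.
    for _, group in groupby(s, str.isalpha):
        token = ''.join(group)
        if not token.isspace():
            yield token

print(list(split_text("Westminister15London England")))
# ['Westminister', '15', 'London', 'England']
```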