What is the most efficient way in Python to convert a string to all lowercase stripping out all non-ascii alpha characters?
Question:
I have a simple task I need to perform in Python, which is to convert a string to all lowercase and strip out all non-ascii non-alpha characters.
For example:
"This is a Test" -> "thisisatest"
"A235th@#$&( er Ra{}|?>ndom" -> "atherrandom"
I have a simple function to do this:
import string
import sys
def strip_string_to_lowercase(s):
    tmpStr = s.lower().strip()
    retStrList = []
    for x in tmpStr:
        if x in string.ascii_lowercase:
            retStrList.append(x)
    return ''.join(retStrList)
But I cannot help thinking there is a more efficient, or more elegant, way.
Thanks!
Edit:
Thanks to all those that answered. I learned, and in some cases re-learned, a good deal of python.
Answers:
I would:
- lowercase the string
- replace all [^a-z] with ""
Like that:
import re
# Factory: call it once to get the actual stripping function.
def strip_string_to_lowercase():
    nonascii = re.compile('[^a-z]')
    return lambda s: nonascii.sub('', s.lower().strip())
EDIT: It turns out that the original version (below) is really slow, though some performance can be gained by converting it into a closure (above).
def strip_string_to_lowercase(s):
    return re.sub('[^a-z]', '', s.lower().strip())
My performance measurements with 100,000 iterations against the string "A235th@#$&( er Ra{}|?>ndom" revealed that:
- f_re_0 took 2672.000 ms (this is the original version of this answer)
- f_re_1 took 2109.000 ms (this is the closure version shown above)
- f_re_2 took 2031.000 ms (the closure version, without the redundant strip())
- f_fl_1 took 1953.000 ms (unwind’s filter/lambda version)
- f_fl_2 took 1485.000 ms (Coady’s filter version)
- f_jn_1 took 1860.000 ms (Dana’s join version)
For the sake of the test, I did not print the results.
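The closure exists only so that the pattern is compiled once. A minimal sketch of the same idea with the compiled pattern held at module level, assuming Python 3; the names `_NON_LOWER` and `strip_to_lowercase` are illustrative and do not come from the answer:

```python
import re

# Compiled once at import time, so each call skips the cache lookup
# that re.sub() performs when it is handed a pattern string.
_NON_LOWER = re.compile(r'[^a-z]')

def strip_to_lowercase(s):
    """Lowercase s, then delete everything outside a-z."""
    return _NON_LOWER.sub('', s.lower())

print(strip_to_lowercase("A235th@#$&( er Ra{}|?>ndom"))  # atherrandom
```

The strip() call is left out because the substitution already removes whitespace.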
Not especially runtime efficient, but certainly nicer on poor, tired coder eyes:
import string
def strip_string_and_lowercase(s):
    return ''.join(c for c in s.lower() if c in string.ascii_lowercase)
>>> import string
>>> a = "O235th@#$&( er Ra{}|?&lt;ndom"
>>> ''.join(i for i in a.lower() if i in string.ascii_lowercase)
'otheraltndom'
This does essentially the same as your code.
This is a typical application of a list comprehension (strictly, a generator expression here):
import string
s = "O235th@#$&( er Ra{}|?&lt;ndom"
print ''.join(c for c in s.lower() if c in string.ascii_lowercase)
It won’t filter out the letters of “&lt;” (an HTML entity), as in your example, but I assume that was an accidental cut-and-paste problem.
Personally I would use a regular expression and then convert the final string to lower case. I have no idea how to write it in Python, but the basic idea is to:
- Remove characters in the string that don’t match the case-insensitive regex “\w”
- Convert the string to lower-case
or vice versa.
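A hedged sketch of how that idea might look in Python 3. A word-character class such as `\w` would also keep digits and underscores, so this sketch swaps in the explicit class `[^a-z]` with the case-insensitive flag; `_NON_ALPHA` and `remove_then_lower` are hypothetical names:

```python
import re

# re.ASCII keeps the case-insensitive class to the 52 ASCII letters;
# without it, a few non-ASCII letters such as U+212A (Kelvin sign) also match.
_NON_ALPHA = re.compile(r'[^a-z]', re.IGNORECASE | re.ASCII)

def remove_then_lower(s):
    """Delete non-letters first, then lowercase what is left."""
    return _NON_ALPHA.sub('', s).lower()

print(remove_then_lower("A235th@#$&( er Ra{}|?>ndom"))  # atherrandom
```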
Similar to @Dana’s, but I think this sounds like a filtering job, and that should be visible in the code. Also without the need to explicitly call join():
def strip_string_to_lowercase(s):
    return filter(lambda x: x in string.ascii_lowercase, s.lower())
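On Python 3, `filter()` returns a lazy iterator rather than a string, so the result has to be joined explicitly after all. A minimal sketch under that assumption; the constant name `ASCII_LETTERS` is illustrative:

```python
import string

ASCII_LETTERS = frozenset(string.ascii_letters)

def strip_string_to_lowercase(s):
    # Filter on the original text, then lowercase the survivors.
    return ''.join(filter(ASCII_LETTERS.__contains__, s)).lower()

print(strip_string_to_lowercase("This is a Test"))  # thisisatest
```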
>>> filter(str.isalpha, "This is a Test").lower()
'thisisatest'
>>> filter(str.isalpha, "A235th@#$&( er Ra{}|?>ndom").lower()
'atherrandom'
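One caveat if this is ported to Python 3: `str.isalpha` there is true for any Unicode letter, so a character such as `é` would survive the filter. A sketch that restricts the test to ASCII, assuming Python 3.7 or later for `str.isascii`; `ascii_alpha_lower` is a hypothetical name:

```python
def ascii_alpha_lower(s):
    """Keep only ASCII letters, then lowercase them."""
    return ''.join(c for c in s if c.isascii() and c.isalpha()).lower()

print(ascii_alpha_lower("435 café"))  # caf
```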
Another solution (not that pythonic, but very fast) is to use string.translate – though note that this will not work for unicode. It’s also worth noting that you can speed up Dana’s code by moving the characters into a set (which looks up by hash, rather than performing a linear search each time). Here are the timings I get for various of the solutions given:
import string, re, timeit
# Precomputed values (for str_join_set and translate)
letter_set = frozenset(string.ascii_lowercase + string.ascii_uppercase)
tab = string.maketrans(string.ascii_lowercase + string.ascii_uppercase,
                       string.ascii_lowercase * 2)
deletions = ''.join(ch for ch in map(chr,range(256)) if ch not in letter_set)
s="A235th@#$&( er Ra{}|?>ndom"
# From unwind's filter approach
def test_filter(s):
    return filter(lambda x: x in string.ascii_lowercase, s.lower())
# using set instead (and contains)
def test_filter_set(s):
    return filter(letter_set.__contains__, s).lower()
# Tomalak's solution
def test_regex(s):
    return re.sub('[^a-z]', '', s.lower())
# Dana's
def test_str_join(s):
    return ''.join(c for c in s.lower() if c in string.ascii_lowercase)
# Modified to use a set.
def test_str_join_set(s):
    return ''.join(c for c in s.lower() if c in letter_set)
# Translate approach.
def test_translate(s):
    return string.translate(s, tab, deletions)
for test in sorted(globals()):
    if test.startswith("test_"):
        assert globals()[test](s)=='atherrandom'
        print "%30s : %s" % (test, timeit.Timer("f(s)",
              "from __main__ import %s as f, s" % test).timeit(200000))
This gives me:
test_filter : 2.57138351271
test_filter_set : 0.981806765698
test_regex : 3.10069885233
test_str_join : 2.87172979743
test_str_join_set : 2.43197956381
test_translate : 0.335367566218
[Edit] Updated with filter solutions as well. (Note that using set.__contains__ makes a big difference here, as it avoids making an extra function call for the lambda.)
I added the filter solutions to Brian’s code:
import string, re, timeit
# Precomputed values (for str_join_set and translate)
letter_set = frozenset(string.ascii_lowercase + string.ascii_uppercase)
tab = string.maketrans(string.ascii_lowercase + string.ascii_uppercase,
                       string.ascii_lowercase * 2)
deletions = ''.join(ch for ch in map(chr,range(256)) if ch not in letter_set)
s="A235th@#$&( er Ra{}|?>ndom"
def test_original(s):
    tmpStr = s.lower().strip()
    retStrList = []
    for x in tmpStr:
        if x in string.ascii_lowercase:
            retStrList.append(x)
    return ''.join(retStrList)
def test_regex(s):
    return re.sub('[^a-z]', '', s.lower())
def test_regex_closure(s):
    # Note: re.compile() runs on every call here, so the closure saves nothing.
    nonascii = re.compile('[^a-z]')
    def replacer(s):
        return nonascii.sub('', s.lower().strip())
    return replacer(s)
def test_str_join(s):
    return ''.join(c for c in s.lower() if c in string.ascii_lowercase)
def test_str_join_set(s):
    return ''.join(c for c in s.lower() if c in letter_set)
def test_filter_set(s):
    return filter(letter_set.__contains__, s.lower())
def test_filter_isalpha(s):
    return filter(str.isalpha, s).lower()
def test_filter_lambda(s):
    return filter(lambda x: x in string.ascii_lowercase, s.lower())
def test_translate(s):
    return string.translate(s, tab, deletions)
for test in sorted(globals()):
    if test.startswith("test_"):
        print "%30s : %s" % (test, timeit.Timer("f(s)",
              "from __main__ import %s as f, s" % test).timeit(200000))
This gives me:
test_filter_isalpha : 1.31981746283
test_filter_lambda : 2.23935583992
test_filter_set : 0.76511679557
test_original : 2.13079176264
test_regex : 2.44295629752
test_regex_closure : 2.65205913042
test_str_join : 2.25571266739
test_str_join_set : 1.75565888961
test_translate : 0.269259640541
It appears that isalpha is using a similar algorithm, at least in terms of O(), to the set algorithm.
Edit:
Added the filter set, and renamed the filter functions to be a little more clear.
Python 2.x translate method
Convert to lowercase and filter non-ascii non-alpha characters:
from string import ascii_letters, ascii_lowercase, maketrans
table = maketrans(ascii_letters, ascii_lowercase*2)
deletechars = ''.join(set(maketrans('','')) - set(ascii_letters))
print "A235th@#$&( er Ra{}|?>ndom".translate(table, deletechars)
# -> 'atherrandom'
Python 3 translate method
Filter non-ascii:
ascii_bytes = "A235th@#$&(٠٫٢٥ er Ra{}|?>ndom".encode('ascii', 'ignore')
Use bytes.translate() to convert to lowercase and delete non-alpha bytes:
from string import ascii_letters, ascii_lowercase
alpha, lower = [s.encode('ascii') for s in [ascii_letters, ascii_lowercase]]
table = bytes.maketrans(alpha, lower*2) # convert to lowercase
deletebytes = bytes(set(range(256)) - set(alpha)) # delete nonalpha
print(ascii_bytes.translate(table, deletebytes))
# -> b'atherrandom'
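If a str result is wanted instead of bytes, `str.translate` can be used directly. It accepts any mapping from code points to replacements, and a value of None deletes the character. A minimal sketch, assuming Python 3; the dict subclass `DeleteUnmapped` and the other names are hypothetical and not part of the answer:

```python
import string

class DeleteUnmapped(dict):
    """Translation table that deletes every code point it has no entry for."""
    def __missing__(self, key):
        return None  # None tells str.translate to drop the character

# Map each ASCII letter to its lowercase form; everything else is unmapped.
TABLE = DeleteUnmapped((ord(c), c.lower()) for c in string.ascii_letters)

def strip_to_lowercase(s):
    return s.translate(TABLE)

print(strip_to_lowercase("A235th@#$&(٠٫٢٥ er Ra{}|?>ndom"))  # atherrandom
```

Because unmapped characters are handled by `__missing__`, the table does not need an entry for every possible code point.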
Python 2.x:
import string
valid_chars= string.ascii_lowercase + string.ascii_uppercase
def only_lower_ascii_alpha(text):
    return filter(valid_chars.__contains__, text).lower()
Works with either str or unicode arguments.
>>> only_lower_ascii_alpha("Hello there 123456!")
'hellothere'
>>> only_lower_ascii_alpha(u"435 café")
u'caf'