Stripping everything but alphanumeric chars from a string in Python
Question:
What is the best way to strip all non alphanumeric characters from a string, using Python?
The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don’t seem very ‘pythonic’ to me.
For the record, I don’t just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.
Answers:
Regular expressions to the rescue:
import re
re.sub(r'W+', '', your_string)
By Python definition 'W
== [^a-zA-Z0-9_]
, which excludes all numbers
, letters
and _
How about:
def ExtractAlphanumeric(InputString):
from string import ascii_letters, digits
return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])
This works by using list comprehension to produce a list of the characters in InputString
if they are present in the combined ascii_letters
and digits
strings. It then joins the list together into a string.
>>> import re
>>> string = "[email protected]£$%[};'""
>>> pattern = re.compile('W')
>>> string = re.sub(pattern, '', string)
>>> print string
Kl13
You could try:
print ''.join(ch for ch in some_string if ch.isalnum())
I just timed some functions out of curiosity. In these tests I’m removing non-alphanumeric characters from the string string.printable
(part of the built-in string
module). The use of compiled '[W_]+'
and pattern.sub('', str)
was found to be fastest.
$ python -m timeit -s
"import string"
"''.join(ch for ch in string.printable if ch.isalnum())"
10000 loops, best of 3: 57.6 usec per loop
$ python -m timeit -s
"import string"
"filter(str.isalnum, string.printable)"
10000 loops, best of 3: 37.9 usec per loop
$ python -m timeit -s
"import re, string"
"re.sub('[W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop
$ python -m timeit -s
"import re, string"
"re.sub('[W_]+', '', string.printable)"
100000 loops, best of 3: 15 usec per loop
$ python -m timeit -s
"import re, string; pattern = re.compile('[W_]+')"
"pattern.sub('', string.printable)"
100000 loops, best of 3: 11.2 usec per loop
Use the str.translate()
method.
Presuming you will be doing this often:
-
Once, create a string containing all the characters you wish to delete:
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())
-
Whenever you want to scrunch a string:
scrunched = s.translate(None, delchars)
The setup cost probably compares favourably with re.compile
; the marginal cost is way lower:
C:junk>python26python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop
C:junk>python26python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop
Note: Using string.printable
as benchmark data gives the pattern '[W_]+'
an unfair advantage; all the non-alphanumeric characters are in one bunch … in typical data there would be more than one substitution to do:
C:junk>python26python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;=>[email protected][\]^_`{|}~ tnrx0bx0c'
Here’s what happens if you give re.sub
a bit more work to do:
C:junk>python26python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop
C:junk>python26python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop
As a spin off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string’s content to. In this case, I’m allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS
as suits your use case.
PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-"
someString = "".join(c for c in someString if c in PERMITTED_CHARS)
for char in my_string:
if not char.isalnum():
my_string = my_string.replace(char,"")
sent = "".join(e for e in sent if e.isalpha())
Timing with random strings of ASCII printables:
from inspect import getsource
from random import sample
import re
from string import printable
from timeit import timeit
pattern_single = re.compile(r'[W]')
pattern_repeat = re.compile(r'[W]+')
translation_tb = str.maketrans('', '', ''.join(c for c in map(chr, range(256)) if not c.isalnum()))
def generate_test_string(length):
return ''.join(sample(printable, length))
def main():
for i in range(0, 60, 10):
for test in [
lambda: ''.join(c for c in generate_test_string(i) if c.isalnum()),
lambda: ''.join(filter(str.isalnum, generate_test_string(i))),
lambda: re.sub(r'[W]', '', generate_test_string(i)),
lambda: re.sub(r'[W]+', '', generate_test_string(i)),
lambda: pattern_single.sub('', generate_test_string(i)),
lambda: pattern_repeat.sub('', generate_test_string(i)),
lambda: generate_test_string(i).translate(translation_tb),
]:
print(timeit(test), i, getsource(test).lstrip(' lambda: ').rstrip(',n'), sep='t')
if __name__ == '__main__':
main()
Result (Python 3.7):
Time Length Code
6.3716264850008880 00 ''.join(c for c in generate_test_string(i) if c.isalnum())
5.7285426190064750 00 ''.join(filter(str.isalnum, generate_test_string(i)))
8.1875841680011940 00 re.sub(r'[W]', '', generate_test_string(i))
8.0002205439959650 00 re.sub(r'[W]+', '', generate_test_string(i))
5.5290945199958510 00 pattern_single.sub('', generate_test_string(i))
5.4417179649972240 00 pattern_repeat.sub('', generate_test_string(i))
4.6772285089973590 00 generate_test_string(i).translate(translation_tb)
23.574712151996210 10 ''.join(c for c in generate_test_string(i) if c.isalnum())
22.829975890002970 10 ''.join(filter(str.isalnum, generate_test_string(i)))
27.210196289997840 10 re.sub(r'[W]', '', generate_test_string(i))
27.203713296003116 10 re.sub(r'[W]+', '', generate_test_string(i))
24.008979928999906 10 pattern_single.sub('', generate_test_string(i))
23.945240008994006 10 pattern_repeat.sub('', generate_test_string(i))
21.830899796994345 10 generate_test_string(i).translate(translation_tb)
38.731336012999236 20 ''.join(c for c in generate_test_string(i) if c.isalnum())
37.942474347000825 20 ''.join(filter(str.isalnum, generate_test_string(i)))
42.169366310001350 20 re.sub(r'[W]', '', generate_test_string(i))
41.933375883003464 20 re.sub(r'[W]+', '', generate_test_string(i))
38.899814646996674 20 pattern_single.sub('', generate_test_string(i))
38.636144253003295 20 pattern_repeat.sub('', generate_test_string(i))
36.201238164998360 20 generate_test_string(i).translate(translation_tb)
49.377356811004574 30 ''.join(c for c in generate_test_string(i) if c.isalnum())
48.408927293996385 30 ''.join(filter(str.isalnum, generate_test_string(i)))
53.901889764994850 30 re.sub(r'[W]', '', generate_test_string(i))
52.130339455994545 30 re.sub(r'[W]+', '', generate_test_string(i))
50.061149017004940 30 pattern_single.sub('', generate_test_string(i))
49.366573111998150 30 pattern_repeat.sub('', generate_test_string(i))
46.649754120997386 30 generate_test_string(i).translate(translation_tb)
63.107938601999194 40 ''.join(c for c in generate_test_string(i) if c.isalnum())
65.116287978999030 40 ''.join(filter(str.isalnum, generate_test_string(i)))
71.477421126997800 40 re.sub(r'[W]', '', generate_test_string(i))
66.027950693998720 40 re.sub(r'[W]+', '', generate_test_string(i))
63.315361931003280 40 pattern_single.sub('', generate_test_string(i))
62.342320287003530 40 pattern_repeat.sub('', generate_test_string(i))
58.249303059004890 40 generate_test_string(i).translate(translation_tb)
73.810345625002810 50 ''.join(c for c in generate_test_string(i) if c.isalnum())
72.593953348005020 50 ''.join(filter(str.isalnum, generate_test_string(i)))
76.048324580995540 50 re.sub(r'[W]', '', generate_test_string(i))
75.106637657001560 50 re.sub(r'[W]+', '', generate_test_string(i))
74.681338128997600 50 pattern_single.sub('', generate_test_string(i))
72.430461594005460 50 pattern_repeat.sub('', generate_test_string(i))
69.394243567003290 50 generate_test_string(i).translate(translation_tb)
str.maketrans
& str.translate
is fastest, but includes all non-ASCII characters.
re.compile
& pattern.sub
is slower, but is somehow faster than ''.join
& filter
.
If i understood correctly the easiest way is to use regular expression as it provides you lots of flexibility but the other simple method is to use for loop following is the code with example I also counted the occurrence of word and stored in dictionary..
s = """An... essay is, generally, a piece of writing that gives the author's own
argument — but the definition is vague,
overlapping with those of a paper, an article, a pamphlet, and a short story. Essays
have traditionally been
sub-classified as formal and informal. Formal essays are characterized by "serious
purpose, dignity, logical
organization, length," whereas the informal essay is characterized by "the personal
element (self-revelation,
individual tastes and experiences, confidential manner), humor, graceful style,
rambling structure, unconventionality
or novelty of theme," etc.[1]"""
d = {} # creating empty dic
words = s.split() # spliting string and stroing in list
for word in words:
new_word = ''
for c in word:
if c.isalnum(): # checking if indiviual chr is alphanumeric or not
new_word = new_word + c
print(new_word, end=' ')
# if new_word not in d:
# d[new_word] = 1
# else:
# d[new_word] = d[new_word] +1
print(d)
please rate this if this answer is useful!
For a simple one-liner (Python 3.0):
''.join(filter( lambda x: x in '0123456789abcdefghijklmnopqrstuvwxyz', the_string_you_want_stripped ))
For Python < 3.0:
filter( lambda x: x in '0123456789abcdefghijklmnopqrstuvwxyz', the_string_you_want_stripped )
Note: you could add other characters to the allowed characters list if desired (e.g. ‘0123456789abcdefghijklmnopqrstuvwxyz.,_’).
Python 3
Uses the same method as @John Machin’s answer but updated for Python 3:
- larger character set
- slight changes to how
translate
works.
Python code is now assumed to be encoded in UTF-8
(source: PEP 3120)
This means the string containing all the characters you wish to delete gets much larger:
del_chars = ''.join(c for c in map(chr, range(1114111)) if not c.isalnum())
And the translate
method now needs to consume a translation table which we can create with maketrans()
:
del_map = str.maketrans('', '', del_chars)
Now, as before, any string s
you want to "scrunch":
scrunched = s.translate(del_map)
Using the last timing example from @Joe Machin, we can see it still beats re
by an order of magnitude:
> python -mtimeit -s"d=''.join(c for c in map(chr,range(1114111)) if not c.isalnum());m=str.maketrans('','',d);s='foo-'*25" "s.translate(m)"
1000000 loops, best of 5: 255 nsec per loop
> python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[W_]+')" "r.sub('',s)"
50000 loops, best of 5: 4.8 usec per loop
A simple solution because all answers here are complicated
filtered = ''
for c in unfiltered:
if str.isalnum(c):
filtered += c
print(filtered)
I checked the results with perfplot (a project of mine) and found that for short strings,
"".join(filter(str.isalnum, s))
is fastest. For long strings (200+ chars)
re.sub("[W_]", "", s)
is fastest.
Code to reproduce the plot:
import perfplot
import random
import re
import string
pattern = re.compile("[W_]+")
def setup(n):
return "".join(random.choices(string.ascii_letters + string.digits, k=n))
def string_alphanum(s):
return "".join(ch for ch in s if ch.isalnum())
def filter_str(s):
return "".join(filter(str.isalnum, s))
def re_sub1(s):
return re.sub("[W_]", "", s)
def re_sub2(s):
return re.sub("[W_]+", "", s)
def re_sub3(s):
return pattern.sub("", s)
b = perfplot.bench(
setup=setup,
kernels=[string_alphanum, filter_str, re_sub1, re_sub2, re_sub3],
n_range=[2**k for k in range(10)],
)
b.save("out.png")
b.show()
If you’d like to preserve characters like áéíóúãẽĩõũ for example, use this:
import re
re.sub('[Wd_]+', '', your_string)
What is the best way to strip all non alphanumeric characters from a string, using Python?
The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don’t seem very ‘pythonic’ to me.
For the record, I don’t just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.
Regular expressions to the rescue:
import re
re.sub(r'W+', '', your_string)
By Python definition
'W
==[^a-zA-Z0-9_]
, which excludes allnumbers
,letters
and_
How about:
def ExtractAlphanumeric(InputString):
from string import ascii_letters, digits
return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])
This works by using list comprehension to produce a list of the characters in InputString
if they are present in the combined ascii_letters
and digits
strings. It then joins the list together into a string.
>>> import re
>>> string = "[email protected]£$%[};'""
>>> pattern = re.compile('W')
>>> string = re.sub(pattern, '', string)
>>> print string
Kl13
You could try:
print ''.join(ch for ch in some_string if ch.isalnum())
I just timed some functions out of curiosity. In these tests I’m removing non-alphanumeric characters from the string string.printable
(part of the built-in string
module). The use of compiled '[W_]+'
and pattern.sub('', str)
was found to be fastest.
$ python -m timeit -s
"import string"
"''.join(ch for ch in string.printable if ch.isalnum())"
10000 loops, best of 3: 57.6 usec per loop
$ python -m timeit -s
"import string"
"filter(str.isalnum, string.printable)"
10000 loops, best of 3: 37.9 usec per loop
$ python -m timeit -s
"import re, string"
"re.sub('[W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop
$ python -m timeit -s
"import re, string"
"re.sub('[W_]+', '', string.printable)"
100000 loops, best of 3: 15 usec per loop
$ python -m timeit -s
"import re, string; pattern = re.compile('[W_]+')"
"pattern.sub('', string.printable)"
100000 loops, best of 3: 11.2 usec per loop
Use the str.translate()
method.
Presuming you will be doing this often:
-
Once, create a string containing all the characters you wish to delete:
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())
-
Whenever you want to scrunch a string:
scrunched = s.translate(None, delchars)
The setup cost probably compares favourably with re.compile
; the marginal cost is way lower:
C:junk>python26python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop
C:junk>python26python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop
Note: Using string.printable
as benchmark data gives the pattern '[W_]+'
an unfair advantage; all the non-alphanumeric characters are in one bunch … in typical data there would be more than one substitution to do:
C:junk>python26python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;=>[email protected][\]^_`{|}~ tnrx0bx0c'
Here’s what happens if you give re.sub
a bit more work to do:
C:junk>python26python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop
C:junk>python26python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop
As a spin off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string’s content to. In this case, I’m allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS
as suits your use case.
PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-"
someString = "".join(c for c in someString if c in PERMITTED_CHARS)
for char in my_string:
if not char.isalnum():
my_string = my_string.replace(char,"")
sent = "".join(e for e in sent if e.isalpha())
Timing with random strings of ASCII printables:
from inspect import getsource
from random import sample
import re
from string import printable
from timeit import timeit
pattern_single = re.compile(r'[W]')
pattern_repeat = re.compile(r'[W]+')
translation_tb = str.maketrans('', '', ''.join(c for c in map(chr, range(256)) if not c.isalnum()))
def generate_test_string(length):
return ''.join(sample(printable, length))
def main():
for i in range(0, 60, 10):
for test in [
lambda: ''.join(c for c in generate_test_string(i) if c.isalnum()),
lambda: ''.join(filter(str.isalnum, generate_test_string(i))),
lambda: re.sub(r'[W]', '', generate_test_string(i)),
lambda: re.sub(r'[W]+', '', generate_test_string(i)),
lambda: pattern_single.sub('', generate_test_string(i)),
lambda: pattern_repeat.sub('', generate_test_string(i)),
lambda: generate_test_string(i).translate(translation_tb),
]:
print(timeit(test), i, getsource(test).lstrip(' lambda: ').rstrip(',n'), sep='t')
if __name__ == '__main__':
main()
Result (Python 3.7):
Time Length Code
6.3716264850008880 00 ''.join(c for c in generate_test_string(i) if c.isalnum())
5.7285426190064750 00 ''.join(filter(str.isalnum, generate_test_string(i)))
8.1875841680011940 00 re.sub(r'[W]', '', generate_test_string(i))
8.0002205439959650 00 re.sub(r'[W]+', '', generate_test_string(i))
5.5290945199958510 00 pattern_single.sub('', generate_test_string(i))
5.4417179649972240 00 pattern_repeat.sub('', generate_test_string(i))
4.6772285089973590 00 generate_test_string(i).translate(translation_tb)
23.574712151996210 10 ''.join(c for c in generate_test_string(i) if c.isalnum())
22.829975890002970 10 ''.join(filter(str.isalnum, generate_test_string(i)))
27.210196289997840 10 re.sub(r'[W]', '', generate_test_string(i))
27.203713296003116 10 re.sub(r'[W]+', '', generate_test_string(i))
24.008979928999906 10 pattern_single.sub('', generate_test_string(i))
23.945240008994006 10 pattern_repeat.sub('', generate_test_string(i))
21.830899796994345 10 generate_test_string(i).translate(translation_tb)
38.731336012999236 20 ''.join(c for c in generate_test_string(i) if c.isalnum())
37.942474347000825 20 ''.join(filter(str.isalnum, generate_test_string(i)))
42.169366310001350 20 re.sub(r'[W]', '', generate_test_string(i))
41.933375883003464 20 re.sub(r'[W]+', '', generate_test_string(i))
38.899814646996674 20 pattern_single.sub('', generate_test_string(i))
38.636144253003295 20 pattern_repeat.sub('', generate_test_string(i))
36.201238164998360 20 generate_test_string(i).translate(translation_tb)
49.377356811004574 30 ''.join(c for c in generate_test_string(i) if c.isalnum())
48.408927293996385 30 ''.join(filter(str.isalnum, generate_test_string(i)))
53.901889764994850 30 re.sub(r'[W]', '', generate_test_string(i))
52.130339455994545 30 re.sub(r'[W]+', '', generate_test_string(i))
50.061149017004940 30 pattern_single.sub('', generate_test_string(i))
49.366573111998150 30 pattern_repeat.sub('', generate_test_string(i))
46.649754120997386 30 generate_test_string(i).translate(translation_tb)
63.107938601999194 40 ''.join(c for c in generate_test_string(i) if c.isalnum())
65.116287978999030 40 ''.join(filter(str.isalnum, generate_test_string(i)))
71.477421126997800 40 re.sub(r'[W]', '', generate_test_string(i))
66.027950693998720 40 re.sub(r'[W]+', '', generate_test_string(i))
63.315361931003280 40 pattern_single.sub('', generate_test_string(i))
62.342320287003530 40 pattern_repeat.sub('', generate_test_string(i))
58.249303059004890 40 generate_test_string(i).translate(translation_tb)
73.810345625002810 50 ''.join(c for c in generate_test_string(i) if c.isalnum())
72.593953348005020 50 ''.join(filter(str.isalnum, generate_test_string(i)))
76.048324580995540 50 re.sub(r'[W]', '', generate_test_string(i))
75.106637657001560 50 re.sub(r'[W]+', '', generate_test_string(i))
74.681338128997600 50 pattern_single.sub('', generate_test_string(i))
72.430461594005460 50 pattern_repeat.sub('', generate_test_string(i))
69.394243567003290 50 generate_test_string(i).translate(translation_tb)
str.maketrans
& str.translate
is fastest, but includes all non-ASCII characters.
re.compile
& pattern.sub
is slower, but is somehow faster than ''.join
& filter
.
If i understood correctly the easiest way is to use regular expression as it provides you lots of flexibility but the other simple method is to use for loop following is the code with example I also counted the occurrence of word and stored in dictionary..
s = """An... essay is, generally, a piece of writing that gives the author's own
argument — but the definition is vague,
overlapping with those of a paper, an article, a pamphlet, and a short story. Essays
have traditionally been
sub-classified as formal and informal. Formal essays are characterized by "serious
purpose, dignity, logical
organization, length," whereas the informal essay is characterized by "the personal
element (self-revelation,
individual tastes and experiences, confidential manner), humor, graceful style,
rambling structure, unconventionality
or novelty of theme," etc.[1]"""
d = {} # creating empty dic
words = s.split() # spliting string and stroing in list
for word in words:
new_word = ''
for c in word:
if c.isalnum(): # checking if indiviual chr is alphanumeric or not
new_word = new_word + c
print(new_word, end=' ')
# if new_word not in d:
# d[new_word] = 1
# else:
# d[new_word] = d[new_word] +1
print(d)
please rate this if this answer is useful!
For a simple one-liner (Python 3.0):
''.join(filter( lambda x: x in '0123456789abcdefghijklmnopqrstuvwxyz', the_string_you_want_stripped ))
For Python < 3.0:
filter( lambda x: x in '0123456789abcdefghijklmnopqrstuvwxyz', the_string_you_want_stripped )
Note: you could add other characters to the allowed characters list if desired (e.g. ‘0123456789abcdefghijklmnopqrstuvwxyz.,_’).
Python 3
Uses the same method as @John Machin’s answer but updated for Python 3:
- larger character set
- slight changes to how
translate
works.
Python code is now assumed to be encoded in UTF-8
(source: PEP 3120)
This means the string containing all the characters you wish to delete gets much larger:
del_chars = ''.join(c for c in map(chr, range(1114111)) if not c.isalnum())
And the translate
method now needs to consume a translation table which we can create with maketrans()
:
del_map = str.maketrans('', '', del_chars)
Now, as before, any string s
you want to "scrunch":
scrunched = s.translate(del_map)
Using the last timing example from @Joe Machin, we can see it still beats re
by an order of magnitude:
> python -mtimeit -s"d=''.join(c for c in map(chr,range(1114111)) if not c.isalnum());m=str.maketrans('','',d);s='foo-'*25" "s.translate(m)"
1000000 loops, best of 5: 255 nsec per loop
> python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[W_]+')" "r.sub('',s)"
50000 loops, best of 5: 4.8 usec per loop
A simple solution because all answers here are complicated
filtered = ''
for c in unfiltered:
if str.isalnum(c):
filtered += c
print(filtered)
I checked the results with perfplot (a project of mine) and found that for short strings,
"".join(filter(str.isalnum, s))
is fastest. For long strings (200+ chars)
re.sub("[W_]", "", s)
is fastest.
Code to reproduce the plot:
import perfplot
import random
import re
import string
pattern = re.compile("[W_]+")
def setup(n):
return "".join(random.choices(string.ascii_letters + string.digits, k=n))
def string_alphanum(s):
return "".join(ch for ch in s if ch.isalnum())
def filter_str(s):
return "".join(filter(str.isalnum, s))
def re_sub1(s):
return re.sub("[W_]", "", s)
def re_sub2(s):
return re.sub("[W_]+", "", s)
def re_sub3(s):
return pattern.sub("", s)
b = perfplot.bench(
setup=setup,
kernels=[string_alphanum, filter_str, re_sub1, re_sub2, re_sub3],
n_range=[2**k for k in range(10)],
)
b.save("out.png")
b.show()
If you’d like to preserve characters like áéíóúãẽĩõũ for example, use this:
import re
re.sub('[Wd_]+', '', your_string)