Return string with first match for a regex, handling case where there is no match
Question:
I want to get the first match of a regex.
In the following case, I have a list:
import re
text = 'aa33bbb44'
re.findall(r'\d+', text)
# ['33', '44']
I could extract the first element of the list:
text = 'aa33bbb44'
re.findall(r'\d+', text)[0]
# '33'
But that only works if there is at least one match; otherwise I’ll get an IndexError:
text = 'aazzzbbb'
re.findall(r'\d+', text)[0]
# IndexError: list index out of range
In which case I could define a function:
def return_first_match(text):
    try:
        result = re.findall(r'\d+', text)[0]
    except IndexError:  # no match: findall returned an empty list
        result = ''
    return result
Is there a way of obtaining that result without defining a new function?
Answers:
You can do:
x = re.findall(r'\d+', text)
result = x[0] if len(x) > 0 else ''
Note that your question isn’t really about regex. Rather, it is about how to safely get the first element of a list that may be empty.
This may perform a bit better when a large share of the input contains no match, because handling a raised exception costs more than a conditional check.
def return_first_match(text):
    result = re.findall(r'\d+', text)
    result = result[0] if result else ""
    return result
You could embed the '' default in your regex by adding |$:
>>> re.findall(r'\d+|$', 'aa33bbb44')[0]
'33'
>>> re.findall(r'\d+|$', 'aazzzbbb')[0]
''
>>> re.findall(r'\d+|$', '')[0]
''
This also works with re.search, as pointed out by others:
>>> re.search(r'\d+|$', 'aa33bbb44').group()
'33'
>>> re.search(r'\d+|$', 'aazzzbbb').group()
''
>>> re.search(r'\d+|$', '').group()
''
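To see why the |$ alternative yields an empty string, it can help to look at where the match lands. The following is a minimal sketch, assuming the pattern is written as the raw string r'\d+|$'; the names pattern, hit and miss are only illustrative.

```python
import re

# Digits if there are any; otherwise the empty match that `$` allows
# at the end of the string.
pattern = re.compile(r'\d+|$')

hit = pattern.search('aa33bbb44')
print(hit.group(), hit.span())          # 33 (2, 4)

miss = pattern.search('aazzzbbb')
print(repr(miss.group()), miss.span())  # '' (8, 8)
```

When no digits exist, the search only succeeds once it reaches the end of the string, where `$` matches without consuming anything, so the result is an empty string rather than None or an empty list.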
If you only need the first match, then use re.search instead of re.findall:
>>> m = re.search(r'\d+', 'aa33bbb44')
>>> m.group()
'33'
>>> m = re.search(r'\d+', 'aazzzbbb')
>>> m.group()
Traceback (most recent call last):
  File "<pyshell#281>", line 1, in <module>
    m.group()
AttributeError: 'NoneType' object has no attribute 'group'
Then you can use m as a checking condition:
>>> m = re.search(r'\d+', 'aa33bbb44')
>>> if m:
...     print('First number found = {}'.format(m.group()))
... else:
...     print('Not Found')
...
First number found = 33
You shouldn’t be using .findall() at all; .search() is what you want. It finds the leftmost match (or returns None if no match exists).
m = re.search(pattern, text)
result = m.group(0) if m else ""
Whether you want to put that in a function is up to you. It’s unusual to want to return an empty string if no match is found, which is why nothing like that is built in. There is no ambiguity about whether .search() on its own found a match: it returns None if it didn’t, or a match object (SRE_Match in older Python versions) if it did.
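If you do decide to wrap it in a function, one possible shape is sketched below. The name first_match and its default parameter are hypothetical, not part of the re module; the sketch only packages the two lines above.

```python
import re

def first_match(pattern, text, default=''):
    """Return the leftmost match of pattern in text, or default if none."""
    m = re.search(pattern, text)
    # A match object is always truthy, so a plain truth test is enough.
    return m.group(0) if m else default

print(first_match(r'\d+', 'aa33bbb44'))                # 33
print(repr(first_match(r'\d+', 'aazzzbbb')))           # ''
print(first_match(r'\d+', 'aazzzbbb', default=None))   # None
```

Making the fallback a parameter lets the caller choose between '' and None without a second helper.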
I’d go with:
r = re.search(r"\d+", ch)
result = r.group(0) if r else ""
re.search only looks for the first match in the string anyway, so I think it makes your intent slightly clearer than using findall.
Just assign the results to a variable, then iterate over the variable:
text = 'aa33bbb44'
result = re.findall(r'\d+', text)
for item in result:
    print(item)
With assignment expressions (PEP 572, Python 3.8+):
text = 'aa33bbb44'
r = m.group() if (m := re.search(r'\d+', text)) is not None else ''
With re.findall, you can convert the output into an iterator with iter() and call next() on it to get the first result. next() is particularly useful for this task because a default value (e.g. '') can be passed to it; the default is returned if the iterator is empty, i.e. if there are no matches.
next(iter(re.findall(r'\d+', 'aa33bbb44')), '') # '33'
next(iter(re.findall(r'\d+', 'aazzzbbb')), '') # ''
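At this point, next() can be used with re.finditer for the job as well. The default [''] is a one-element list so that indexing with [0] works whether or not a match object is returned.
next(re.finditer(r'\d+', 'aa33bbb44'), [''])[0] # '33'
next(re.finditer(r'\d+', 'aazzzbbb'), [''])[0] # ''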
You can also use the walrus operator with re.search for a one-liner.
m[0] if (m := re.search(r'\d+', 'aa33bbb44')) else '' # '33'
m[0] if (m := re.search(r'\d+', 'aazzzbbb')) else '' # ''
For this specific task, the argument against re.findall is performance and, indeed, for large strings the gap is huge. If there are multiple matches, re.findall is much, much slower than re.search or re.finditer [1]. However, if there are no matches, re.search with the walrus and re.finditer are the fastest, although re.findall is then about as fast and only the |$ variants are noticeably slower [2].
[1] Timings for a string with 1 million characters and 100k matches.
text = 'aabbbccc11'*100_000
%timeit m[0] if (m := re.search(r'\d+', text)) else ''
# 1.94 µs ± 192 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
%timeit next(re.finditer(r'\d+', text), [''])[0]
# 2.38 µs ± 122 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
%timeit next(iter(re.findall(r'\d+', text)), '')
# 59 ms ± 8.65 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit re.search(r'\d+|$', text)[0]
# 2.32 µs ± 300 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
%timeit re.findall(r'\d+|$', text)[0]
# 82.7 ms ± 1.64 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
[2] Timings for a string with 1 million characters and no matches.
text = 'aabbbcccdd'*100000
%timeit m[0] if (m := re.search(r'\d+', text)) else ''
# 26.3 ms ± 662 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit next(re.finditer(r'\d+', text), [''])[0]
# 26 ms ± 195 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit next(iter(re.findall(r'\d+', text)), '')
# 26.2 ms ± 615 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit re.search(r'\d+|$', text)[0]
# 72.9 ms ± 14.1 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit re.findall(r'\d+|$', text)[0]
# 67.8 ms ± 2.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
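The %timeit lines above need IPython. If you want to rerun the comparison in plain Python, a rough sketch with the standard timeit module could look like the following. The function names (via_search, via_finditer, via_findall, benchmark) are made up for illustration, and the numbers you get will depend on your machine and Python version.

```python
import re
import timeit

def via_search(text):
    m = re.search(r'\d+', text)
    return m[0] if m else ''

def via_finditer(text):
    # [''] is the default so that [0] works when there is no match.
    return next(re.finditer(r'\d+', text), [''])[0]

def via_findall(text):
    return next(iter(re.findall(r'\d+', text)), '')

def benchmark(text, number=3):
    """Return {function name: average seconds per call} on text."""
    funcs = (via_search, via_finditer, via_findall)
    return {f.__name__: timeit.timeit(lambda: f(text), number=number) / number
            for f in funcs}

if __name__ == '__main__':
    cases = {
        'many matches': 'aabbbccc11' * 100_000,
        'no matches': 'aabbbcccdd' * 100_000,
    }
    for label, text in cases.items():
        for name, secs in benchmark(text).items():
            print(f'{label:>12}  {name:<13} {secs * 1e3:.3f} ms')
```

Only the three \d+ approaches are included here; the |$ variants could be added as further functions in the same way.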