Python split by regular expression
Question:
In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+$")
for bit in split:
    result = pattern.match(bit)
    if result is not None:
        emails.append(bit)
And this works, as long as there is a space between the emails. But this might not always be the case. For example:
Hello, [email protected]
would return the email address. But take the following string:
I know my best friend mailto:[email protected]!
This would return nothing. So the question is: how can I use a regex as the delimiter to split on? I want to get the email address in all cases, regardless of any punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
Answers:
I’d say you’re looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:[email protected]!')
['[email protected]']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text [email protected], text text, [email protected]!')
['[email protected]', '[email protected]']
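As a self-contained variant of the session above, here is a minimal sketch that wraps re.findall in a helper. The function name extract_emails and the example.com / example.org addresses are made up for illustration, and the dot before the domain suffix is escaped so that it only matches a literal dot.

```python
import re

# Unanchored pattern: no ^ or $, so it can match anywhere inside the text.
# The dot before the domain suffix is escaped to match a literal "." only.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")


def extract_emails(text):
    """Return every substring of text that matches EMAIL_RE, in order."""
    return EMAIL_RE.findall(text)


sample = "Text text alice@example.com, text text, bob@example.org!"
print(extract_emails(sample))               # ['alice@example.com', 'bob@example.org']
print(extract_emails("no addresses here"))  # []
```

Because findall returns a list in every case, the caller never has to check for None; an input without addresses simply gives an empty list.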
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove or replace the anchors ^ and $ (for example with \b), e.g.:
r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
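To make the effect of the escaped dot and of \b concrete, here is a small sketch that compares three variants of the pattern. The variable names and the sample addresses are invented for this illustration.

```python
import re

# Dot left unescaped: "." matches any character, not just a literal dot.
unescaped = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+")
# Dot escaped, but no word boundaries around the pattern.
escaped = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
# Dot escaped and \b on both sides, as suggested above.
bounded = re.compile(r"\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b")

# Without escaping, a string with no dot after the @ is still accepted.
print(unescaped.findall("bob@localhost"))  # ['bob@localhost']
print(escaped.findall("bob@localhost"))    # []

# Without \b, the sentence-ending period is swallowed by the last class.
sentence = "Write to dave@example.com."
print(escaped.findall(sentence))  # ['dave@example.com.']
print(bounded.findall(sentence))  # ['dave@example.com']

# re.search returns only the first match, as a match object (or None).
match = bounded.search("mailto:carol@example.org!")
print(match.group())  # carol@example.org
```

re.search fits when at most one address is expected; re.findall fits when the text may hold several.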
The problem I see in your regex is your use of ^, which matches the start of the string, and $, which matches the end of the string. If you remove them and run it on your sample test cases, it works:
>>> re.findall(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+\.[A-Za-z0-9-.]+", "I know my best friend mailto:[email protected]!")
['[email protected]']
>>> re.findall(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+\.[A-Za-z0-9-.]+", "Hello, [email protected]")
['[email protected]']
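The role of the anchors can be checked directly. The following sketch, using a made-up address, shows that the anchored pattern only accepts a token that is nothing but the address, while the unanchored one finds the address inside surrounding text.

```python
import re

# Same character classes as in the question, with and without ^ and $.
anchored = re.compile(r"^[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")
unanchored = re.compile(r"[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

# A token as produced by splitting on spaces, with punctuation attached.
token = "mailto:erin@example.net!"
print(anchored.match(token))                           # None
print(anchored.match("erin@example.net") is not None)  # True

# Without the anchors, the address is found inside the full sentence.
text = "I know my best friend mailto:erin@example.net!"
print(unanchored.findall(text))  # ['erin@example.net']
```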