Python regex to get everything until the first dot in a string
Question:
import re

find = re.compile(r"^(.*)\..*")
for l in lines:
    m = re.match(find, l)
    print(m.group(1))
I want to match everything in a string up to the first dot.
in [email protected], I want a@b
in [email protected], I want a@b
in [email protected], I want a@b
What my code is giving me:
[email protected] prints a@b
[email protected] prints [email protected]
[email protected] prints [email protected]
What should find be so that it only gets a@b?
Answers:
By default, all quantifiers are greedy: they try to consume as much of the string as they can. You can make them reluctant (lazy) by appending a ? after them:
find = re.compile(r"^(.*?)\..*")
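To see the difference concretely, here is a small sketch comparing the greedy and the reluctant pattern. The sample strings are stand-ins (the addresses in the question are obscured on this page), and the names samples, greedy and lazy are only for illustration.

```python
import re

# Hypothetical sample strings; the originals are obscured in the question.
samples = ["a@b.c", "a@b.c.d", "a@b.c.d.e"]

greedy = re.compile(r"^(.*)\..*")   # .* backtracks only as far as the LAST dot
lazy = re.compile(r"^(.*?)\..*")    # .*? expands only as far as the FIRST dot

greedy_results = [greedy.match(s).group(1) for s in samples]
lazy_results = [lazy.match(s).group(1) for s in samples]

print(greedy_results)  # the captured part grows with the number of dots
print(lazy_results)    # the captured part is the same for every sample
```

With these inputs the greedy group captures everything before the last dot, while the reluctant group always stops at the first one.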
As noted in the comments, this approach fails if there is no period in the string: the pattern does not match, so re.match() returns None. How to handle that depends on the behavior you want. If you want the complete string in that case, you can use a negated character class:
find = re.compile(r"^([^.]*).*")
It stops at the first period, or at the end of the string if there is none.
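A quick sketch of the no-period case, assuming a made-up input "localhost" that contains no dot. The helper name first_part is hypothetical, introduced only for this example.

```python
import re

lazy = re.compile(r"^(.*?)\..*")      # requires a literal dot somewhere
negated = re.compile(r"^([^.]*).*")   # the dot is optional here

no_dot = "localhost"  # hypothetical input without a period

# The reluctant pattern cannot match at all, so match() returns None.
lazy_match = lazy.match(no_dot)

def first_part(s):
    """Return everything before the first '.', or all of s if there is none."""
    # Both parts of the pattern may match the empty string, so match()
    # always succeeds and .group(1) is safe to call.
    return negated.match(s).group(1)

print(lazy_match)
print(first_part(no_dot))
print(first_part("a@b.c.d"))
```

Calling .group(1) on lazy_match would raise AttributeError, which is the failure mentioned in the comments.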
Also, you don’t need re.match() there; re.search() works just fine. You can modify your code to:
find = re.compile(r"^[^.]*")
for l in lines:
    print(re.search(find, l).group(0))
You can use .find() instead of a regex in this situation:
>>> s = "[email protected]"
>>> print(s[0:s.find('.')])
a@b
Considering the comments, here is a modified version using .index() (it is similar to .find(), except that it raises ValueError instead of returning -1 when the substring is not found):
>>> s = "[email protected]"
>>> try:
...     index = s.index('.')
... except ValueError:
...     index = len(s)
...
>>> print(s[:index])
a@b
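The reason the no-dot case matters for .find(): it returns -1, and slicing with -1 silently drops the last character instead of failing. A minimal sketch, with a hypothetical helper name before_first_dot and a made-up input "localhost":

```python
def before_first_dot(s):
    """Return the part of s before the first '.', or all of s if there is none."""
    index = s.find(".")
    if index == -1:      # no dot found: keep the whole string
        return s
    return s[:index]

# Without the -1 check, the slice becomes s[0:-1] and loses the last character.
no_dot = "localhost"  # hypothetical input without a period
naive = no_dot[0:no_dot.find(".")]

print(naive)
print(before_first_dot(no_dot))
print(before_first_dot("a@b.c.d"))
```

The explicit check does the same job as the try/except around .index(), just without exceptions.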
I recommend partition or split in this case; both work well when there is no dot.
text = "[email protected]"
print(text.partition(".")[0])
print(text.split(".", 1)[0])
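As a sketch of what the two methods return with and without a dot (the strings are made-up stand-ins, since the sample address is obscured on this page):

```python
with_dot = "a@b.c.d"       # hypothetical input with two periods
without_dot = "localhost"  # hypothetical input with none

# partition always returns a 3-tuple: (head, separator, tail).
parts = with_dot.partition(".")
parts_no_dot = without_dot.partition(".")

# split with maxsplit=1 returns a list of one or two elements.
split_parts = with_dot.split(".", 1)
split_no_dot = without_dot.split(".", 1)

print(parts, parts_no_dot)
print(split_parts, split_no_dot)
```

In every case element [0] is the text before the first dot, or the whole string if there is no dot.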
You can use the split method: split the string at the . character once, and you get a list of two elements (the part before the first period and the part after it). The notation would be:
mystring.split(".", 1)
Then you can write a generator expression that yields both parts for each entry, keep the part you are interested in, and ignore the other (the _ notation). Note that the two-name unpacking assumes every entry contains at least one period. It works as follows:
entries = [
    "[email protected]",
    "[email protected]",
    "[email protected]",
]
for token, _ in (entry.split(".", 1) for entry in entries):
    print(token)
Output:
a@b
a@b
a@b
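One caveat with the unpacking form: if an entry has no period, split returns a one-element list and the two-name unpacking raises ValueError. A possible variant, sketched here with partition and hypothetical entries, avoids that because partition always yields three items:

```python
# Hypothetical entries; the last one deliberately has no period.
entries = ["a@b.c", "a@b.c.d", "localhost"]

def first_tokens(items):
    """Yield the text before the first '.' of each item."""
    for item in items:
        token, _sep, _rest = item.partition(".")  # always a 3-tuple
        yield token

tokens = list(first_tokens(entries))
print(tokens)

# The split-based unpacking fails on the dot-free entry.
try:
    token, _ = "localhost".split(".", 1)
    unpack_failed = False
except ValueError:
    unpack_failed = True
print(unpack_failed)
```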
The documentation for the split method can be found online:
str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made).
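The effect of maxsplit described above can be checked with a short sketch (the string is a made-up example):

```python
s = "a@b.c.d.e"  # hypothetical input with three periods

all_splits = s.split(".")       # no limit: every period splits
one_split = s.split(".", 1)     # at most 1 split, so at most 2 elements
two_splits = s.split(".", 2)    # at most 2 splits, so at most 3 elements
unlimited = s.split(".", -1)    # -1 behaves like leaving maxsplit out

print(all_splits)
print(one_split)
print(two_splits)
```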
import re
data='[email protected]'
re.sub(r'\..*','',data)
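The substitution approach can be wrapped in a small function. This is only a sketch: the name strip_from_first_dot and the inputs are hypothetical. The dot must be escaped; an unescaped `..*` would match from the very first character and delete the whole string.

```python
import re

def strip_from_first_dot(data):
    """Remove the first '.' and everything after it on the same line."""
    # \. matches a literal period; .* then consumes the rest of the line.
    return re.sub(r"\..*", "", data)

print(strip_from_first_dot("a@b.c.d"))
print(strip_from_first_dot("localhost"))  # no period: returned unchanged
```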