How do I remove a substring from the end of a string?
Question:
I have the following code:
url = 'abcdc.com'
print(url.strip('.com'))
I expected: abcdc
I got: abcd
Now I do
url.rsplit('.com', 1)
Is there a better way?
See How do the .strip/.rstrip/.lstrip string methods work in Python? for a specific explanation of what the first attempt is doing.
Answers:
strip doesn’t mean "remove this substring". x.strip(y) treats y as a set of characters and strips any characters in that set from both ends of x.
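To make the set-of-characters behaviour concrete, here is a small sketch using the URL from the question. Nothing here goes beyond the standard str.strip method; the variable name stripped is just for illustration.

```python
url = 'abcdc.com'

# strip() removes any of the characters '.', 'c', 'o', 'm' from both ends,
# so the trailing 'c' of 'abcdc' is removed along with '.com'.
stripped = url.strip('.com')
print(stripped)  # abcd

# The order of the characters in the argument does not matter.
print(url.strip('moc.') == stripped)  # True
```

That is why the question's first attempt returns abcd rather than abcdc.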
On Python 3.9 and newer you can use the removeprefix and removesuffix methods to remove an entire substring from either side of the string:
url = 'abcdc.com'
url.removesuffix('.com') # Returns 'abcdc'
url.removeprefix('abcdc.') # Returns 'com'
The relevant Python Enhancement Proposal is PEP-616.
On Python 3.8 and older you can use endswith and slicing:
url = 'abcdc.com'
if url.endswith('.com'):
    url = url[:-4]
Or a regular expression:
import re
url = 'abcdc.com'
url = re.sub(r'\.com$', '', url)  # escape the dot so it matches a literal '.'
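One thing to watch with the regular expression route is that an unescaped dot matches any character. A minimal sketch, assuming a hypothetical helper name remove_suffix_re, that uses re.escape so the suffix is matched literally:

```python
import re

def remove_suffix_re(text, suffix):
    # re.escape makes characters such as '.' match literally;
    # '$' anchors the match at the end of the string.
    return re.sub(re.escape(suffix) + r'$', '', text)

print(remove_suffix_re('abcdc.com', '.com'))  # abcdc
print(remove_suffix_re('telecom', '.com'))    # telecom (unchanged)

# Without escaping, '.' matches any character, so 'ecom' is removed.
print(re.sub('.com$', '', 'telecom'))         # tel
```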
How about url[:-4]?
This is a perfect use for regular expressions:
>>> import re
>>> re.match(r"(.*)\.com$", "hello.com").group(1)
'hello'
If you know it’s an extension, then
url = 'abcdc.com'
...
url.rsplit('.', 1)[0] # split at '.', starting from the right, maximum 1 split
This works equally well with abcdc.com or www.abcdc.com or abcdc.[anything] and is more extensible.
It depends on what you know about your URL and exactly what you’re trying to do. If you know that it will always end in ‘.com’ (or ‘.net’ or ‘.org’) then
url = url[:-4]
is the quickest solution. If it’s a more general URL then you’re probably better off looking into the urlparse library that comes with Python (urllib.parse in Python 3).
If, on the other hand, you simply want to remove everything after the final ‘.’ in a string then
url.rsplit('.', 1)[0]
will work. Or if you just want everything up to the first ‘.’ then try
url.split('.', 1)[0]
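For the general-URL case, a rough sketch of what the standard library route might look like on Python 3. The example URL and the variable name host are made up for illustration.

```python
from urllib.parse import urlparse

url = 'http://www.stackoverflow.com/questions?tab=newest'

# urlparse splits the URL into components; hostname excludes scheme, port and path.
host = urlparse(url).hostname
print(host)  # www.stackoverflow.com

# Drop the last dot-separated label of the host name.
print(host.rsplit('.', 1)[0])  # www.stackoverflow
```

Parsing first means a dot in the path or query string cannot be mistaken for the one in front of the top-level domain.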
def strip_end(text, suffix):
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text
If you are sure that the string only appears at the end, then the simplest way would be to use ‘replace’:
url = 'abcdc.com'
print(url.replace('.com',''))
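The "only appears at the end" condition matters. A short sketch, with a made-up URL, of what happens when it does not hold, next to an end-anchored alternative:

```python
url = 'www.comcast.com'

# replace() removes every occurrence, including the one inside 'comcast'.
print(url.replace('.com', ''))  # wwwcast

# Checking the end of the string first only touches the trailing suffix.
suffix = '.com'
trimmed = url[:-len(suffix)] if url.endswith(suffix) else url
print(trimmed)  # www.comcast
```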
On Python 3.9+:
text.removesuffix(suffix)
On any Python version:
def remove_suffix(text, suffix):
    return text[:-len(suffix)] if text.endswith(suffix) and len(suffix) != 0 else text
or the one-liner:
remove_suffix = lambda text, suffix: text[:-len(suffix)] if text.endswith(suffix) and len(suffix) != 0 else text
You can use split:
'abccomputer.com'.split('.com',1)[0]
# 'abccomputer'
For urls (as it seems to be a part of the topic by the given example), one can do something like this:
import os
url = 'http://www.stackoverflow.com'
name, ext = os.path.splitext(url)
print((name, ext))
# Or:
ext = '.' + url.split('.')[-1]
name = url[:-len(ext)]
print((name, ext))
Both will output:
('http://www.stackoverflow', '.com')
This can also be combined with str.endswith(suffix) if you need to just split “.com”, or anything specific.
Since it seems like nobody has pointed this out yet:
url = "www.example.com"
new_url = url[:url.rfind(".")]
This should be more efficient than the methods using split() as no new list object is created, and this solution works for strings with several dots.
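One edge case worth guarding against with rfind is a string that contains no dot at all. A minimal sketch, assuming a hypothetical helper name strip_last_label:

```python
def strip_last_label(text):
    # rfind returns -1 when there is no dot; slicing with -1 would
    # silently drop the last character, so handle that case explicitly.
    dot_index = text.rfind('.')
    return text if dot_index == -1 else text[:dot_index]

print(strip_last_label('www.example.com'))    # www.example
print(strip_last_label('localhost'))          # localhost

# The unguarded slice loses a character when there is no dot.
print('localhost'[:'localhost'.rfind('.')])   # localhos
```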
In my case I needed to raise an exception so I did:
class UnableToStripEnd(Exception):
    """An exception type to indicate that the suffix cannot be removed from the text."""
    @staticmethod
    def get_exception(text, suffix):
        return UnableToStripEnd("Could not find suffix ({0}) on text: {1}."
                                .format(suffix, text))
def strip_end(text, suffix):
    """Removes the end of a string. Otherwise fails."""
    if not text.endswith(suffix):
        raise UnableToStripEnd.get_exception(text, suffix)
    return text[:len(text)-len(suffix)]
import re
def rm_suffix(url='abcdc.com', suffix='.com'):
    # re.escape keeps characters such as '.' from acting as regex wildcards
    return re.sub(re.escape(suffix) + '$', '', url)
I want to repeat this answer as the most expressive way to do it. Of course, the following would take less CPU time:
def rm_dotcom(url='abcdc.com'):
    return url[:-4] if url.endswith('.com') else url
However, if CPU is the bottleneck, why write in Python?
When is CPU a bottleneck anyway? In drivers, maybe.
The advantage of using a regular expression is code reusability. What if you next want to remove ‘.me’, which only has three characters?
The same code would do the trick:
>>> rm_suffix('abcdc.me', '.me')
'abcdc'
DISCLAIMER: This method has a critical flaw in that the partition is not anchored to the end of the URL and may return spurious results. For example, the result for the URL "www.comcast.net" is "www" (incorrect) instead of the expected "www.comcast.net". This solution therefore is evil. Don’t use it unless you know what you are doing!
url.rpartition('.com')[0]
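This is fairly easy to type. Note, however, that when the suffix ‘.com’ is missing from url, rpartition returns an empty first element, so this expression yields '' rather than the original string; url.partition('.com')[0] returns the original string (no error) in that case.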
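A quick sketch of how str.partition and str.rpartition behave in the cases discussed here. These are plain standard-library calls; the example strings are only illustrative.

```python
# rpartition splits at the LAST occurrence; partition at the FIRST.
print('abcdc.com'.rpartition('.com'))   # ('abcdc', '.com', '')

# Separator missing: rpartition puts the whole string in the last slot,
# partition puts it in the first.
print('abcdc.net'.rpartition('.com'))   # ('', '', 'abcdc.net')
print('abcdc.net'.partition('.com'))    # ('abcdc.net', '', '')

# Neither is anchored to the end of the string.
print('www.comcast.net'.rpartition('.com')[0])  # www
```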
If you mean to only strip the extension:
'.'.join('abcdc.com'.split('.')[:-1])
# 'abcdc'
It works with any extension, with potential other dots existing in filename as well. It simply splits the string as a list on dots and joins it without the last element.
Here is the simplest code I have:
url=url.split(".")[0]
Assuming you want to remove the domain, no matter what it is (.com, .net, etc.), I recommend finding the . and removing everything from that point on.
url = 'abcdc.com'
dot_index = url.rfind('.')
url = url[:dot_index]
Here I’m using rfind to solve the problem of URLs like abcdc.com.net, which should be reduced to the name abcdc.com.
If you’re also concerned about www. prefixes, you should explicitly check for them:
if url.startswith("www."):
    url = url.replace("www.", "", 1)
The 1 in replace is for strange edge cases like www.net.www.com.
If your URL gets any wilder than that, look at the regex answers people have responded with.
If you need to strip the end of a string if it exists and otherwise do nothing, here are my best solutions. You will probably want to use one of the first two implementations; I have included the third for completeness.
For a constant suffix:
def remove_suffix(v, s):
    # the "s and" check avoids v[:-0], which would return '' for an empty suffix
    return v[:-len(s)] if s and v.endswith(s) else v
remove_suffix("abc.com", ".com") == 'abc'
remove_suffix("abc", ".com") == 'abc'
For a regex:
import re
def remove_suffix_compile(suffix_pattern):
    r = re.compile(f"(.*?)({suffix_pattern})?$")
    return lambda v: r.match(v)[1]
# the dot is escaped so that it only matches a literal '.'
remove_domain = remove_suffix_compile(r"\.[a-zA-Z0-9]{3,}")
remove_domain("abc.com") == "abc"
remove_domain("sub.abc.net") == "sub.abc"
remove_domain("abc.") == "abc."
remove_domain("abc") == "abc"
For a collection of constant suffixes the asymptotically fastest way for a large number of calls:
def remove_suffix_preprocess(*suffixes):
    suffixes = set(suffixes)
    try:
        suffixes.remove('')
    except KeyError:
        pass
    def helper(suffixes, pos):
        if len(suffixes) == 1:
            suf = suffixes[0]
            l = -len(suf)
            ls = slice(0, l)
            return lambda v: v[ls] if v.endswith(suf) else v
        # ml: length of the shortest suffix not yet fully matched;
        # exact: some suffix is already fully matched at pos.
        # Every suffix is examined, so the result does not depend on iteration order.
        ml = None
        exact = False
        for suf in suffixes:
            l = len(suf)
            if -l == pos:
                exact = True
            elif ml is None or l < ml:
                ml = l
        ml = -ml
        suffix_dict = {}
        for suf in suffixes:
            sub = suf[ml:pos]
            if sub in suffix_dict:
                suffix_dict[sub].append(suf)
            else:
                suffix_dict[sub] = [suf]
        if exact:
            del suffix_dict['']
            for key in suffix_dict:
                suffix_dict[key] = helper([s[:pos] for s in suffix_dict[key]], None)
            return lambda v: suffix_dict.get(v[ml:pos], lambda v: v)(v[:pos])
        else:
            for key in suffix_dict:
                suffix_dict[key] = helper(suffix_dict[key], ml)
            return lambda v: suffix_dict.get(v[ml:pos], lambda v: v)(v)
    return helper(tuple(suffixes), None)
domain_remove = remove_suffix_preprocess(".com", ".net", ".edu", ".uk", '.tv', '.co.uk', '.org.uk')
The final one is probably significantly faster in PyPy than in CPython. The regex variant is likely faster than this for virtually all cases that do not involve huge dictionaries of potential suffixes that cannot be easily represented as a regex, at least in CPython.
In PyPy the regex variant is almost certainly slower for a large number of calls or long strings, even if the re module uses a DFA-compiling regex engine, as the vast majority of the overhead of the lambdas will be optimized out by the JIT.
In CPython, however, the fact that you’re running C code for the regex comparison almost certainly outweighs the algorithmic advantages of the suffix collection version in almost all cases.
Edit: https://m.xkcd.com/859/
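For a small collection of suffixes, a much simpler loop is often enough. The sketch below is not the preprocessing approach above; it assumes a hypothetical helper name remove_any_suffix and tries longer suffixes first so that '.co.uk' wins over '.uk'.

```python
def remove_any_suffix(text, suffixes):
    # Try longer suffixes first so '.co.uk' is preferred over '.uk'.
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Skip empty suffixes: text[:-0] would be the empty string.
        if suffix and text.endswith(suffix):
            return text[:-len(suffix)]
    return text

suffixes = ['.com', '.net', '.uk', '.co.uk']
print(remove_any_suffix('example.co.uk', suffixes))  # example
print(remove_any_suffix('example.uk', suffixes))     # example
print(remove_any_suffix('example.org', suffixes))    # example.org
```

This does work proportional to the number of suffixes on every call, which is usually fine for a handful of them.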
Starting in Python 3.9, you can use removesuffix instead:
'abcdc.com'.removesuffix('.com')
# 'abcdc'
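I used the built-in rstrip method as follows. Note that rstrip treats its argument as a set of characters, so this only works when the text before the suffix does not end in one of those characters (for example, 'abcdc.com'.rstrip('.com') returns 'abcd'):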
string = "test.com"
suffix = ".com"
newstring = string.rstrip(suffix)
print(newstring)
test
Python >= 3.9:
'abcdc.com'.removesuffix('.com')
Python < 3.9:
def remove_suffix(text, suffix):
    # the non-empty check avoids text[:-0], which would return ''
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text
remove_suffix('abcdc.com', '.com')
A broader solution, adding the possibility to replace the suffix (you can remove by replacing with the empty string) and to set the maximum number of replacements:
def replacesuffix(s, old, new='', limit=1):
    """
    String suffix replace; if the string ends with the suffix given by parameter `old`, such suffix is replaced with the string given by parameter `new`. The number of replacements is limited by parameter `limit`, unless `limit` is negative (meaning no limit).
    :param s: the input string
    :param old: the suffix to be replaced
    :param new: the replacement string. Default value the empty string (suffix is removed without replacement).
    :param limit: the maximum number of replacements allowed. Default value 1.
    :returns: the input string with a certain number (depending on parameter `limit`) of the rightmost occurrences of string given by parameter `old` replaced by string given by parameter `new`
    """
    if s[len(s)-len(old):] == old and limit != 0:
        return replacesuffix(s[:len(s)-len(old)], old, new, limit-1) + new
    else:
        return s
In your case, given the default arguments, the desired result is obtained with:
>>> replacesuffix('abcdc.com','.com')
'abcdc'
Some more general examples:
>>> replacesuffix('whatever-qweqweqwe','qwe','N',2)
'whatever-qweNN'
>>> replacesuffix('whatever-qweqweqwe','qwe','N',-1)
'whatever-NNN'
>>> replacesuffix('12.53000','0',' ',-1)
'12.53   '
Because this is a very popular question, I am adding another, now available, solution. Python 3.9 (https://docs.python.org/3.9/whatsnew/3.9.html) added the method removesuffix() (and removeprefix()), which does exactly what was asked here.
url = 'abcdc.com'
print(url.removesuffix('.com'))
output:
'abcdc'
PEP 616 (https://www.python.org/dev/peps/pep-0616/) shows how it behaves (it is not the real implementation):
def removeprefix(self: str, prefix: str, /) -> str:
    if self.startswith(prefix):
        return self[len(prefix):]
    else:
        return self[:]
and what benefits it has over self-implemented solutions:
- Less fragile: The code will not depend on the user to count the length of a literal.
- More performant: The code does not require a call to the Python built-in len function nor to the more expensive str.replace() method.
- More descriptive: The methods give a higher-level API for code readability as opposed to the traditional method of string slicing.
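If the same code has to run on both older and newer interpreters, one possible sketch is a small wrapper that prefers the built-in method when it is available. The name remove_suffix is an assumption for illustration, not something defined by the PEP.

```python
def remove_suffix(text, suffix):
    # Use the built-in method where it exists (Python 3.9+).
    if hasattr(str, 'removesuffix'):
        return text.removesuffix(suffix)
    # Fallback: the non-empty check avoids text[:-0], which would be ''.
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

print(remove_suffix('abcdc.com', '.com'))      # abcdc
print(remove_suffix('abcdc.com.com', '.com'))  # abcdc.com (only one copy removed)
print(remove_suffix('abcdc.com', ''))          # abcdc.com
```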
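Using replace and count
This might seem like a bit of a hack, but it lets you do the replacement without startswith and an if statement: using the count argument of replace, you can limit the replacement to one occurrence. Note that replace removes the first occurrence wherever it is, so this is only safe when the string is known to start (or end) with the substring, or when the substring cannot occur elsewhere in it:
mystring = "www.comwww.com"
Prefix:
print(mystring.replace("www.","",1))
Suffix (you write the suffix reversed, so .com becomes moc.):
print(mystring[::-1].replace("moc.","",1)[::-1])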