How to remove any URL within a string in Python
Question:
I want to remove all URLs inside a string (replace them with "").
I searched around but couldn’t really find what I want.
Example:
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/
I want the result to be:
text1
text2
text3
text4
text5
text6
Answers:
This worked for me:
import re
thestring = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"
URLless_string = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', thestring)
print(URLless_string)
Result:
text1
text2
text3
text4
text5
text6
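One thing to watch with this kind of substitution: only the URL text is removed, so the line break that followed each URL stays behind as an empty line. Below is a minimal sketch of a cleanup step. The helper name `drop_blank_lines` and the shortened sample string are illustrative assumptions, not part of the original answer.

```python
import re

# Same idea as the pattern above: scheme, "//", host, optional path segments.
URL = re.compile(r'\w+:/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:/[^\s/]*))*')

def drop_blank_lines(text):
    """Remove lines that are empty or contain only whitespace (hypothetical helper)."""
    return "\n".join(line for line in text.splitlines() if line.strip())

thestring = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3"
without_urls = URL.sub('', thestring)   # the newline after the URL is still there
cleaned = drop_blank_lines(without_urls)
print(repr(without_urls))
print(cleaned)
```

Whether to drop the empty lines depends on the input; if blank lines carry meaning in your text, skip the second step.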
Python script:
import re
text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
Output:
text1
text2
text3
text4
text5
text6
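A small runnable sketch of this line-anchored approach, assuming every URL sits on a line of its own as in the question. The function name `strip_url_lines` and the sample string are assumptions made for illustration.

```python
import re

# With re.MULTILINE, ^ matches at the start of every line.
# The pattern consumes the URL line plus the line break(s) after it.
URL_LINE = re.compile(r'^https?://.*[\r\n]*', flags=re.MULTILINE)

def strip_url_lines(text):
    """Remove every line that begins with http:// or https://."""
    return URL_LINE.sub('', text)

sample = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"
print(strip_url_lines(sample))
```

Because the pattern is anchored with `^`, a URL in the middle of a line is left alone; this version only fits input shaped like the question's example.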
You could also look at it from the other way around…
from urllib.parse import urlparse  # Python 3; in Python 2 use: from urlparse import urlparse
[el for el in ['text1', 'FTP://somewhere.com', 'text2', 'http://blah.com:8080/foo/bar#header'] if not urlparse(el).scheme]
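To apply the same idea to a single string rather than a ready-made list, one option is to split on whitespace first. This is only a sketch; `drop_url_tokens` is a hypothetical name and the input string is made up from the list above.

```python
from urllib.parse import urlparse

def drop_url_tokens(text):
    """Keep only the whitespace-separated tokens that have no URL scheme."""
    kept = [tok for tok in text.split() if not urlparse(tok).scheme]
    return " ".join(kept)

result = drop_url_tokens("text1 FTP://somewhere.com text2 http://blah.com:8080/foo/bar#header")
print(result)
```

Note that splitting and re-joining collapses all whitespace, including newlines, into single spaces. Also, a token of the form word:rest can be parsed as having a scheme, so this works best on text where colons mostly appear in URLs.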
This solution handles http, https, and the usual special characters found in URLs:
import re
def remove_urls(vTEXT):
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return vTEXT
print(remove_urls("this is a test https://sdfs.sdfsdf.com/sdfsdf/sdfsdf/sd/sdfsdfs?bob=%20tree&jef=man lets see this too https://sdfsdf.fdf.com/sdf/f end"))
The shortest way:
re.sub(r'http\S+', '', stringliteral)
The following regular expression in Python works well for detecting URL(s) in the text:
source_text = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6 '''
import re
url_reg = r'[a-z]*[:.]+\S+'
result = re.sub(url_reg, '', source_text)
print(result)
Output:
text1
text2
text3
text4
text5
text6
I know this has already been answered and it's stupid late, but I think this should be here. This is a regex that matches any kind of URL (in fact, any run of non-space characters that contains a dot).
[^ ]+\.[^ ]+
It can be used like this:
re.sub(r'[^ ]+\.[^ ]+', '', sentence)
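A quick check of how broad this kind of pattern is. The sketch assumes the dot is meant as a literal `\.`, and the sample sentence is invented for the demonstration.

```python
import re

# Any run of non-space characters that contains a literal dot.
DOTTED_TOKEN = re.compile(r'[^ ]+\.[^ ]+')

sentence = "see http://url.com/bla and www.example.org or version 3.14 now"
stripped = DOTTED_TOKEN.sub('', sentence)
print(stripped)
```

The URLs are removed, but so is `3.14`, and file names or abbreviations with an inner dot would go the same way. That may be acceptable for tweets or chat text, less so for technical prose.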
Removal of HTTP links/URLs mixed up in any text:
import re
re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)
I wasn't able to find any that handled my particular situation, which was removing URLs in the middle of tweets that also have whitespace in the middle of the URLs, so I made my own:
(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*
Here's an explanation:
(https?:\/\/)
matches http:// or https://
(\s)*
optionally matches whitespace
(www\.)?
optionally matches www.
(\s)*
optionally matches whitespace
((\w|\s)+\.)*
matches zero or more groups of word characters (or whitespace) followed by a period
([\w\-\s]+\/)*
matches zero or more groups of word characters (or a dash or a space) followed by /
([\w\-]+)
matches any remaining path at the end of the URL, followed by an optional ending
((\?)?[\w\s]*=\s*[\w\%&]*)*
matches trailing query parameters (even with whitespace, etc.)
Test this out here: https://regex101.com/r/NmVGOo/8
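The piece-by-piece explanation above maps naturally onto `re.VERBOSE`, which lets the comments live inside the pattern. This is a sketch that assumes the pattern is as reconstructed here; the name `TWEET_URL` and the sample tweet are made up.

```python
import re

TWEET_URL = re.compile(r"""
    (https?://)                    # http:// or https://
    (\s)*                          # optional whitespace
    (www\.)?                       # optional www.
    (\s)*                          # optional whitespace
    ((\w|\s)+\.)*                  # host labels, each followed by a period
    ([\w\-\s]+/)*                  # path segments, each followed by a slash
    ([\w\-]+)                      # final path segment
    ((\?)?[\w\s]*=\s*[\w%&]*)*     # query parameters, whitespace tolerated
""", re.VERBOSE)

tweet = "check https://www. example.com/some path/page?x=1 out"
cleaned = TWEET_URL.sub('', tweet)
print(cleaned)
```

Because whitespace is allowed inside the host and path, the pattern can also swallow ordinary words that follow a URL when a later period or slash appears, so it is worth testing on your own data before relying on it.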
What you really want to do is remove any string that starts with either http:// or https:// followed by any run of non-whitespace characters. Here is how I would solve it. My solution is very similar to that of @tolgayilmaz.
# Define the text from which you want to remove the URL.
text ='''The link to this post is https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python'''
import re
# Either use:
re.sub(r'http://\S+|https://\S+', '', text)
# OR
re.sub(r'http[s]?://\S+', '', text)
The result of running either snippet above is:
>>> 'The link to this post is '
I prefer the second one because it is more readable.
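Requiring `://` after the scheme also avoids a false positive that the shorter "starts with http" patterns have. A small comparison, using an invented sentence:

```python
import re

text = "edit httpd.conf then open https://example.com/docs today"

# Anything starting with "http" is removed, including the file name.
loose = re.sub(r'http\S+', '', text)

# Only tokens with a real http:// or https:// prefix are removed.
strict = re.sub(r'https?://\S+', '', text)

print(loose)
print(strict)
```

Both versions leave a double space where the URL used to be; collapse it afterwards if that matters for your output.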
import re
s = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/'''
g = re.findall(r'(text\d+)', s)
print('list', g)
for i in g:
    print(i)
Output:
list ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']
text1
text2
text3
text4
text5
text6
To remove any URL within a string in Python, you can use this regex function:
import re
def remove_URL(text):
    """Remove URLs from a text string"""
    return re.sub(r"http\S+", "", text)
I think the most general URL regex pattern is this one:
URL_PATTERN = r'[A-Za-z0-9]+://[A-Za-z0-9%-_]+(/[A-Za-z0-9%-_])*(#|\?)[A-Za-z0-9%-_&=]*'
There is a small module that does what you want:
pip install mysmallutils
from mysutils.text import remove_urls
remove_urls(text)
A simple non-greedy .*? with a positive lookahead should do the job.
import re
text="text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"
req=re.sub(r'http.*?(?=\s)', " ", text)
print(req)
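One caveat with a whitespace-only lookahead: a URL at the very end of the string has no whitespace after it and is therefore not matched. A possible variation, shown here as a sketch with a made-up sample string, is to accept the end of the string as well:

```python
import re

text = "text1\nhttp://url.com/bla1/blah1/\ntext2 http://url.com/bla2/blah2/"

# Lookahead for whitespace only: the final URL has nothing after it, so it stays.
kept_last = re.sub(r'http.*?(?=\s)', '', text)

# Lookahead for whitespace or end of string: both URLs are removed.
removed_all = re.sub(r'http.*?(?=\s|$)', '', text)

print(repr(kept_last))
print(repr(removed_all))
```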
Why not use this? It is quite complete:
i = re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)", "", i)