Regular expression for removing all URLs in a string in Python

Question:

I want to delete all the URLs in the sentence.

Here is my code:

import ijson
import re
f = open("/content/drive/My Drive/PTT 爬蟲/content/MakeUp/PTT_MakeUp_content_0_1000.json")
objects = ijson.items(f, 'item')

for obj in list(objects):
    article = obj['content']
    ret = re.findall("http[s*]:[a-zA-Z0-9_.+-/#~]+ ", article) # Question here
    for r in ret:
        article = article.replace(r, "")
    print(article)

But a URL that starts with "http" is still left in the sentence:

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"

How can I fix it?

Asked By: ching-yu


Answers:

The URL starts with http, but your pattern uses [s*], which is a character class that matches exactly one character, either s or *. Because http is followed directly by : in your example, the pattern never matches.

I think you are looking for

https?:[a-zA-Z0-9_.+-/#~]+


import re
regex = r"https?:[a-zA-Z0-9_.+-/#~]+ "
article = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
result = re.sub(regex, "", article)
print(result)

Result

眼影盤長這樣 說真的 很不好拍

A shorter and somewhat broader expression matches a non-whitespace character \S one or more times, followed by zero or more spaces to consume the trailing space, as in your original pattern.

\bhttps?:\S+ *
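
As a minimal sketch of how that shortened pattern behaves with re.sub, written with the backslashes spelled out (the name url_pattern is only illustrative, and the second string is a made-up variant of the example sentence):

```python
import re

# \b requires a word boundary before "http", \S+ takes every
# non-whitespace character, and " *" consumes any spaces after the URL.
url_pattern = re.compile(r"\bhttps?:\S+ *")

article = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
cleaned = url_pattern.sub("", article)
print(cleaned)

# Caveat: in Python 3, CJK characters count as word characters, so \b
# finds no boundary when the URL directly follows one without a space.
glued = "眼影盤長這樣http://i.imgur.com/uxvRo3h.jpg 說真的"
print(url_pattern.sub("", glued) == glued)
```

If the text may contain URLs glued to the preceding word like this, dropping the \b is probably the safer choice.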


Answered By: The fourth bird

Change the [s*] to s?. The former is a set of two characters, exactly one of which must appear. The latter is an optional character. There are websites like regex101.com that let you experiment with regular expressions in the Python dialect. They explain the interpretation of each part of the regex.
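
To see the difference concretely, here is a small sketch comparing the two forms on a few made-up strings (the example.com addresses and variable names are just placeholders):

```python
import re

# "[s*]" is a character class: exactly one character, either "s" or "*".
char_class = re.compile(r"http[s*]:")
# "s?" is an optional "s": zero or one occurrence.
optional_s = re.compile(r"https?:")

samples = ["http://a.example.com", "https://a.example.com", "http*://a.example.com"]
for text in samples:
    # Print whether each pattern matches at the start of the string.
    print(text, bool(char_class.match(text)), bool(optional_s.match(text)))
```

Only the s? version accepts plain http:, which is why the URL in the question was never matched.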

Answered By: gilch

One simple fix would be to just replace everything matched by the pattern https?://\S+ with an empty string:

import re
article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
output = re.sub(r'https?://\S+', '', article_example)
print(output)

This prints:

眼影盤長這樣  說真的 很不好拍

My pattern assumes that all non-whitespace characters following http:// or https:// are part of the URL.
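
That assumption has two visible side effects: the spaces on both sides of the URL stay behind, and punctuation that touches the URL is removed along with it. A rough sketch of one way to handle the first and observe the second (strip_urls is a hypothetical helper name, and collapsing all whitespace, newlines included, into single spaces is my own assumption about what is wanted):

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove URLs, then collapse the whitespace runs they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    # Assumption: any run of whitespace may be replaced by a single space.
    return re.sub(r"\s+", " ", without_urls).strip()

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
print(strip_urls(article_example))

# The closing parenthesis and the period are not whitespace, so \S+
# swallows them together with the URL.
print(strip_urls("See the photo (http://i.imgur.com/uxvRo3h.jpg). Thanks"))
```

If trailing punctuation matters for your data, a stricter character class than \S would be needed.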

Answered By: Tim Biegeleisen