Python regex to get the closest match without duplicated content

Question:

What I need

I have a list of img src links. Here is an example:

  • https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1
  • https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1
  • https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1

I need to get the following result:

studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg

studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg

studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png

Problem

I use the following regex:

studiocake.kiev.ua.*(jpeg|png|jpg)

But it doesn’t work the way I need. Instead of the result I need, I get a link like:

studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg 

Question

How can I get the result I need with a Python regex?

Asked By: TiBrains


Answers:

What you want to do is a standard operation on URLs, and Python has a good number of libraries for that. Instead of using a regex here, I would recommend a URL parsing library, which provides these operations out of the box and leads to cleaner code.

from urllib.parse import urlparse, parse_qs


def extractSrc(strUrl):
  # Parse the original URL using urllib
  parsed_url = urlparse(strUrl)

  # Find the value of the query parameter src
  src_value = parse_qs(parsed_url.query)['src'][0]

  # Using the same library again, parse the src URL we got above.
  img_parsed_url = urlparse(src_value)

  # Remove the scheme from the src URL and return the result.
  scheme = "%s://" % img_parsed_url.scheme
  return img_parsed_url.geturl().replace(scheme, '', 1)



urls = '''https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1'''

for u in urls.splitlines():
  print(extractSrc(u))

Output:

studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg
studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg
studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png
Answered By: Yogesh Kumar Gupta

You can let a greedy .* consume everything up to the last occurrence of the domain and capture only what follows.

import re

# s is the input text, e.g. the URLs one per line
matches = re.findall(r"(?i).*\b(studiocake\.kiev\.ua\S*\b(?:jpeg|png|jpg))\b", s)

See this demo at regex101 (matches in group 1) or a Python demo at tio.run


Inside, I used \S* to match any number of non-whitespace characters.
I also added some \b word boundaries and the (?i) flag to ignore case.
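
As a self-contained sketch of the greedy-prefix idea, here it is applied to the three URLs from the question, one at a time. The names URLS, PATTERN and extract are only illustrative and are not part of the original answer.

```python
import re

# Sample input copied from the question.
URLS = [
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1",
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1",
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1",
]

# The leading greedy .* swallows as much as possible, so group 1 can only
# start at the LAST "studiocake.kiev.ua" in the string.
PATTERN = re.compile(r"(?i).*\b(studiocake\.kiev\.ua\S*\b(?:jpeg|png|jpg))\b")


def extract(url):
    """Return the captured image path, or None if the URL does not match."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


results = [extract(u) for u in URLS]
for r in results:
    print(r)
```

Because the leading .* is greedy, the engine backtracks from the end of the string, so the capture begins at the last studiocake.kiev.ua rather than the first.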

Answered By: bobble bubble

My hack expression is this:

(https://)(studiocake.kiev.ua.*(php)?src=https://)(studiocake.kiev.ua.*(jpeg|png|jpg))(&nocache=1)

Then replace the whole match with $4, the fourth capture group (written \4 or \g<4> in Python's re.sub).

Explanation:

I matched the entire link as a series of groups and then replaced it with the one group that is needed.
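
For Python specifically, a minimal sketch of this replacement with re.sub might look like the following. The expression is used as posted (its unescaped dots match any character, which is harmless for these inputs), and URLS, HACK and cleaned are hypothetical names chosen for the example.

```python
import re

# Sample input copied from the question.
URLS = [
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1",
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1",
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1",
]

# The expression from the answer, unchanged. Group 4 is the part to keep.
HACK = re.compile(
    r"(https://)(studiocake.kiev.ua.*(php)?src=https://)"
    r"(studiocake.kiev.ua.*(jpeg|png|jpg))(&nocache=1)"
)

# \g<4> is the Python spelling of the $4 backreference.
cleaned = [HACK.sub(r"\g<4>", u) for u in URLS]
for c in cleaned:
    print(c)
```

Note that this variant relies on every link ending in &nocache=1; a link without that suffix would not match and would be returned unchanged.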

Answered By: Phyln