How to extract a filename from a URL and append a word to it?

Question:

I have the following URL:

url = http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg

I would like to extract the file name in this URL: 09-09-201315-47-571378756077.jpg

Once I get this file name, I’m going to save it with this name to the Desktop.

filename = **extracted file name from the url**     
download_photo = urllib.urlretrieve(url, "/home/ubuntu/Desktop/%s.jpg" % (filename))

After this, I’m going to resize the photo. Once that is done, I’m going to save the resized version and append the word "_small" to the end of the filename.

downloadedphoto = Image.open("/home/ubuntu/Desktop/%s.jpg" % (filename))               
resize_downloadedphoto = downloadedphoto.resize((300, 300), Image.ANTIALIAS)
resize_downloadedphoto.save("/home/ubuntu/Desktop/%s.jpg" % (filename + "_small"))

From this, what I am trying to achieve is to get two files, the original photo with the original name, then the resized photo with the modified name. Like so:

09-09-201315-47-571378756077.jpg

rename to:

09-09-201315-47-571378756077_small.jpg

How can I go about doing this?

Asked By: deadlock

Answers:

filename = url[url.rfind("/")+1:]
filename_small = filename.replace(".", "_small.")

It is safer to replace ".jpg" rather than "." in the second line, since a dot can also appear elsewhere in the filename.
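To illustrate the caveat about dots in the filename, here is a minimal sketch that uses os.path.splitext so that only the final extension is touched. The helper name add_suffix and the sample name "my.photo.jpg" are made up for illustration and are not part of the answer above.

```python
import os

def add_suffix(filename, suffix="_small"):
    # Hypothetical helper: split off only the last extension, then re-join.
    stem, ext = os.path.splitext(filename)
    return stem + suffix + ext

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url[url.rfind("/") + 1:]

print(add_suffix(filename))        # 09-09-201315-47-571378756077_small.jpg
print(add_suffix("my.photo.jpg"))  # my.photo_small.jpg

# For comparison, replacing every "." changes both dots:
print("my.photo.jpg".replace(".", "_small."))  # my_small.photo_small.jpg
```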

Answered By: RickyA

You can use urllib.parse.urlparse with os.path.basename:

import os
from urllib.parse import urlparse

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

Your URL might contain percent-encoded characters like %20 for space or %E7%89%B9%E8%89%B2 for "特色". If that’s the case, you’ll need to unquote (or unquote_plus) them. You can also use pathlib.Path().name instead of os.path.basename, which makes it easier to add a suffix to the name (as asked in the original question):

from pathlib import Path
from urllib.parse import urlparse, unquote

url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
urlparse(url).path

url_parsed = urlparse(url)
print(unquote(url_parsed.path))  # Output: /kyle/09-09-2013 - 15-47-571378756077.jpg
file_path = Path("/home/ubuntu/Desktop/") / unquote(Path(url_parsed.path).name)
print(file_path)        # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg

new_file = file_path.with_stem(file_path.stem + "_small")
print(new_file)         # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg

Also, an alternative is to use unquote(urlparse(url).path.split("/")[-1]).
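Note that Path.with_stem was only added in Python 3.9. On older versions, a rough equivalent can be built from with_name, which has been available since pathlib was introduced. The sketch below combines it with the split("/")[-1] alternative; PurePosixPath is used here only so the printed result is the same on every operating system.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, unquote

url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"

# Take the last path segment and decode the percent-escapes.
filename = unquote(urlparse(url).path.split("/")[-1])

file_path = PurePosixPath("/home/ubuntu/Desktop") / filename

# with_name() replaces the whole final component, so rebuild it from stem + suffix.
new_file = file_path.with_name(file_path.stem + "_small" + file_path.suffix)
print(new_file)  # /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg
```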

Answered By: Ofir Israel

The question "Python split url to find image name and extension" shows how to extract the image name. To append to the name:

imageName =  '09-09-201315-47-571378756077'

new_name = '{0}_small.jpg'.format(imageName) 
Answered By: Moj

You could just split the url by "/" and retrieve the last member of the list:

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url.split("/")[-1] 
#09-09-201315-47-571378756077.jpg

Then use replace to change the ending:

small_jpg = filename.replace(".jpg", "_small.jpg")
#09-09-201315-47-571378756077_small.jpg
Answered By: Bryan

os.path.basename(url)

Why try harder?

In [1]: os.path.basename("https://example.com/file.html")
Out[1]: 'file.html'

In [2]: os.path.basename("https://example.com/file")
Out[2]: 'file'

In [3]: os.path.basename("https://example.com/")
Out[3]: ''

In [4]: os.path.basename("https://example.com")
Out[4]: 'example.com'

Note 2020-12-20

Nobody has thus far provided a complete solution.

A URL can contain a ?[query-string] and/or a #[fragment identifier] (but only in that order).

In [1]: from os import path

In [2]: def get_filename(url):
   ...:     fragment_removed = url.split("#")[0]  # keep to left of first #
   ...:     query_string_removed = fragment_removed.split("?")[0]
   ...:     scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
   ...:     if scheme_removed.find("/") == -1:
   ...:         return ""
   ...:     return path.basename(scheme_removed)
   ...:

In [3]: get_filename("a.com/b")
Out[3]: 'b'

In [4]: get_filename("a.com/")
Out[4]: ''

In [5]: get_filename("https://a.com/")
Out[5]: ''

In [6]: get_filename("https://a.com/b")
Out[6]: 'b'

In [7]: get_filename("https://a.com/b?c=d#e")
Out[7]: 'b'
Answered By: P i

Sometimes there is a query string:

filename = url.split("/")[-1].split("?")[0] 
new_filename = filename.replace(".jpg", "_small.jpg")
Answered By: user2821

Use urllib.parse.urlparse to get just the path part of the URL, and then use pathlib.Path on that path to get the filename:

from urllib.parse import urlparse
from pathlib import Path


url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true&some_more=true#and-an-anchor"
a = urlparse(url)
a.path             # '/some/long/path/a_filename.jpg'
Path(a.path).name  # 'a_filename.jpg'
Answered By: Boris Verkhovskiy

We can extract the filename from a URL by using the ntpath module.

import ntpath
url = 'http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg'
name, ext = ntpath.splitext(ntpath.basename(url))
# 09-09-201315-47-571378756077  .jpg


print(name + '_small' + ext)
# 09-09-201315-47-571378756077_small.jpg
Answered By: user13415013

With python3 (from 3.4 upwards) you can abuse the pathlib library in the following way:

from pathlib import Path

p = Path('http://example.com/somefile.html')
print(p.name)
# >>> 'somefile.html'

print(p.stem)
# >>> 'somefile'

print(p.suffix)
# >>> '.html'

print(f'{p.stem}-spamspam{p.suffix}')
# >>> 'somefile-spamspam.html'

❗️ WARNING

The pathlib module is NOT meant for parsing URLs — it is designed to work with POSIX paths and not with URLs. Don’t use it in production code! It’s a dirty quick hack for non-critical code. The code is only provided as an example of what you can do but probably should not do. If you need to parse URLs then go with urllib.parse or alternatives.
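A short sketch of what goes wrong in practice: the URL below (with a made-up query string and fragment) is treated as a plain path, so the query and fragment end up in the name and the double slash after the scheme is collapsed. Parsing the URL first avoids both problems.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "http://example.com/somefile.html?v=1#top"

# Treating the whole URL as a path keeps the query string and fragment in the name.
naive = PurePosixPath(url)
print(naive.name)  # somefile.html?v=1#top

# The "//" after the scheme is also collapsed, so the URL cannot be rebuilt from it.
print(str(naive))  # http:/example.com/somefile.html?v=1#top

# Parsing the URL first isolates the path component.
parsed_name = PurePosixPath(urlparse(url).path).name
print(parsed_name)  # somefile.html
```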

Answered By: ccpizza

A simple version using the os package:

import os

def get_url_file_name(url):
    url = url.split("#")[0]
    url = url.split("?")[0]
    return os.path.basename(url)

Examples:

print(get_url_file_name("example.com/myfile.tar.gz"))  # 'myfile.tar.gz'
print(get_url_file_name("example.com/"))  # ''
print(get_url_file_name("https://example.com/"))  # ''
print(get_url_file_name("https://example.com/hello.zip"))  # 'hello.zip'
print(get_url_file_name("https://example.com/args.tar.gz?c=d#e"))  # 'args.tar.gz'

Sometimes the link you have redirects to another URL (that was the case for me). In that case you have to resolve the redirects first:

import requests
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
response = requests.head(url, allow_redirects=True)  # HEAD requests do not follow redirects by default
url = response.url

Then you can continue with the best answer at the moment (Ofir’s):

import os
from urllib.parse import urlparse


a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

This doesn’t work with the example URL, however, as that page isn’t available anymore.

Answered By: GuiTaek

I see people using the pathlib library to parse URLs. This is not a good idea! pathlib is not designed for it; use a dedicated library such as urllib.parse instead.

This is the most stable version I could come up with. It handles params as well as fragments:

from urllib.parse import urlparse, ParseResult

def update_filename(url):
    parsed_url = urlparse(url)
    path = parsed_url.path

    filename = path[path.rfind('/') + 1:]

    if not filename:
        return

    file, extension = filename.rsplit('.', 1)

    # Rebuild only the last segment; str.replace would also change a directory with the same name.
    new_path = path[:path.rfind('/') + 1] + f"{file}_small.{extension}"
    parsed_url = ParseResult(**{**parsed_url._asdict(), 'path': new_path})

    return parsed_url.geturl()

Example:

assert update_filename('https://example.com/') is None
assert update_filename('https://example.com/path/to/') is None
assert update_filename('https://example.com/path/to/report.pdf') == 'https://example.com/path/to/report_small.pdf'
assert update_filename('https://example.com/path/to/filename with spaces.pdf') == 'https://example.com/path/to/filename with spaces_small.pdf'
assert update_filename('https://example.com/path/to/report_01.01.2022.pdf') == 'https://example.com/path/to/report_01.01.2022_small.pdf'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2#test') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2#test'
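One case the examples do not cover is a last path segment without a dot (for instance a file called README), where filename.rsplit('.', 1) returns a single item and the tuple unpacking raises ValueError. A possible variant, sketched here under the assumption that such names should simply get "_small" appended, uses posixpath.splitext instead. The name add_small_suffix is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def add_small_suffix(url):
    # Hypothetical variant of update_filename that tolerates names without an extension.
    parsed = urlparse(url)
    directory, filename = posixpath.split(parsed.path)
    if not filename:
        return None
    # splitext returns an empty extension when the name has no dot.
    stem, ext = posixpath.splitext(filename)
    new_path = posixpath.join(directory, stem + "_small" + ext)
    # _replace keeps scheme, host, query and fragment untouched.
    return parsed._replace(path=new_path).geturl()

print(add_small_suffix('https://example.com/path/to/README'))
# https://example.com/path/to/README_small
print(add_small_suffix('https://example.com/report/report.pdf?param=1#test'))
# https://example.com/report/report_small.pdf?param=1#test
```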
Answered By: funnydman