How to extract a filename from a URL and append a word to it?
Question:
I have the following URL:
url = http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg
I would like to extract the file name in this URL: 09-09-201315-47-571378756077.jpg
Once I get this file name, I’m going to save it with this name to the Desktop.
filename = **extracted file name from the url**
download_photo = urllib.urlretrieve(url, "/home/ubuntu/Desktop/%s.jpg" % (filename))
After this, I’m going to resize the photo. Once that is done, I’m going to save the resized version and append the word "_small" to the end of the filename.
downloadedphoto = Image.open("/home/ubuntu/Desktop/%s.jpg" % (filename))
resize_downloadedphoto = downloadedphoto.resize((300, 300), Image.ANTIALIAS)
resize_downloadedphoto.save("/home/ubuntu/Desktop/%s.jpg" % (filename + "_small"))
From this, what I am trying to achieve is to get two files, the original photo with the original name, then the resized photo with the modified name. Like so:
09-09-201315-47-571378756077.jpg
rename to:
09-09-201315-47-571378756077_small.jpg
How can I go about doing this?
Answers:
filename = url[url.rfind("/")+1:]
filename_small = filename.replace(".", "_small.")
It may be safer to replace ".jpg" rather than "." in the second line, since a dot can also appear elsewhere in the filename.
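If the extension is not known in advance, splitting at the last dot avoids the same problem. This is only a minimal sketch, assuming the suffix should go right before the final extension; the helper name add_suffix and the name report.v2.jpg are made up for illustration.

```python
import os

def add_suffix(filename, suffix="_small"):
    """Insert `suffix` before the last extension of `filename`."""
    stem, ext = os.path.splitext(filename)  # splits at the last dot only
    return stem + suffix + ext

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url[url.rfind("/") + 1:]

print(add_suffix(filename))         # 09-09-201315-47-571378756077_small.jpg
print(add_suffix("report.v2.jpg"))  # report.v2_small.jpg

# For comparison, replacing every dot rewrites the name in two places:
print("report.v2.jpg".replace(".", "_small."))  # report_small.v2_small.jpg
```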
You can use urllib.parse.urlparse with os.path.basename:
import os
from urllib.parse import urlparse
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
a = urlparse(url)
print(a.path) # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path)) # Output: 09-09-201315-47-571378756077.jpg
Your URL might contain percent-encoded characters like %20 for space or %E7%89%B9%E8%89%B2 for "特色". If that’s the case, you’ll need to unquote (or unquote_plus) them. You can also use pathlib.Path().name instead of os.path.basename, which could help to add a suffix in the name (like asked in the original question):
from pathlib import Path
from urllib.parse import urlparse, unquote
url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
urlparse(url).path
url_parsed = urlparse(url)
print(unquote(url_parsed.path)) # Output: /kyle/09-09-2013 - 15-47-571378756077.jpg
file_path = Path("/home/ubuntu/Desktop/") / unquote(Path(url_parsed.path).name)
print(file_path) # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg
new_file = file_path.with_stem(file_path.stem + "_small")
print(new_file) # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg
Also, an alternative is to use unquote(urlparse(url).path.split("/")[-1]).
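Putting the pieces of this answer together, here is a rough sketch that returns both target paths at once. It uses with_name rather than with_stem, since with_stem only exists on Python 3.9+, and PurePosixPath so that the result does not depend on the OS running the code. The function name desktop_paths and its default arguments are assumptions for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, unquote

def desktop_paths(url, directory="/home/ubuntu/Desktop", suffix="_small"):
    """Return (original_path, resized_path) for the file named in `url`."""
    # Last segment of the URL path, with percent-escapes such as %20 decoded.
    name = unquote(PurePosixPath(urlparse(url).path).name)
    if not name:
        raise ValueError("URL has no filename: %r" % url)
    original = PurePosixPath(directory) / name
    # with_name swaps the final component; stem and suffix split at the last dot.
    resized = original.with_name(original.stem + suffix + original.suffix)
    return original, resized

url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
original, resized = desktop_paths(url)
print(original)  # /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg
print(resized)   # /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg
```

The two paths can then be passed to the download and save calls from the question.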
The question "Python split url to find image name and extension" shows how to extract the image name. To append to the name:
imageName = '09-09-201315-47-571378756077'
new_name = '{0}_small.jpg'.format(imageName)
You could just split the url by "/" and retrieve the last member of the list:
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url.split("/")[-1]
#09-09-201315-47-571378756077.jpg
Then use replace to change the ending:
small_jpg = filename.replace(".jpg", "_small.jpg")
#09-09-201315-47-571378756077_small.jpg
os.path.basename(url)
Why try harder?
In [1]: os.path.basename("https://example.com/file.html")
Out[1]: 'file.html'
In [2]: os.path.basename("https://example.com/file")
Out[2]: 'file'
In [3]: os.path.basename("https://example.com/")
Out[3]: ''
In [4]: os.path.basename("https://example.com")
Out[4]: 'example.com'
Note 2020-12-20
Nobody has thus far provided a complete solution.
A URL can contain a ?[query-string] and/or a #[fragment identifier] (but only in that order):
In [1]: from os import path
In [2]: def get_filename(url):
   ...:     fragment_removed = url.split("#")[0]  # keep to left of first #
   ...:     query_string_removed = fragment_removed.split("?")[0]
   ...:     scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
   ...:     if scheme_removed.find("/") == -1:
   ...:         return ""
   ...:     return path.basename(scheme_removed)
   ...:
In [3]: get_filename("a.com/b")
Out[3]: 'b'
In [4]: get_filename("a.com/")
Out[4]: ''
In [5]: get_filename("https://a.com/")
Out[5]: ''
In [6]: get_filename("https://a.com/b")
Out[6]: 'b'
In [7]: get_filename("https://a.com/b?c=d#e")
Out[7]: 'b'
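For comparison, much the same behaviour can be had from the standard library's URL parser, which separates the query string and fragment for you. The sketch below is one possible equivalent, not this answer's own code; the name get_filename_urlsplit is made up, and it assumes the input either carries a scheme or is a bare host/path string like the ones above.

```python
import posixpath
from urllib.parse import urlsplit

def get_filename_urlsplit(url):
    """Return the last path segment of `url`, ignoring ?query and #fragment."""
    path = urlsplit(url).path  # query string and fragment are already split off
    if "/" not in path:
        return ""              # no path component, e.g. "https://a.com"
    return posixpath.basename(path)

for url in ["a.com/b", "a.com/", "https://a.com/", "https://a.com/b",
            "https://a.com/b?c=d#e", "https://a.com"]:
    print(repr(url), "->", repr(get_filename_urlsplit(url)))
# 'a.com/b' -> 'b'
# 'a.com/' -> ''
# 'https://a.com/' -> ''
# 'https://a.com/b' -> 'b'
# 'https://a.com/b?c=d#e' -> 'b'
# 'https://a.com' -> ''
```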
Sometimes there is a query string:
filename = url.split("/")[-1].split("?")[0]
new_filename = filename.replace(".jpg", "_small.jpg")
Use urllib.parse.urlparse to get just the path part of the URL, and then use pathlib.Path on that path to get the filename:
from urllib.parse import urlparse
from pathlib import Path
url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true&some_more=true#and-an-anchor"
a = urlparse(url)
a.path # '/some/long/path/a_filename.jpg'
Path(a.path).name # 'a_filename.jpg'
We can extract the filename from a URL using the ntpath module:
import ntpath
url = 'http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg'
name, ext = ntpath.splitext(ntpath.basename(url))
# 09-09-201315-47-571378756077 .jpg
print(name + '_small' + ext)
# 09-09-201315-47-571378756077_small.jpg
With Python 3 (from 3.4 upwards) you can abuse the pathlib library in the following way:
from pathlib import Path
p = Path('http://example.com/somefile.html')
print(p.name)
# >>> 'somefile.html'
print(p.stem)
# >>> 'somefile'
print(p.suffix)
# >>> '.html'
print(f'{p.stem}-spamspam{p.suffix}')
# >>> 'somefile-spamspam.html'
❗️ WARNING
The pathlib module is NOT meant for parsing URLs; it is designed to work with filesystem paths, not with URLs. Don’t use it in production code! It’s a quick and dirty hack for non-critical code. The code is only provided as an example of what you can do but probably should not do. If you need to parse URLs, go with urllib.parse or alternatives.
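To make the warning concrete, here is a small sketch of two ways the hack goes wrong. PurePosixPath is used so the output is the same on every OS; the example URL with a query string and fragment is invented.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "http://example.com/somefile.html?version=2#top"

# pathlib treats the whole URL as a path: "//" collapses to "/" and the
# query string and fragment end up inside the "filename".
p = PurePosixPath(url)
print(str(p))    # http:/example.com/somefile.html?version=2#top
print(p.name)    # somefile.html?version=2#top
print(p.suffix)  # .html?version=2#top

# Parsing the URL first and handing only its path to pathlib avoids both.
clean = PurePosixPath(urlparse(url).path)
print(clean.name)    # somefile.html
print(clean.suffix)  # .html
```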
A simple version using the os package:
import os
def get_url_file_name(url):
    url = url.split("#")[0]
    url = url.split("?")[0]
    return os.path.basename(url)
Examples:
print(get_url_file_name("example.com/myfile.tar.gz")) # 'myfile.tar.gz'
print(get_url_file_name("example.com/")) # ''
print(get_url_file_name("https://example.com/")) # ''
print(get_url_file_name("https://example.com/hello.zip")) # 'hello.zip'
print(get_url_file_name("https://example.com/args.tar.gz?c=d#e")) # 'args.tar.gz'
Sometimes the link you have redirects somewhere else (that was the case for me). In that case you have to resolve the redirects first:
import requests
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
# requests.head() does not follow redirects unless explicitly asked to
response = requests.head(url, allow_redirects=True)
url = response.url
Then you can continue with the best answer at the moment (Ofir’s):
import os
from urllib.parse import urlparse
a = urlparse(url)
print(a.path) # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path)) # Output: 09-09-201315-47-571378756077.jpg
It doesn’t work with the URL from the question, however, as that page isn’t available anymore.
I see people using the pathlib library to parse URLs. This is not a good idea! pathlib is not designed for it; use dedicated libraries like urllib or similar instead.
This is the most stable version I could come up with. It handles params as well as fragments:
from urllib.parse import urlparse
def update_filename(url):
    parsed_url = urlparse(url)
    path = parsed_url.path
    filename = path[path.rfind('/') + 1:]
    if not filename:
        return None
    file, extension = filename.rsplit('.', 1)
    # Rebuild only the last segment, so an identical string earlier in the path is left alone
    new_path = path[:len(path) - len(filename)] + f"{file}_small.{extension}"
    parsed_url = parsed_url._replace(path=new_path)
    return parsed_url.geturl()
Example:
assert update_filename('https://example.com/') is None
assert update_filename('https://example.com/path/to/') is None
assert update_filename('https://example.com/path/to/report.pdf') == 'https://example.com/path/to/report_small.pdf'
assert update_filename('https://example.com/path/to/filename with spaces.pdf') == 'https://example.com/path/to/filename with spaces_small.pdf'
assert update_filename('https://example.com/path/to/report_01.01.2022.pdf') == 'https://example.com/path/to/report_01.01.2022_small.pdf'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2#test') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2#test'
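One case the function above does not cover is a filename without any dot: rsplit('.', 1) then yields a single item and the unpacking raises ValueError. A possible variant, sketched under the assumption that such names should simply get the suffix appended; add_suffix_to_url is a hypothetical name and the README URL is invented.

```python
import posixpath
from urllib.parse import urlsplit

def add_suffix_to_url(url, suffix="_small"):
    """Return `url` with `suffix` inserted before the file extension, if any."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    if not filename:
        return None  # the path is empty or ends in "/", so there is no file
    stem, ext = posixpath.splitext(filename)  # ext is "" when there is no dot
    new_path = posixpath.join(directory, stem + suffix + ext)
    return parts._replace(path=new_path).geturl()

print(add_suffix_to_url("https://example.com/path/to/report.pdf?x=1#top"))
# https://example.com/path/to/report_small.pdf?x=1#top
print(add_suffix_to_url("https://example.com/path/to/README"))
# https://example.com/path/to/README_small
print(add_suffix_to_url("https://example.com/path/to/"))
# None
```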