How to get specific part of any url using urlparse()?
Question:
I have a URL like this:
url = 'https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'
When I use the urlparse() function, I get a result like this:
>>> url = urlparse(url)
>>> url.path
'/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'
Is it possible to get something like this:
path1 = "firearms"
path2 = "handguns"
path3 = "semi-automatic-handguns"
I also don't want any segment that has ".html" at the end.
Answers:
path_list = url.path.split('/')
if ".html" in path_list[-1]:
    path_list = path_list[:-1]
This gives you a list with one entry per path segment and drops the last entry if it contains ".html". Note that the list starts with an empty string, because the path begins with "/".
Depending on how specific or general your use case is, you can adjust the check.
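One adjustment worth considering is the check itself. The answer above uses a substring test, which also matches names that merely contain ".html". The sketch below compares it with str.endswith(); the sample segment names are made up for illustration.

```python
# Hypothetical last segments, used only to compare the two checks.
candidates = ["glock-19.html", "notes.html.bak", "handguns"]

# For each segment: (substring test, suffix test)
results = {s: (".html" in s, s.endswith(".html")) for s in candidates}

for segment, (contains, ends_with) in results.items():
    print(f"{segment!r}: contains={contains}, endswith={ends_with}")
```

If only real ".html" pages should be dropped, the suffix test is probably the safer of the two.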
You can split the path on "/" to put all the parts in a list:
url.path.split('/')
If you want the values in path1, path2 and so on, you can unpack the list into variables. The first element is an empty string (the path starts with "/"), so skip it:
path1, path2, path3 = url.path.split('/')[1:4]
The slice keeps only the first three segments.
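If the number of segments is not known in advance, starred unpacking is one option. This is a minimal sketch using the URL from the question; the name `rest` is just illustrative.

```python
from urllib.parse import urlparse

url = "https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html"

# Drop empty strings (such as the one before the leading "/").
segments = [s for s in urlparse(url).path.split("/") if s]

# Name the first three segments; anything left over lands in `rest`.
# This raises ValueError if the path has fewer than three segments.
path1, path2, path3, *rest = segments

print(path1, path2, path3)
print(rest)
```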
If you don't want the segment with .html, you can check the last value and slice it off like this:
paths = url.path.split('/')
if paths[-1].endswith('.html'):
    no_html_paths = paths[:-1]
else:
    no_html_paths = paths
The full URL contains both single slashes (/) and a double slash (//), so normalize them first if you want to split the URL string itself. With url.path you can split directly:
url = '/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'
url = url.split('/')
url = list(filter(None, url))  # remove empty elements
url.pop()  # drop the last segment (the .html page)
print(url)
Output:
['firearms', 'handguns', 'semi-automatic-handguns']
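To illustrate the difference between splitting the raw URL and splitting only its path, here is a small sketch using the URL from the question. The variable names `raw_parts` and `path_parts` are just for this example.

```python
from urllib.parse import urlparse

url = "https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html"

# Splitting the whole URL: the scheme and host end up in the list too.
raw_parts = list(filter(None, url.split("/")))
print(raw_parts[:3])   # ['https:', 'grabagun.com', 'firearms']

# Splitting only the parsed path: just the path segments remain.
path_parts = list(filter(None, urlparse(url).path.split("/")))
print(path_parts[:3])  # ['firearms', 'handguns', 'semi-automatic-handguns']
```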
Part 2
If you want to turn them into variables, simply iterate over them and create the variables:
for n, val in enumerate(url, start=1):
    globals()["path%d" % n] = val
print(path1)
Output:
firearms
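Writing into globals() works, but the variable names then depend on runtime data. A dictionary is a common alternative. This is a minimal sketch, assuming the segments have already been extracted; the name `named` is only illustrative.

```python
segments = ["firearms", "handguns", "semi-automatic-handguns"]

# Keys start at "path1" to match the numbering asked for in the question.
named = {f"path{n}": value for n, value in enumerate(segments, start=1)}

print(named["path1"])  # firearms
print(named["path3"])  # semi-automatic-handguns
```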
A one-liner solution to your problem could be:
path=urlparse(url).path[1:]
splittedpath=[sp for sp in path.split("/") if not sp.endswith(".html")]
"""
['firearms', 'handguns', 'semi-automatic-handguns']
"""
You can access these by:
print(splittedpath[0]) # 0,1,2...
# firearms
What we are doing here: first, the leading "/" is removed with urlparse(url).path[1:]; then the path is split at each occurrence of "/" using .split("/"); finally, each resulting segment is kept only if it does not end with ".html".
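Another way to reach the same list, shown here only as a sketch, is to hand the parsed path to pathlib.PurePosixPath from the standard library, which exposes the segments as `.parts` and the extension of the last segment as `.suffix`. Treating a URL path as a POSIX path is an assumption of this example, not something the answers above rely on.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html"

path = PurePosixPath(urlparse(url).path)

# path.parts begins with the root "/"; filter it out to keep real segments.
segments = [part for part in path.parts if part != path.anchor]

# path.suffix is the extension of the final segment (".html" here).
if path.suffix == ".html":
    segments = segments[:-1]

print(segments)  # ['firearms', 'handguns', 'semi-automatic-handguns']
```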
Yes, it is possible to extract the individual path components of a URL like this using urlparse from Python's urllib.parse module.
Here’s one way you can do it:
from urllib.parse import urlparse
url = 'https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'
parsed_url = urlparse(url)
path = parsed_url.path
path_components = path.split('/')
# remove the empty string at the beginning of the list
path_components = path_components[1:]
# remove the last element if it ends with '.html'
if path_components[-1].endswith('.html'):
    path_components = path_components[:-1]
print(path_components)
# Output: ['firearms', 'handguns', 'semi-automatic-handguns']
This code first uses urlparse to parse the URL, and then splits the path component of the URL using the split method. It removes the empty string at the beginning of the list, and then removes the last element if it ends with ‘.html’. The resulting list will contain the individual path components of the URL.
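The same steps can be wrapped in a small helper that also copes with URLs that have no path at all, or that carry a query string or fragment. This is a sketch under those assumptions; `path_segments` is a hypothetical name and the example.com URLs are placeholders.

```python
from urllib.parse import urlparse


def path_segments(url):
    """Return the path segments of `url`, without a trailing *.html page."""
    # urlparse keeps the query string and fragment out of .path.
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Guard against an empty list before looking at the last element.
    if segments and segments[-1].endswith(".html"):
        segments = segments[:-1]
    return segments


print(path_segments("https://example.com/a/b/page.html?ref=1#top"))  # ['a', 'b']
print(path_segments("https://example.com"))                          # []
```

The `segments and ...` guard matters because indexing `[-1]` on an empty list raises IndexError, which is what happens when the URL has an empty path.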