How to get specific part of any url using urlparse()?

Question:

I have a URL like this:

url = 'https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

When I use the urlparse() function, I get a result like this:

>>> url = urlparse(url) 
>>> url.path
'/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

Is it possible to get something like this:

path1 = "firearms"
path2 = "handguns"
path3 = "semi-automatic-handguns"

I also don’t want any part that has ".html" at the end.

Asked By: boyenec

Answers:

path_list = url.path.strip('/').split('/')

if path_list[-1].endswith(".html"):
    path_list = path_list[:-1]

This will give you a list with each part of the path as an entry, excluding the last one if it ends with ".html".

You can adjust this depending on how specific or general your use case is.
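
Note that a path starting with "/" produces an empty first entry when it is split directly on "/". A minimal sketch of the difference, using a shortened, made-up URL (the names raw and cleaned are only for illustration):

```python
from urllib.parse import urlparse

# Hypothetical shortened URL, used only for illustration.
url = urlparse('https://example.com/firearms/handguns/item.html')

# Splitting the raw path: the leading "/" yields an empty first entry.
raw = url.path.split('/')
print(raw)      # ['', 'firearms', 'handguns', 'item.html']

# Stripping the slashes before splitting avoids the empty entry.
cleaned = url.path.strip('/').split('/')
print(cleaned)  # ['firearms', 'handguns', 'item.html']
```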

Answered By: arielkaluzhny

You can put all the parts in a list by splitting the path on the /

url.path.split('/')

If you want to put them in path1, path2 and so on, you can assign the values in the list to variables.

path1, path2, path3 = url.path.split('/')[1:4]

The slice skips the empty string produced by the leading / and takes only the next 3 values of the list.
If you don’t want the text with .html, you can check the last value and drop it with list slicing like this.

paths = url.path.split('/')[1:]  # skip the empty string before the leading /
no_html_paths = paths
if paths[-1].endswith('.html'):
    no_html_paths = paths[:-1]
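
If the number of parts is not known in advance, extended unpacking is one way to name the first few and collect the remainder. A sketch, assuming the path has at least three parts (the names segments and rest are illustrative, not from the question):

```python
from urllib.parse import urlparse

url = urlparse('https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html')

# Strip the leading/trailing "/" so no empty strings end up in the list.
segments = url.path.strip('/').split('/')

# Raises ValueError if there are fewer than three segments.
path1, path2, path3, *rest = segments

print(path1, path2, path3)  # firearms handguns semi-automatic-handguns
print(rest)                 # a one-element list holding the .html file name
```
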
Answered By: bener07

The full URL contains both single / and double // separators, so first replace them all with the same separator if you want to split the URL string directly. With url.path you can split directly.

url = '/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

url = url.split('/')
url = list(filter(None, url))  # remove empty elements
url.pop()  # drop the last element (the .html page)
print(url)

Output list:

['firearms', 'handguns', 'semi-automatic-handguns']

Part 2

If you want to make them variables, simply iterate over them and create the variables. Start the numbering at 1 so that path1 is the first part:

for n, val in enumerate(url, start=1):
    globals()["path%d" % n] = val

print(path1)

Output:

firearms
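
Writing into globals() works, but it makes the names hard to track; a dictionary is a common alternative. A minimal sketch, where the dictionary name paths and the key format are assumptions for illustration:

```python
segments = ['firearms', 'handguns', 'semi-automatic-handguns']

# Map "path1", "path2", ... to each segment, numbering from 1.
paths = {f"path{n}": val for n, val in enumerate(segments, start=1)}

print(paths["path1"])  # firearms
print(paths["path3"])  # semi-automatic-handguns
```
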
Answered By: Bhargav

A short solution to your problem could be:

path=urlparse(url).path[1:]

splittedpath=[sp for sp in path.split("/") if not sp.endswith(".html")]
"""
['firearms', 'handguns', 'semi-automatic-handguns']
"""

You can access these by:

print(splittedpath[0]) # 0,1,2... 
# firearms

Here, the first character of the path, which is "/", is removed with urlparse(url).path[1:]. The path is then split at each occurrence of "/" using .split("/"), and each resulting string is kept only if it does not end with ".html".
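
The standard library's pathlib can do the splitting as well. The sketch below is an alternative to the list comprehension above, not part of the original answer; it assumes PurePosixPath.parts is acceptable for URL paths and drops the root "/" and any part ending in ".html":

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = 'https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

# For an absolute path, parts begins with the root '/'.
parts = PurePosixPath(urlparse(url).path).parts

# Keep everything except the root and any part ending in ".html".
segments = [p for p in parts if p != '/' and not p.endswith('.html')]

print(segments)  # ['firearms', 'handguns', 'semi-automatic-handguns']
```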

Answered By: imxitiz

Yes, it is possible to extract the individual path components of a URL like this using the urlparse function from Python’s urllib.parse module.

Here’s one way you can do it:

from urllib.parse import urlparse

url = 'https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'

parsed_url = urlparse(url)

path = parsed_url.path

path_components = path.split('/')

# remove the empty string at the beginning of the list
path_components = path_components[1:]

# remove the last element if it ends with '.html'
# (the emptiness check avoids an IndexError when the URL has no path)
if path_components and path_components[-1].endswith('.html'):
    path_components = path_components[:-1]

print(path_components)
# Output: ['firearms', 'handguns', 'semi-automatic-handguns']

This code first uses urlparse to parse the URL, and then splits the path component of the URL using the split method. It removes the empty string at the beginning of the list, and then removes the last element if it ends with '.html'. The resulting list will contain the individual path components of the URL.
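
The same steps can be wrapped in a function that also tolerates URLs with no path or with a trailing slash. This is only a sketch under those assumptions; the function name path_segments and its suffix parameter are hypothetical:

```python
from urllib.parse import urlparse


def path_segments(url, suffix='.html'):
    """Return the non-empty path parts of url, minus a final part ending in suffix."""
    # Keeping only non-empty pieces ignores leading, trailing and doubled "/".
    segments = [s for s in urlparse(url).path.split('/') if s]
    # Guard against an empty list before looking at the last element.
    if segments and segments[-1].endswith(suffix):
        segments = segments[:-1]
    return segments


print(path_segments('https://grabagun.com/firearms/handguns/semi-automatic-handguns/glock-19-gen-5-polished-nickel-9mm-4-02-inch-barrel-15-rounds-exclusive.html'))
# ['firearms', 'handguns', 'semi-automatic-handguns']
print(path_segments('https://grabagun.com'))            # []
print(path_segments('https://grabagun.com/firearms/'))  # ['firearms']
```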

Answered By: dsds