How to join components of a path when you are constructing a URL in Python

Question:

For example, I want to join a prefix path to resource paths like /js/foo.js.

I want the resulting path to be relative to the root of the server. In the above example if the prefix was “media” I would want the result to be /media/js/foo.js.

os.path.join does this really well, but how it joins paths is OS dependent. In this case I know I am targeting the web, not the local file system.

Is there a best alternative when you are working with paths you know will be used in URLs? Will os.path.join work well enough? Should I just roll my own?

Asked By: amjoconn

||

Answers:

You can use urllib.parse.urljoin:

>>> from urllib.parse import urljoin
>>> urljoin('/media/path/', 'js/foo.js')
'/media/path/js/foo.js'

But beware:

>>> urljoin('/media/path', 'js/foo.js')
'/media/js/foo.js'
>>> urljoin('/media/path', '/js/foo.js')
'/js/foo.js'

The reason you get different results from /js/foo.js and js/foo.js is because the former begins with a slash which signifies that it already begins at the website root.

On Python 2, you have to do

from urlparse import urljoin
Answered By: Ben James

The basejoin function in the urllib package might be what you’re looking for.

basejoin = urljoin(base, url, allow_fragments=True)
    Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter.

Edit: I didn’t notice before, but urllib.basejoin seems to map directly to urlparse.urljoin, making the latter preferred.

Answered By: mwcz

Since, from the comments the OP posted, it seems he doesn’t want to preserve “absolute URLs” in the join (which is one of the key jobs of urlparse.urljoin;-), I’d recommend avoiding that. os.path.join would also be bad, for exactly the same reason.

So, I’d use something like '/'.join(s.strip('/') for s in pieces) (if the leading / must also be ignored — if the leading piece must be special-cased, that’s also feasible of course;-).

Answered By: Alex Martelli

This does the job nicely:

def urljoin(*args):
    """
    Joins given arguments into an url. Trailing but not leading slashes are
    stripped for each argument.
    """

    return "/".join(map(lambda x: str(x).rstrip('/'), args))
Answered By: Rune Kaagaard

Like you say, os.path.join joins paths based on the current os. posixpath is the underlying module that is used on posix systems under the namespace os.path:

>>> os.path.join is posixpath.join
True
>>> posixpath.join('/media/', 'js/foo.js')
'/media/js/foo.js'

So you can just import and use posixpath.join instead for urls, which is available and will work on any platform.

Edit: @Pete’s suggestion is a good one, you can alias the import for increased readability

from posixpath import join as urljoin

Edit: I think this is made clearer, or at least helped me understand, if you look into the source of os.py (the code here is from Python 2.7.11, plus I’ve trimmed some bits). There’s conditional imports in os.py that picks which path module to use in the namespace os.path. All the underlying modules (posixpath, ntpath, os2emxpath, riscospath) that may be imported in os.py, aliased as path, are there and exist to be used on all systems. os.py is just picking one of the modules to use in the namespace os.path at run time based on the current OS.

# os.py
import sys, errno

_names = sys.builtin_module_names

if 'posix' in _names:
    # ...
    from posix import *
    # ...
    import posixpath as path
    # ...

elif 'nt' in _names:
    # ...
    from nt import *
    # ...
    import ntpath as path
    # ...

elif 'os2' in _names:
    # ...
    from os2 import *
    # ...
    if sys.version.find('EMX GCC') == -1:
        import ntpath as path
    else:
        import os2emxpath as path
        from _emx_link import link
    # ...

elif 'ce' in _names:
    # ...
    from ce import *
    # ...
    # We can use the standard Windows path.
    import ntpath as path

elif 'riscos' in _names:
    # ...
    from riscos import *
    # ...
    import riscospath as path
    # ...

else:
    raise ImportError, 'no os specific module found'
Answered By: GP89

I know this is a bit more than the OP asked for, However I had the pieces to the following url, and was looking for a simple way to join them:

>>> url = 'https://api.foo.com/orders/bartag?spamStatus=awaiting_spam&page=1&pageSize=250'

Doing some looking around:

>>> split = urlparse.urlsplit(url)
>>> split
SplitResult(scheme='https', netloc='api.foo.com', path='/orders/bartag', query='spamStatus=awaiting_spam&page=1&pageSize=250', fragment='')
>>> type(split)
<class 'urlparse.SplitResult'>
>>> dir(split)
['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_asdict', '_fields', '_make', '_replace', 'count', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'password', 'path', 'port', 'query', 'scheme', 'username']
>>> split[0]
'https'
>>> split = (split[:])
>>> type(split)
<type 'tuple'>

So in addition to the path joining which has already been answered in the other answers, To get what I was looking for I did the following:

>>> split
('https', 'api.foo.com', '/orders/bartag', 'spamStatus=awaiting_spam&page=1&pageSize=250', '')
>>> unsplit = urlparse.urlunsplit(split)
>>> unsplit
'https://api.foo.com/orders/bartag?spamStatus=awaiting_spam&page=1&pageSize=250'

According to the documentation it takes EXACTLY a 5 part tuple.

With the following tuple format:

scheme 0 URL scheme specifier empty string

netloc 1 Network location part empty string

path 2 Hierarchical path empty string

query 3 Query component empty string

fragment 4 Fragment identifier empty string

Answered By: jmunsch

To improve slightly over Alex Martelli’s response, the following will not only cleanup extra slashes but also preserve trailing (ending) slashes, which can sometimes be useful :

>>> items = ["http://www.website.com", "/api", "v2/"]
>>> url = "/".join([(u.strip("/") if index + 1 < len(items) else u.lstrip("/")) for index, u in enumerate(items)])
>>> print(url)
http://www.website.com/api/v2/

It’s not as easy to read though, and won’t cleanup multiple extra trailing slashes.

Answered By: Florent Thiery

Using furl, pip install furl it will be:

 furl.furl('/media/path/').add(path='js/foo.js')
Answered By: Vasili Pascal

Rune Kaagaard provided a great and compact solution that worked for me, I expanded on it a little:

def urljoin(*args):
    trailing_slash = '/' if args[-1].endswith('/') else ''
    return "/".join(map(lambda x: str(x).strip('/'), args)) + trailing_slash

This allows all arguments to be joined regardless of trailing and ending slashes while preserving the last slash if present.

Answered By: futuere

Using furl and regex (python 3)

>>> import re
>>> import furl
>>> p = re.compile(r'(/)+')
>>> url = furl.furl('/media/path').add(path='/js/foo.js').url
>>> url
'/media/path/js/foo.js'
>>> p.sub(r"1", url)
'/media/path/js/foo.js'
>>> url = furl.furl('/media/path').add(path='js/foo.js').url
>>> url
'/media/path/js/foo.js'
>>> p.sub(r"1", url)
'/media/path/js/foo.js'
>>> url = furl.furl('/media/path/').add(path='js/foo.js').url
>>> url
'/media/path/js/foo.js'
>>> p.sub(r"1", url)
'/media/path/js/foo.js'
>>> url = furl.furl('/media///path///').add(path='//js///foo.js').url
>>> url
'/media///path/////js///foo.js'
>>> p.sub(r"1", url)
'/media/path/js/foo.js'
Answered By: Guillaume Cisco

I found things not to like about all the above solutions, so I came up with my own. This version makes sure parts are joined with a single slash and leaves leading and trailing slashes alone. No pip install, no urllib.parse.urljoin weirdness.

In [1]: from functools import reduce

In [2]: def join_slash(a, b):
   ...:     return a.rstrip('/') + '/' + b.lstrip('/')
   ...:

In [3]: def urljoin(*args):
   ...:     return reduce(join_slash, args) if args else ''
   ...:

In [4]: parts = ['https://foo-bar.quux.net', '/foo', 'bar', '/bat/', '/quux/']

In [5]: urljoin(*parts)
Out[5]: 'https://foo-bar.quux.net/foo/bar/bat/quux/'

In [6]: urljoin('https://quux.com/', '/path', 'to/file///', '//here/')
Out[6]: 'https://quux.com/path/to/file/here/'

In [7]: urljoin()
Out[7]: ''

In [8]: urljoin('//','beware', 'of/this///')
Out[8]: '/beware/of/this///'

In [9]: urljoin('/leading', 'and/', '/trailing/', 'slash/')
Out[9]: '/leading/and/trailing/slash/'
Answered By: cbare

How about this:
It is Somewhat Efficient & Somewhat Simple.
Only need to join ‘2’ parts of url path:

def UrlJoin(a , b):
    a, b = a.strip(), b.strip()
    a = a if a.endswith('/') else a + '/'
    b = b if not b.startswith('/') else b[1:]
    return a + b

OR: More Conventional, but Not as efficient if joining only 2 url parts of a path.

def UrlJoin(*parts):
    return '/'.join([p.strip().strip('/') for p in parts])

Test Cases:

>>> UrlJoin('https://example.com/', '/TestURL_1')
'https://example.com/TestURL_1'

>>> UrlJoin('https://example.com', 'TestURL_2')
'https://example.com/TestURL_2'

Note: I may be splitting hairs here, but it is at least good practice and potentially more readable.

Answered By: Andrew

One liner:

from functools import reduce
reduce(lambda x,y: '{}/{}'.format(x,y), parts) 

where parts is e.g [‘https://api.somecompany.com/v1’, ‘weather’, ‘rain’]

Yet another variation with unique features:

def urljoin(base:str, *parts:str) -> str:
    for part in filter(None, parts):
        base = '{}/{}'.format(base.rstrip('/'), part.lstrip('/'))
    return base
  • Preserve trailing slash in base or last part
  • Empty parts are ignored
  • For each non-empty part, remove trailing from base and leading from part and join with a single /
urljoin('http://a.com/api',  '')  -> 'http://a.com/api'
urljoin('http://a.com/api',  '/') -> 'http://a.com/api/'
urljoin('http://a.com/api/', '')  -> 'http://a.com/api/'
urljoin('http://a.com/api/', '/') -> 'http://a.com/api/'
urljoin('http://a.com/api/', '/a/', '/b', 'c', 'd/') -> 'http://a.com/api/a/b/c/d/'
Answered By: MestreLion

Ok, that’s what I did, because I needed complete independence from predefined roots:

def url_join(base: str, *components: str, slash_left=True, slash_right=True) -> str:
    """Join two or more url components, inserting '/' as needed.
    Optionally, a slash can be added to the left or right side of the URL.
    """
    base = base.lstrip('/').rstrip('/')
    components = [component.lstrip('/').rstrip('/') for component in components]
    url = f"/{base}" if slash_left else base
    for component in components:
        url = f"{url}/{component}" 
    return f"{url}/" if slash_right else url

url_join("http://whoops.io", "foo/", "/bar", "foo", slash_left=False)
# "http://whoops.io/foo/bar/foo/"
url_join("foo", "bar")
# "/foo/bar/""
Answered By: Zio

Here’s a safe version, I’m using. It takes care of prefixes and trailing slashes. The trailing slash for the end URI is handled separately

def safe_urljoin(*uris) -> str:
    """
    Joins the URIs carefully considering the prefixes and trailing slashes.
    The trailing slash for the end URI is handled separately.
    """
    if len(uris) == 1:
        return uris[0]

    safe_urls = [
        f"{url.lstrip('/')}/" if not url.endswith("/") else url.lstrip("/")
        for url in uris[:-1]
    ]
    safe_urls.append(uris[-1].lstrip("/"))
    return "".join(safe_urls)

The output

>>> safe_urljoin("https://a.com/", "adunits/", "/both/", "/left")
>>> 'https://a.com/adunits/both/left'

>>> safe_urljoin("https://a.com/", "adunits/", "/both/", "right/")
>>> 'https://a.com/adunits/both/right/'

>>> safe_urljoin("https://a.com/", "adunits/", "/both/", "right/", "none")
>>> 'https://a.com/adunits/both/right/none'

>>> safe_urljoin("https://a.com/", "adunits/", "/both/", "right/", "none/")
>>> 'https://a.com/adunits/both/right/none/'
Answered By: Premkumar chalmeti
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.