Which should I be using: urlparse or urlsplit?

Question:

Which URL parsing function pair should I be using and why?

Asked By: Matt Joiner


Answers:

Directly from the docs you linked yourself:

urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)
This is similar to urlparse(), but does not split the params from the URL. This should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted.
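A minimal sketch of the difference that doc entry describes, using a made-up example.com URL (an assumption for illustration, not taken from the docs): urlsplit leaves the ;params text inside path, while urlparse moves it into a separate params field.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

# Hypothetical URL chosen for illustration only.
url = "http://example.com/pa/th;param1=foo?name=val#frag"

split_result = urlsplit(url)   # 5 fields, no params
parse_result = urlparse(url)   # 6 fields, params split out

print(split_result.path)    # the ";param1=foo" text stays in the path
print(parse_result.path)    # path with the trailing params removed
print(parse_result.params)  # the text after the ";" in the last segment

# Both results reassemble to the original string for this URL.
assert urlunsplit(split_result) == url
assert parse_result.geturl() == url
```

If you never look at params, the two functions give you the same scheme, netloc, query and fragment, so urlsplit is usually the simpler choice.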

Answered By: Sven Marnach

As the documentation says (the module is urlparse in Python 2 and urllib.parse in Python 3):
urlparse.urlparse returns a 6-tuple (it includes the additional params element)
urlparse.urlsplit returns a 5-tuple

Attribute | Index | Value                            | Value if not present
params    | 3     | Parameters for last path element | empty string
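The table row can be checked with a quick sketch. The URL below is hypothetical; the field names come from the named tuples that urllib.parse returns in Python 3.

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/path;p=1?q=2"  # hypothetical URL

parsed = urlparse(url)
split = urlsplit(url)

print(len(parsed), len(split))      # number of tuple elements in each result
print(parsed[3] == parsed.params)   # index 3 and .params are the same field
print(hasattr(split, "params"))     # SplitResult has no params attribute

# When the URL has no ";params" part, the field is an empty string.
no_params = urlparse("http://example.com/path").params
print(repr(no_params))
```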

FYI: [RFC 2396](https://www.rfc-editor.org/rfc/rfc2396.html#appendix-C) explains why _parameters_ were removed as a separate component of the URL specification:
> Extensive testing of current client applications demonstrated that
the majority of deployed systems do not use the “;” character to
indicate trailing parameter information, and that the presence of a
semicolon in a path segment does not affect the relative parsing of
that segment. Therefore, parameters have been removed as a separate
component and may now appear in any path segment. Their influence
has been removed from the algorithm for resolving a relative URI
reference.

Answered By: Jiahao D.

Since the documentation you linked doesn’t include an example with non-empty params, I was also confused until I found this:

>>> import urllib.parse
>>> urllib.parse.urlparse("http://example.com/pa/th;param1=foo;param2=bar?name=val#frag")
ParseResult(scheme='http', netloc='example.com', path='/pa/th', params='param1=foo;param2=bar', query='name=val', fragment='frag')

(Some history because I got nerd-sniped.)

I’d never heard of URL "parameters" in this sense, as opposed to path parameters (e.g. /user/213/settings) or query parameters (e.g. /user?id=213), and I think the feature is essentially obsolete.

In the beginning, RFC 1738 defined the HTTP URL to never allow ; in the path:

http://<host>:<port>/<path>?<searchpart>

Within the <path> and <searchpart> components, "/", ";", "?" are reserved.

; was reserved with special meaning in other schemes, like the ftp:// url-path:

<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>

Apparently in 1995, RFC 1808 defined URL params as a top-level component between path and query:

<scheme>://<net_loc>/<path>;<params>?<query>#<fragment>

Then in 1998, RFC 2396 defined URIs as having adjacent top-level components path and query:

<scheme>://<authority><path>?<query>

where the path is defined as multiple path_segments that each could include param:

path          = [ abs_path | opaque_part ]
abs_path      = "/"  path_segments
path_segments = segment *( "/" segment )
segment       = *pchar *( ";" param )
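The grammar above allows a ;param on any segment, but urllib.parse.urlparse only separates the parameters attached to the last path segment. A small sketch with a made-up URL (an assumption for illustration):

```python
from urllib.parse import urlparse

# Hypothetical URL with params on two different segments.
result = urlparse("http://example.com/a;x=1/b;y=2")

print(result.path)    # params on the earlier segment remain in the path
print(result.params)  # only the last segment's params are split out
```

So even urlparse does not give you RFC 2396 style per-segment parameters; it implements the older RFC 1808 layout.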

Finally in 2005, RFC 3986 obsoleted RFC 1808 and 2396, defining URI similarly to RFC 2396:

URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ] 

hier-part   = "//" authority path-abempty
            / path-absolute
            / path-rootless
            / path-empty

The special ;params syntax is now treated as an opaque part of a path segment, whose meaning may be specific to a scheme (such as HTTP(S)) or just to some particular implementation:

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI’s dereferencing algorithm.
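If you do need per-segment parameters of the kind the quoted passage describes, you have to parse them yourself. Here is a rough sketch built on urlsplit; segment_params is a hypothetical helper, it assumes the ";key=value" convention (which the generic syntax does not mandate), assumes an absolute path, and does no percent-decoding.

```python
from urllib.parse import urlsplit

def segment_params(url):
    """Return (segment_name, {param: value}) pairs, one per path segment.

    Hypothetical helper: assumes the ";key=value" convention and a path
    that starts with "/". No percent-decoding is performed.
    """
    path = urlsplit(url).path
    result = []
    for segment in path.split("/")[1:]:  # skip the empty string before the leading "/"
        name, *raw_params = segment.split(";")
        params = {}
        for item in raw_params:
            key, _, value = item.partition("=")
            params[key] = value
        result.append((name, params))
    return result

# Made-up URL modelled on the "name;v=1.1" example from the RFC text.
parsed_segments = segment_params("http://example.com/name;v=1.1/file;type=a;lang=en")
print(parsed_segments)
```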

Answered By: Carl Walsh