Building a custom canonical URL in Python

Question:

I want to build a canonical URL for my website: my.com

Here are the requirements:

  1. always include www subdomain
  2. always use https protocol
  3. remove default 80 and 443 ports
  4. remove trailing slash

Example:

http://my.com => https://www.my.com
http://my.com/ => https://www.my.com
https://my.com:80/ => https://www.my.com
https://sub.my.com/ => https://sub.my.com
https://sub.my.com?term=t1 => https://sub.my.com?term=t1

This is what I have tried:

from urllib.parse import urlparse, urljoin

def build_canonical_url(request):
    absolute = request.build_absolute_uri(request.path)
    parsed = urlparse(absolute)

    parsed.scheme == 'https'
    if parsed.hostname.startswith('my.com'):
        parsed.hostname == 'www.my.com'
    if parsed.port == 80 or parsed.port == 443:
        parsed.port == None

    # how to join this url components?
    # canonical = join parsed.scheme, parsed.hostname, parsed.port and parsed.query

But I don't know how to join these URL components.

Asked By: Hooman Bahreini


Answers:

I have always used urllib for this kind of task, though I have never had to format a URL exactly the way you describe.

The way I see it, there are three steps:

1 – Parse the URL using urllib.parse

2 – Decompose the URL into its components

3 – Reassemble the URL, applying the desired formatting.

Code example:

from urllib.parse import urlparse

# General URL structure: scheme://netloc/path;parameters?query#fragment
o = urlparse("https://my.com:80/mypath/lalala")

print(o)
# ParseResult(scheme='https', netloc='my.com:80', path='/mypath/lalala',
#             params='', query='', fragment='')

scheme = o.scheme  # 'https'
netloc = o.netloc  # 'my.com:80'
host = o.hostname  # 'my.com' (port stripped)
path = o.path      # '/mypath/lalala'

# Rebuild the URL with the www subdomain and without the port
formatted_url = scheme + '://www.' + host + path  # 'https://www.my.com/mypath/lalala'
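The three steps above can also be sketched with urlsplit and urlunsplit, which do the joining for you. This is only a minimal sketch: the function name canonicalize, the bare_host parameter and the DEFAULT_PORTS constant are made up for illustration, and it assumes that the path should be kept (minus trailing slashes) and that the fragment can be dropped, neither of which the question states.

```python
from urllib.parse import urlsplit, urlunsplit

# Ports that should never appear in the canonical URL (requirement 3).
DEFAULT_PORTS = {80, 443}


def canonicalize(url, bare_host="my.com"):
    """Hypothetical helper: rebuild `url` in canonical form."""
    parts = urlsplit(url)

    # .hostname is already lowercased and has the port stripped.
    host = parts.hostname or ""
    if host == bare_host:
        host = "www." + bare_host  # requirement 1: always use www

    # Keep only non-default ports.
    netloc = host
    if parts.port is not None and parts.port not in DEFAULT_PORTS:
        netloc = f"{host}:{parts.port}"

    # Requirement 4: drop trailing slashes (assumption: keep the rest of the path).
    path = parts.path.rstrip("/")

    # Requirement 2: always https. The fragment is dropped (assumption).
    return urlunsplit(("https", netloc, path, parts.query, ""))


for u in ["http://my.com", "https://my.com:80/", "https://sub.my.com?term=t1"]:
    print(u, "=>", canonicalize(u))
```

Comparing the host with == rather than startswith avoids rewriting unrelated hosts that merely begin with my.com.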

For more detailed information, refer to urllib docs.

Answered By: morallito

You just need to write a simple function:

In [1]: from urllib.parse import urlparse
    ...:
    ...: def build_canonical_url(url):
    ...:     parsed = urlparse(url)
    ...:     # Map the bare domain and the www host to the canonical www host
    ...:     if parsed.hostname in ('my.com', 'www.my.com'):
    ...:         hostname = 'www.my.com'
    ...:     else:
    ...:         hostname = parsed.hostname
    ...:     # Keep the port only when it is not a default one (80 or 443)
    ...:     port = '' if parsed.port in (None, 80, 443) else parsed.port
    ...:     scheme = 'https'
    ...:     # Note: the path is not included, so any trailing slash is dropped with it
    ...:     parsed_url = f'{scheme}://{hostname}'
    ...:     if port:
    ...:         parsed_url = f'{parsed_url}:{port}'
    ...:     if parsed.query:
    ...:         parsed_url = f'{parsed_url}?{parsed.query}'
    ...:     return parsed_url
    ...: 

Execution:

In [2]: urls = ["http://my.com", "http://my.com/", "https://my.com:80/", "https://sub.my.com/", "https://sub.my.com?term=t1"]
In [3]: for url in urls:
    ...:     print(f'{url} >> {build_canonical_url(url)}')
    ...: 
http://my.com >> https://www.my.com
http://my.com/ >> https://www.my.com
https://my.com:80/ >> https://www.my.com
https://sub.my.com/ >> https://sub.my.com
https://sub.my.com?term=t1 >> https://sub.my.com?term=t1

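A few issues with your code:
parsed.scheme == 'https' is not an assignment. == is a comparison, so the expression just evaluates to True or False and the result is discarded. Switching to = would not help either: urlparse returns an immutable named tuple, so setting parsed.scheme raises AttributeError.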
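To illustrate the point about immutability, here is a small sketch. It shows the failed assignment and then uses the named tuple's _replace method together with urlunparse to get a modified copy. The sample URL and the variable names are only for illustration. Note that hostname and port are read-only properties derived from netloc, so netloc is the field to replace.

```python
from urllib.parse import urlparse, urlunparse

parsed = urlparse("http://my.com:80/?term=t1")

# Direct assignment fails because ParseResult is an immutable named tuple.
try:
    parsed.scheme = "https"
    assignment_worked = True
except AttributeError:
    assignment_worked = False

# _replace returns a new ParseResult with the given fields changed.
# Replacing netloc changes both the host and the port in one go.
updated = parsed._replace(
    scheme="https",
    netloc="www.my.com",
    path=parsed.path.rstrip("/"),
)

# urlunparse joins the six components back into a single string.
canonical = urlunparse(updated)

print(assignment_worked)
print(canonical)
```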

Answered By: Rahul K P