Url decode UTF-8 in Python

Question:

In Python 2.7, given a URL like example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0, how can I decode it to the expected result, example.com?title=правовая+защита?

I tried url=urllib.unquote(url.encode("utf8")), but it seems to give a wrong result.

Asked By: swordholder


Answers:

The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode it with urllib.parse.unquote(), which transparently handles both steps, from percent-encoded data to UTF-8 bytes and then from bytes to text:

from urllib.parse import unquote

url = unquote(url)

Demo:

>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
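Note that the + in the demo output is left untouched. If the query string uses + to stand for a space (the usual form-encoding convention, which is an assumption about where this URL came from), a minimal sketch of the alternatives in urllib.parse looks like this:

```python
from urllib.parse import unquote, unquote_plus, parse_qs

url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'

# unquote() decodes the percent escapes but keeps '+' as a literal plus sign.
kept_plus = unquote(url)

# unquote_plus() additionally turns '+' into a space.
as_space = unquote_plus(url)

# parse_qs() works on the query string only (the part after '?') and
# returns a dict mapping each parameter name to a list of decoded values.
params = parse_qs(url.partition('?')[2])

print(kept_plus)
print(as_space)
print(params)
```

Which one is right depends on whether the + was meant as a space or as a literal plus sign.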

The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you’d have to decode manually:

from urllib import unquote

url = unquote(url).decode('utf8')
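
If the same code has to run on both Python 2 and Python 3, the two variants can be wrapped in one helper. This is only a sketch; the name unquote_utf8 is made up for illustration and is not part of any library:

```python
import sys

if sys.version_info[0] >= 3:
    from urllib.parse import unquote

    def unquote_utf8(url):
        # Python 3: unquote() decodes the escaped bytes as UTF-8 by default.
        return unquote(url)
else:
    from urllib import unquote

    def unquote_utf8(url):
        # Python 2: unquote a byte string, then decode the result manually.
        if isinstance(url, unicode):
            url = url.encode('utf8')
        return unquote(url).decode('utf8')

print(unquote_utf8('example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE'))
```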
Answered By: Martijn Pieters

If you are using Python 3, you can use urllib.parse.unquote:

url = """example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0"""

import urllib.parse
urllib.parse.unquote(url)

gives:

'example.com?title=правовая+защита'
Answered By: pavan

You can get the expected result with the requests library as well:

import requests

url = "http://www.mywebsite.org/Data%20Set.zip"

print(f"Before: {url}")
print(f"After:  {requests.utils.unquote(url)}")

Output:

$ python3 test_url_unquote.py

Before: http://www.mywebsite.org/Data%20Set.zip
After:  http://www.mywebsite.org/Data Set.zip

This might be handy if you are already using requests and would rather not add another import for this job.

Answered By: ivanleoncz

URLs embedded in HTML can also contain HTML entities (such as &amp;).
The following replaces those as well as the percent escapes.

# from urllib import unquote  # Python 2
from urllib.parse import unquote
from html import unescape
unescape(unquote('https://v.w.xy/p1/p22?userId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&confirmationToken=7uAf%2fxJoxRTFAZdxslCn2uwVR9vV7cYrlHs%2fl9sU%2frix9f9CnVx8uUT%2bu8y1%2fWCs99INKDnfA2ayhGP1ZD0z%2bodXjK9xL5I4gjKR2xp7p8Sckvb04mddf%2fiG75QYiRevgqdMnvd9N5VZp2ksBc83lDg7%2fgxqIwktteSI9RA3Ux9VIiNxx%2fZLe9dZSHxRq9AA'))
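
The URL above happens to contain no entities, so here is a smaller sketch where both layers are visible. The href value is a made-up example, not taken from a real page:

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical href as it might appear inside an HTML attribute:
# percent escapes for the Cyrillic text, plus '&amp;' for the '&' separator.
href = 'https://example.com/search?q=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE&amp;page=2'

# unquote() resolves the percent escapes; unescape() resolves the entity.
decoded = unescape(unquote(href))
print(decoded)
```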
Answered By: Roland Puntaier

I know this is an old question, but I stumbled upon it via a Google search and found that no one has proposed a solution that uses only built-in features, with no imports at all.

So I quickly wrote my own.

Basically, a URL string can only contain these characters: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, and =; everything else is URL-encoded.

URL encoding is pretty straightforward: each byte of the character's UTF-8 encoding is written as a percent sign followed by the two hexadecimal digits of that byte's value.

So a simple while loop over the characters is enough. If the character is not a percent sign, add its byte(s) as is and increment the index by one; otherwise, add the byte given by the two hex digits after the percent sign and increment the index by three. Accumulating the bytes and decoding them as UTF-8 gives the result.
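
To make the encoding direction concrete, here is a small sketch that builds the escape for one character by hand and compares it with what urllib.parse.quote produces. The variable names are only for illustration:

```python
from urllib.parse import quote

ch = 'п'                      # Cyrillic letter, not allowed literally in a URL
raw = ch.encode('utf8')       # its UTF-8 encoding is two bytes: 0xD0 0xBF

# One '%XX' group per byte, using uppercase hexadecimal digits.
escaped = ''.join('%{:02X}'.format(b) for b in raw)

print(escaped)
print(escaped == quote(ch))
```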

Here is the code:

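def url_parse(url):
    """Percent-decode a URL string, reading the escaped bytes as UTF-8."""
    length = len(url)
    data = bytearray()
    i = 0
    while i < length:
        if url[i] != '%':
            # Literal character: keep its UTF-8 bytes, so a non-ASCII
            # character in the input does not overflow a single byte.
            data.extend(url[i].encode('utf8'))
            i += 1
        else:
            # Percent escape: the next two hex digits are one byte value.
            data.append(int(url[i+1:i+3], 16))
            i += 3

    return data.decode('utf8')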

I have tested it on well-formed input and it works; note that an invalid escape such as %zz raises ValueError.
