Decode escaped characters in URL
Question:
I have a list of URLs that contain escaped characters. The escapes were introduced by urllib2.urlopen when it retrieved the HTML page:
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh
Is there a way to transform them back to their unescaped form in Python?
P.S.: The URLs are encoded in UTF-8.
Answers:
Using the urllib package (import urllib):
Python 2.7
From the official documentation:
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
Python 3
From the official documentation:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
[…]
Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.
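Applied to the URLs from the question, a minimal Python 3 sketch could look like this. The urls list is simply the three sample URLs copied from the question, and the variable names are my own choice:

```python
from urllib.parse import unquote

# The three sample URLs from the question.
urls = [
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh",
]

# unquote() decodes %xx escapes as UTF-8 by default (encoding='utf-8').
decoded = [unquote(u) for u in urls]

for u in decoded:
    print(u)
```

Keep in mind that unquote decodes the whole string, so an escaped & or = inside a parameter value becomes indistinguishable from a separator afterwards. If you need the individual parameters, split the query string before decoding.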
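You can use urllib.unquote, or a hand-rolled regex equivalent for Python 2 byte strings:
import re
def unquote(url):
    # Replace each %xx escape with the single byte whose hex value is xx.
    # On Python 3, chr() returns a code point rather than a byte, so
    # multi-byte UTF-8 sequences such as %E9%A6%96 come out garbled.
    return re.sub(r'%([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), url)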
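If you want the same regex idea on Python 3, the substitution has to happen on bytes so that multi-byte UTF-8 sequences are reassembled before decoding. Here is a rough sketch, assuming the input is UTF-8 as stated in the question. The name unquote_utf8 is hypothetical, and in practice urllib.parse.unquote already does this for you:

```python
import re

# Matches one %xx escape in a bytes object.
_ESCAPE = re.compile(rb"%([0-9a-fA-F]{2})")

def unquote_utf8(url: str) -> str:
    """Hypothetical helper: decode %xx escapes byte-wise, then decode as UTF-8."""
    raw = url.encode("utf-8")
    # Replace each escape with the single byte it represents.
    unescaped = _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    # Invalid sequences become U+FFFD instead of raising.
    return unescaped.decode("utf-8", errors="replace")

print(unquote_utf8("/El%20Ni%C3%B1o/"))
```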
or urllib.unquote_plus, which also converts + to a space:
>>> import urllib
>>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)'
>>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte membrane protein 1, PfEMP1 (VAR)'
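On Python 3 both functions live in urllib.parse. The following sketch repeats the comparison above with the same input string; the variable names are only for illustration:

```python
from urllib.parse import unquote, unquote_plus

s = "erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29"

# unquote() only decodes %xx escapes and leaves '+' untouched.
plain = unquote(s)

# unquote_plus() additionally turns '+' into a space, as used in
# HTML form (application/x-www-form-urlencoded) data.
plus = unquote_plus(s)

print(plain)
print(plus)
```

As far as I can tell, unquote_plus is the right choice for query strings produced by form submissions, while unquote is safer for the path part of a URL, where + is a literal plus sign.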
And if you are using Python 3, you could use:
import urllib.parse
urllib.parse.unquote(url)