URL encoding/decoding with Python

Question:

I am trying to encode, store, and decode arguments in Python and am getting lost somewhere along the way. Here are my steps:

1) I use google toolkit’s gtm_stringByEscapingForURLArgument to convert an NSString properly for passing into HTTP arguments.

2) On my server (python), I store these string arguments as something like u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on an iPhone keypad in the “123” view and the “#+=” view, the \u and \x escapes in there being currency symbols like pound, yen, etc)

3) I call urllib.quote(myString,'') on that stored value, presumably to %-escape them for transport to the client so the client can unpercent escape them.

The result is that I am getting an exception when I try to log the result of % escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value with the u and x format in order to properly convert it for sending over http?

Update: The suggestion marked as the answer below worked for me. I am providing some updates to address the comments below to be complete, though.

The exception I received cited an issue with \u20ac. I don’t know if it was a problem with that specifically, rather than the fact that it was the first unicode character in the string.

That \u20ac char is the unicode for the ‘euro’ symbol. I basically found I’d have issues with it unless I used the urllib2 quote method.

Asked By: Joey


Answers:

You are out of luck with the stdlib: urllib.quote doesn’t work with unicode. If you are using Django, you can use django.utils.http.urlquote, which works properly with unicode.

Answered By: almir karic

URL-encoding a “raw” unicode string doesn’t really make sense. What you need to do is .encode("utf8") first so you have a known byte encoding, and then quote() that.

The output isn’t very pretty but it should be a correct uri encoding.

>>> s = u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'

Remember that you will need to both unquote() and decode() this to print it out properly if you’re debugging or whatever.

>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>â‚¬Â£Â¥â€¢.,?!'
>>> # oops, nasty Â means we've got a utf8 byte stream being treated as a single-byte (e.g. latin-1) stream
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
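
For readers on Python 3, where quote and unquote live in urllib.parse, the same round trip can be sketched as below. This is only an illustration and not part of the original answer: the variable names are made up, and the string is cut down to the four non-ASCII characters.

```python
from urllib.parse import quote, unquote_to_bytes

s = '\u20ac\xa3\xa5\u2022'  # euro, pound, yen, bullet

# Step 1: choose a byte encoding, then percent-escape those bytes.
escaped = quote(s.encode('utf-8'), safe='')

# Step 2: undo the percent-escaping. The result is raw bytes, not text.
raw = unquote_to_bytes(escaped)

# Step 3: decode the bytes with the same encoding used in step 1.
restored = raw.decode('utf-8')

# Decoding with a different encoding is what produces garbled output.
garbled = raw.decode('latin-1')

print(escaped)
print(restored == s, garbled == s)
```

The point is the same as in the Python 2 session: unquoting and decoding are two separate steps, and the decode has to match the encode.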

This is, in fact, what the django functions mentioned in another answer do.

The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python’s standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)
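
As a rough idea of what “converted to UTF-8 prior to encoding” amounts to, here is a minimal sketch written for Python 3. The helper name quote_utf8 is hypothetical and this is not Django’s actual implementation; it only shows the encode-then-quote order described above.

```python
from urllib.parse import quote


def quote_utf8(text, safe='/'):
    """Hypothetical helper: UTF-8 encode the text, then percent-escape the bytes."""
    return quote(text.encode('utf-8'), safe=safe)


# Euro sign, a space, and a digit.
print(quote_utf8('\u20ac 5'))
```

With the default safe='/', slashes are left alone; passing safe='' escapes them too, which is usually what you want for a single query argument.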

Be careful if you are applying any further quotes or encodings not to mangle things.
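
One common way to mangle things is quoting a value that is already quoted. A small Python 3 sketch of that failure mode follows (the variable names are illustrative only):

```python
from urllib.parse import quote, unquote

once = quote('\xa3'.encode('utf-8'), safe='')  # pound sign, escaped once
twice = quote(once, safe='')                   # the '%' signs get escaped too

print(once)
print(twice)

# A single unquote of the doubly quoted value only gets back to the
# singly quoted form; it takes two passes to recover the character.
print(unquote(twice))
print(unquote(unquote(twice)))
```

Each layer of quoting needs exactly one matching unquote, so it helps to be clear about which layer (client, framework, your own code) applies each one.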

Answered By: pycruft

i want to second pycruft’s remark. web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. now URLs happen to be explicitly not defined for characters, but only for bytes (octets). as a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect an encoding to be present. however, there is a convention to prefer latin-1 and utf-8 over other encodings here. for a while, it looked like ‘unicode percent escapes’ would be the future, but they never caught on.
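
to make the “bytes, not characters” point concrete, here is a small Python 3 sketch (not from the original answer; the names are illustrative) showing that one and the same character ends up as different percent-escapes depending on which byte encoding is chosen first:

```python
from urllib.parse import quote

pound = '\xa3'  # the pound sign

as_latin1 = quote(pound.encode('latin-1'))  # one byte, one escape
as_utf8 = quote(pound.encode('utf-8'))      # two bytes, two escapes

print(as_latin1)
print(as_utf8)
```

nothing in the resulting URL says which of the two encodings was used, which is why the receiver can only assume one.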

it is of paramount importance to be pedantically picky in this area about the difference between unicode objects and octet strings (unicode and str in Python < 3.0; confusingly, str and bytes/bytearray in Python >= 3.0). unfortunately, in my experience it is, for a number of reasons, pretty difficult to cleanly separate the two concepts in Python 2.x.

even more OT, when you want to receive third-party HTTP requests, you cannot rely absolutely on URLs being sent as percent-escaped, utf-8-encoded octets: there may be the occasional %uxxxx escape in there, and at least firefox 2.x used to encode URLs as latin-1 where possible, and as utf-8 only where necessary.
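
one defensive way to cope with this on the receiving side, sketched for Python 3 under the assumption that only utf-8 and latin-1 need to be considered: try utf-8 first and fall back to latin-1. the function name decode_url_component is hypothetical, and the sketch deliberately leaves %uxxxx escapes untouched.

```python
from urllib.parse import unquote_to_bytes


def decode_url_component(component):
    """Hypothetical helper: unquote, then guess utf-8 before latin-1."""
    raw = unquote_to_bytes(component)
    try:
        # Most non-utf-8 byte sequences fail strict utf-8 decoding.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode('latin-1')


print(decode_url_component('%C2%A3'))  # utf-8 encoded pound sign
print(decode_url_component('%A3'))     # latin-1 encoded pound sign
```

this is a heuristic, not a guarantee: a latin-1 byte sequence that happens to be valid utf-8 will be read as utf-8.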

Answered By: flow