urllib.urlencode doesn't like unicode values: how about this workaround?

Question:

If I have an object like:

d = {'a':1, 'en': 'hello'}

…then I can pass it to urllib.urlencode, no problem:

from urllib import urlencode
percent_escaped = urlencode(d)
print percent_escaped

But if I try to pass an object with a value of type unicode, game over:

d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(d2)
print percent_escaped # This fails with a UnicodeEncodeError

So my question is about a reliable way to prepare an object to be passed to urlencode.

I came up with this function where I simply iterate through the object and encode values of type string or unicode:

def encode_object(object):
  for k,v in object.items():
    if type(v) in (str, unicode):
      object[k] = v.encode('utf-8')
  return object

This seems to work:

d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
percent_escaped = urlencode(encode_object(d2))
print percent_escaped

And that outputs a=1&en=hello&pt=ol%C3%A1, ready for passing to a POST call or whatever.

But my encode_object function just looks really shaky to me. For one thing, it doesn’t handle nested objects.

For another, I’m nervous about that if statement. Are there any other types that I should be taking into account?

And is comparing the type() of something to the native object like this good practice?

type(v) in (str, unicode) # not so sure about this...

Thanks!

Asked By: user18015


Answers:

It seems that you can’t pass a Unicode object to urlencode, so, before calling it, you should encode every unicode object parameter. How you do this in a proper way seems to me very dependent on the context, but in your code you should always be aware of when to use the unicode python object (the unicode representation) and when to use the encoded object (bytestring).

Also, encoding the str values is “superfluous”; see the related question “What is the difference between encode/decode?”

Answered By: Javier

You should indeed be nervous. The whole idea that you might have a mixture of bytes and text in some data structure is horrifying. It violates the fundamental principle of working with string data: decode at input time, work exclusively in unicode, encode at output time.
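As a rough illustration of the decode, work, encode principle, here is a minimal sketch. It is written for Python 3 (an assumption; the thread itself uses Python 2), where bytes and text are separate types. The function name handle and the strip/lower processing step are hypothetical and only stand in for whatever the program really does with the text.

```python
# Python 3 sketch: bytes arrive, text is used internally, bytes leave.
from urllib.parse import urlencode

def handle(raw_value):
    # 1. Decode at input time (assumes the incoming bytes are UTF-8).
    text = raw_value.decode('utf-8')
    # 2. Work exclusively with text (placeholder processing step).
    params = {'pt': text.strip().lower()}
    # 3. Encode at output time: urlencode percent-encodes the UTF-8 bytes,
    #    and the resulting ASCII-only string is turned into a byte string.
    return urlencode(params, encoding='utf-8').encode('ascii')

print(handle(b' Ol\xc3\xa1 '))
```

The point of the sketch is that there is exactly one place where bytes become text and one place where text becomes bytes, so a mixture never ends up in the same dict.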

Update in response to comment:

You are about to output some sort of HTTP request. This needs to be prepared as a byte string. The fact that urllib.urlencode is not capable of properly preparing that byte string if there are unicode characters with ordinal >= 128 in your dict is indeed unfortunate. If you have a mixture of byte strings and unicode strings in your dict, you need to be careful. Let’s examine just what urlencode() does:

>>> import urllib
>>> tests = ['\x80', '\xe2\x82\xac', 1, '1', u'1', u'\x80', u'\u20ac']
>>> for test in tests:
...     print repr(test), repr(urllib.urlencode({'a':test}))
...
'\x80' 'a=%80'
'\xe2\x82\xac' 'a=%E2%82%AC'
1 'a=1'
'1' 'a=1'
u'1' 'a=1'
u'\x80'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\python27\lib\urllib.py", line 1282, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)

The last two tests demonstrate the problem with urlencode(). Now let’s look at the str tests.

If you insist on having a mixture, then you should at the very least ensure that the str objects are encoded in UTF-8.

'\x80' is suspicious: it is not the result of any_valid_unicode_string.encode('utf8').
'\xe2\x82\xac' is OK; it's the result of u'\u20ac'.encode('utf8').
'1' is OK: all ASCII characters are OK on input to urlencode(), which will percent-encode characters such as '%' if necessary.
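The three claims above can be checked directly. This is a small sketch in Python 3 (an assumption; there the b'...' literals play the role of Python 2 str values), and the helper name is_valid_utf8 is made up for illustration.

```python
# Python 3 sketch: check whether a byte string is valid UTF-8.
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b'\x80'))          # a lone continuation byte
print(is_valid_utf8(b'\xe2\x82\xac'))  # the UTF-8 encoding of U+20AC
print(is_valid_utf8(b'1'))             # plain ASCII is always valid UTF-8
print('\u20ac'.encode('utf-8') == b'\xe2\x82\xac')
```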

Here’s a suggested converter function. Unlike yours, it doesn’t mutate the input dict and then return it; it returns a new dict. It forces an exception if a value is a str object but is not a valid UTF-8 string. By the way, your concern about it not handling nested objects is a little misdirected: your code works only with dicts, and the concept of nested dicts doesn’t really fly as input to urlencode().

def encoded_dict(in_dict):
    out_dict = {}
    for k, v in in_dict.iteritems():
        if isinstance(v, unicode):
            v = v.encode('utf8')
        elif isinstance(v, str):
            # Must be encoded in UTF-8
            v.decode('utf8')
        out_dict[k] = v
    return out_dict

and here’s the output, using the same tests in reverse order (because the nasty one is at the front this time):

>>> for test in tests[::-1]:
...     print repr(test), repr(urllib.urlencode(encoded_dict({'a':test})))
...
u'\u20ac' 'a=%E2%82%AC'
u'\x80' 'a=%C2%80'
u'1' 'a=1'
'1' 'a=1'
1 'a=1'
'\xe2\x82\xac' 'a=%E2%82%AC'
'\x80'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 8, in encoded_dict
  File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
>>>

Does that help?

Answered By: John Machin

Nothing new to add except to point out that the urlencode algorithm is nothing tricky.
Rather than processing your data once and then calling urlencode on it, it would be perfectly fine to do something like:

from urllib import quote_plus

def urlencode_utf8(params):
    if hasattr(params, 'items'):
        params = params.items()
    return '&'.join(
        (quote_plus(k.encode('utf8'), safe='/') + '=' + quote_plus(v.encode('utf8'), safe='/')
            for k, v in params))

Looking at the source code for the urllib module (Python 2.6), their implementation does not do much more. There is an optional feature where values in the parameters that are themselves 2-tuples are turned into separate key-value pairs, which is sometimes useful, but if you know you won’t need that, the above will do.

You can even get rid of the if hasattr(params, 'items'): check if you know you won’t need to handle lists of 2-tuples as well as dicts.
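The optional feature mentioned above appears to correspond to the doseq flag of the standard urlencode, which expands a sequence value into one key=value pair per element. The sketch below uses Python 3's urllib.parse (an assumption; the answer was written against Python 2.6), and the wrapper name build_query is hypothetical.

```python
# Python 3 sketch of the doseq behaviour of the standard urlencode.
from urllib.parse import urlencode

def build_query(pairs):
    # doseq=True turns ('tag', ['a', 'b']) into tag=a&tag=b instead of
    # percent-encoding the string representation of the whole list.
    return urlencode(pairs, doseq=True)

print(build_query([('tag', ['a', 'b']), ('q', 'olá')]))
```

If your parameters never contain sequence values, the simpler hand-rolled join shown earlier does not need this.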

Answered By: ejm

Why such long answers?

urlencode(unicode_string.encode('utf-8'))

Answered By: Pavel Vlasov

Seems like it is a wider topic than it looks, especially when you have to deal with more complex dictionary values. I found 3 ways of solving the problem:

  1. Patch urllib.py to include encoding parameter:

    def urlencode(query, doseq=0, encoding='ascii'):
    

    and replace all str(v) conversions with something like v.encode(encoding)

    Obviously not good, since it’s hardly redistributable and even harder to maintain.

  2. Change the default Python encoding (the approach is described in a blog post). The author of that post describes some problems with this solution quite clearly, and who knows how many more could be lurking in the shadows. So it doesn’t look good to me either.

  3. So I, personally, ended up with this abomination, which encodes all unicode strings to UTF-8 byte strings in any (reasonably) complex structure:

    def encode_obj(in_obj):
    
        def encode_list(in_list):
            out_list = []
            for el in in_list:
                out_list.append(encode_obj(el))
            return out_list
    
        def encode_dict(in_dict):
            out_dict = {}
            for k, v in in_dict.iteritems():
                out_dict[k] = encode_obj(v)
            return out_dict
    
        if isinstance(in_obj, unicode):
            return in_obj.encode('utf-8')
        elif isinstance(in_obj, list):
            return encode_list(in_obj)
        elif isinstance(in_obj, tuple):
            return tuple(encode_list(in_obj))
        elif isinstance(in_obj, dict):
            return encode_dict(in_obj)
    
        return in_obj
    

    You can use it like this: urllib.urlencode(encode_obj(complex_dictionary))

    To encode keys also, out_dict[k] can be replaced with out_dict[k.encode('utf-8')], but it was a bit too much for me.
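For comparison, here is a compact variant of the same recursive idea that also encodes dict keys. It is only a sketch and assumes Python 3, where text is str and the encoded result is bytes; the name encode_obj_with_keys is made up for this example and is not part of the answer above.

```python
# Python 3 sketch: recursively encode text to UTF-8 bytes, keys included.
from urllib.parse import urlencode

def encode_obj_with_keys(obj):
    if isinstance(obj, str):
        return obj.encode('utf-8')
    if isinstance(obj, list):
        return [encode_obj_with_keys(el) for el in obj]
    if isinstance(obj, tuple):
        return tuple(encode_obj_with_keys(el) for el in obj)
    if isinstance(obj, dict):
        # Keys go through the same conversion as values.
        return {encode_obj_with_keys(k): encode_obj_with_keys(v)
                for k, v in obj.items()}
    # Numbers, None, bytes and anything else pass through unchanged.
    return obj

print(encode_obj_with_keys({'pt': ['olá', 1]}))
print(urlencode(encode_obj_with_keys({'pt': 'olá'})))
```

Note that in Python 3 urlencode accepts text directly, so this conversion is mainly of interest when bytes are needed for some other reason.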

Answered By: ogurets

I solved it with this add_get_to_url() method:

import urllib

def add_get_to_url(url, get):
    return '%s?%s' % (url, urllib.urlencode(list(encode_dict_to_bytes(get))))

def encode_dict_to_bytes(query):
    if hasattr(query, 'items'):
        query = query.items()
    for key, value in query:
        yield (encode_value_to_bytes(key), encode_value_to_bytes(value))

def encode_value_to_bytes(value):
    if not isinstance(value, unicode):
        return str(value)
    return value.encode('utf8')

Features:

  • “get” can be a dict or a list of (key, value) pairs
  • Order is not lost
  • values can be integers or other simple datatypes.

Feedback welcome.

Answered By: guettli

I had the same problem with German “Umlaute”.
The solution is pretty simple:

In Python 3, urlencode lets you specify the encoding:

from urllib.parse import urlencode
args = {'a':1, 'en': 'hello', 'pt': u'olá'}
urlencode(args, encoding='utf-8')

>>> 'a=1&en=hello&pt=ol%C3%A1'
Answered By: Saskia Vola

This one-liner works fine in my case:

urllib.quote(unicode_string.encode('utf-8'))

Thanks, @IanCleland and @PavelVlasov.

Answered By: fredy kardian