Strange character added when decoding with urllib
Question:
I’m trying to parse a query string like this:
filename=logo.txt\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01x&filename=.hidden.txt
Since it mixes escaped bytes and text, I tried to convert it so that it produces the desired percent-escaped URL output, like so:
import urllib.parse
extended = 'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'
fixbytes = bytes(extended, 'utf-8')
print(fixbytes)
fixbytes = fixbytes.decode("unicode_escape")
print(fixbytes)
algoext = '?' + urllib.parse.quote(fixbytes, safe='?&=')
print(algoext)
This outputs
b'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'
filename=logo.txtx&filename=.hidden.txt
?filename=logo.txt%C2%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%01x&filename=.hidden.txt
Where does the %C2 byte come from? It’s not in the source string and it’s not in any of the intermediate steps. What could I do other than manually remove it from the final output string?
P.S. I’m relying on a library to generate the string so changing the way it’s represented initially is not an option.
Answers:
As the docs for urllib.parse.quote say:
Note that quote(string, safe, encoding, errors) is equivalent to quote_from_bytes(string.encode(encoding, errors), safe).
Here, encoding defaults to UTF-8, and the UTF-8 encoding of '\x80' is:
>>> '\x80'.encode('utf-8')
b'\xc2\x80'
So it’s correct that the %C2 is there. You shouldn’t remove it.
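To see the equivalence from the docs in action, here is a minimal sketch using only the standard library. It assumes nothing beyond `urllib.parse`; the variable names are just for illustration.

```python
import urllib.parse

ch = '\x80'  # one code point, U+0080

# quote() on a str encodes it first; the default encoding is UTF-8.
via_quote = urllib.parse.quote(ch)

# The documented equivalent: encode explicitly, then escape the bytes.
encoded = ch.encode('utf-8')
via_bytes = urllib.parse.quote_from_bytes(encoded)

print(encoded)    # b'\xc2\x80'
print(via_quote)  # %C2%80
print(via_bytes)  # %C2%80
```

Both calls escape the same two bytes, which is why %C2 shows up even though the string holds a single character.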
Maybe this is what you want:
>>> import urllib.parse
>>> extended = 'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'
>>> fixbytes = bytes(extended, 'utf-8')
>>> fixbytes = fixbytes.decode("unicode_escape")
>>> fixbytes = fixbytes.encode("latin-1")
>>> fixbytes
b'filename=logo.txt\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01x&filename=.hidden.txt'
>>> algoext = '?' + urllib.parse.quote(fixbytes, safe='?&=')
>>> algoext
'?filename=logo.txt%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%01x&filename=.hidden.txt'
Latin-1 is a legacy encoding that maps the codepoints 0-255 to the bytes 0-255. But really: If this is what you need, you should fix both whatever arcane process produced your mojibake in the first place AND the server that doesn’t accept UTF-8 in 2022.
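The Latin-1 mapping can be checked directly. The sketch below also shows that, because of this mapping, quote() can probably do the encode step itself if you pass encoding='latin-1'. The sample string is a shortened, hypothetical stand-in for the one in the question.

```python
import urllib.parse

# Latin-1 maps code point N to byte N for every N in 0..255.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')
round_trip_ok = (
    [ord(c) for c in as_text] == list(range(256))
    and as_text.encode('latin-1') == all_bytes
)

# Shortened sample: already-decoded text containing U+0080, U+0000, U+0001.
s = 'logo.txt\x80\x00\x01x'

# Let quote() encode as Latin-1 instead of the default UTF-8 ...
one_step = urllib.parse.quote(s, safe='?&=', encoding='latin-1')
# ... or encode first and pass bytes, as in the session above.
two_step = urllib.parse.quote(s.encode('latin-1'), safe='?&=')

print(round_trip_ok)  # True
print(one_step)       # logo.txt%80%00%01x
print(two_step)       # logo.txt%80%00%01x
```

Note that this only works while every character is at or below U+00FF; anything higher makes the Latin-1 encode step raise UnicodeEncodeError.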
This also achieves my goal:
querystring = '?' + extended.replace('\\x', '%')
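A rough comparison of the two approaches, as a sketch: the replace shortcut is a purely textual substitution, so it matches the quote-based route only when every byte that needs escaping already arrives as a \xNN escape. The function names and the second input (with a space) are hypothetical, chosen to show where the results diverge.

```python
import urllib.parse

def via_replace(text: str) -> str:
    # Textual substitution only: '\xNN' becomes '%NN', nothing else changes.
    return '?' + text.replace('\\x', '%')

def via_quote(text: str) -> str:
    # Decode the literal escapes, then percent-escape every byte that needs it.
    decoded = text.encode('ascii').decode('unicode_escape')
    return '?' + urllib.parse.quote(decoded, safe='?&=', encoding='latin-1')

same = r'filename=logo.txt\x80\x01x'
differs = r'filename=my logo.txt\x80'  # hypothetical input containing a space

print(via_replace(same) == via_quote(same))  # True
print(via_replace(differs))  # ?filename=my logo.txt%80
print(via_quote(differs))    # ?filename=my%20logo.txt%80
```

The shortcut also keeps the hex digits in whatever case the source uses, while quote() always emits uppercase; both forms are valid percent-escapes.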