Strange character added when decoding with urllib

Question:

I’m trying to parse a query string like this:
filename=logo.txt\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01x&filename=.hidden.txt

Since it mixes escaped bytes and text, I tried to convert it so that it produces the percent-encoded URL output I want:

    import urllib.parse
    extended = 'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'
    fixbytes = bytes(extended, 'utf-8')
    print(fixbytes)
    fixbytes = fixbytes.decode("unicode_escape")
    print(fixbytes)
    algoext = '?' + urllib.parse.quote(fixbytes, safe='?&=')
    print(algoext)

This outputs:

    b'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'
    filename=logo.txtx&filename=.hidden.txt
    ?filename=logo.txt%C2%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%01x&filename=.hidden.txt

Where does the %C2 byte come from? It’s not in the source string and it’s not in any of the intermediate steps. What could I do other than manually remove it from the final output string?

P.S. I’m relying on a library to generate the string so changing the way it’s represented initially is not an option.

Asked By: Iaotle


Answers:

As the docs for urllib.parse.quote say:

    Note that quote(string, safe, encoding, errors) is equivalent to quote_from_bytes(string.encode(encoding, errors), safe).

Here encoding defaults to UTF-8, and the UTF-8 encoding of '\x80' is:

    >>> '\x80'.encode('utf-8')
    b'\xc2\x80'

So it’s correct that the %C2 is there. You shouldn’t remove it.
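The equivalence quoted from the docs can be checked directly. This is a minimal sketch that uses a one-character string purely for illustration; the variable names are not from the question.

```python
from urllib.parse import quote, quote_from_bytes

s = '\x80'  # a single code point, U+0080

# quote() on a str first encodes it (UTF-8 by default),
# then percent-encodes the resulting bytes.
utf8_bytes = s.encode('utf-8')
via_quote = quote(s)
via_bytes = quote_from_bytes(utf8_bytes)

print(utf8_bytes)  # b'\xc2\x80'
print(via_quote)   # %C2%80
print(via_bytes)   # %C2%80
```

The extra %C2 is simply the first byte of the two-byte UTF-8 encoding of U+0080.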

Answered By: Sören

Maybe this is what you want:

    >>> import urllib.parse
    >>> extended = 'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'
    >>> fixbytes = bytes(extended, 'utf-8')
    >>> fixbytes = fixbytes.decode("unicode_escape")
    >>> fixbytes = fixbytes.encode("latin-1")
    >>> fixbytes
    b'filename=logo.txt\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01x&filename=.hidden.txt'
    >>> algoext = '?' + urllib.parse.quote(fixbytes, safe='?&=')
    >>> algoext
    '?filename=logo.txt%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%01x&filename=.hidden.txt'

Latin-1 is a legacy encoding that maps the code points 0-255 to the bytes 0-255. But really, if this is what you need, you should fix both whatever arcane process produced your mojibake in the first place and the server that doesn’t accept UTF-8 in 2022.
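The same result can probably be reached without the intermediate bytes object, since quote takes an encoding argument for str input. The following is a sketch under the assumption that the decoded string contains only code points 0-255; the input is shortened for illustration and the variable names are made up.

```python
from urllib.parse import quote

# Literal backslash escapes, as the library delivers them (shortened here).
extended = 'filename=logo.txt\\x80\\x00\\x01x&filename=.hidden.txt'

# Turn the textual escapes into real code points (U+0080, U+0000, U+0001).
decoded = extended.encode('ascii').decode('unicode_escape')

# Latin-1 encodes each code point 0-255 as exactly one byte.
latin1_query = '?' + quote(decoded, safe='?&=', encoding='latin-1')
utf8_query = '?' + quote(decoded, safe='?&=')  # default encoding: UTF-8

print(latin1_query)  # ?filename=logo.txt%80%00%01x&filename=.hidden.txt
print(utf8_query)    # ?filename=logo.txt%C2%80%00%01x&filename=.hidden.txt
```

If the decoded string ever contained a code point above 255, the Latin-1 variant would raise UnicodeEncodeError rather than silently produce something else.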

Answered By: Sören

This also achieves my goal:

    querystring = '?' + extended.replace('\\x', '%')
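For comparison, here is a small sketch that checks the textual replacement against the decode-then-quote route on a shortened input. It assumes every special byte arrives as a literal \xNN escape; the names by_replace and by_quote are only illustrative.

```python
from urllib.parse import quote

# Literal backslash escapes (shortened for illustration).
extended = 'filename=logo.txt\\x80\\x00\\x01x&filename=.hidden.txt'

# Textual substitution: every literal "\x" becomes "%".
by_replace = '?' + extended.replace('\\x', '%')

# Decode the escapes, then percent-encode one byte per code point.
decoded = extended.encode('ascii').decode('unicode_escape')
by_quote = '?' + quote(decoded, safe='?&=', encoding='latin-1')

print(by_replace)              # ?filename=logo.txt%80%00%01x&filename=.hidden.txt
print(by_replace == by_quote)  # True for this input
```

The two routes agree here, but the replacement leaves anything else untouched: a space or other reserved character would not be percent-encoded, and lowercase hex digits in an escape would stay lowercase, whereas quote emits uppercase.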
Answered By: Iaotle