UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

Question:

I have a socket server that is supposed to receive valid UTF-8 characters from clients.

The problem is that some clients (mainly hackers) are sending all the wrong kinds of data over it.

I can easily distinguish the genuine client, but I am logging to files all the data sent so I can analyze it later.

Sometimes I get characters like œ that cause a UnicodeDecodeError.

I need to be able to make the string UTF-8 with or without those characters.


Update:

For my particular case the socket service was an MTA and thus I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <[email protected]>
...

I was logging all of this in JSON.

Then some folks out there without good intentions decided to send all kinds of junk.

That is why for my specific case it is perfectly OK to strip the non-ASCII characters.

Asked By: transilvlad


Answers:

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: errors='ignore' strips out the characters in question and returns the string without them, while errors='replace' substitutes the replacement character U+FFFD for each of them.

For me, ignoring is the ideal case, since I’m using it as protection against non-ASCII input, which is not allowed by my application.
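In Python 3 the unicode built-in no longer exists, and the same error handlers are passed to bytes.decode instead. Here is a minimal sketch, assuming the data arrives as a bytes object and Python 3.5 or later (needed for backslashreplace when decoding). The sample payload is made up for illustration:

```python
# Hypothetical payload: an ASCII command followed by a stray 0x9c byte.
raw = b"EHLO example.com\x9c"

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' silently drops undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")

# 'backslashreplace' keeps a readable trace of the bad byte, handy for logs.
escaped = raw.decode("utf-8", errors="backslashreplace")

print(repr(replaced))
print(repr(ignored))
print(repr(escaped))
```

Which handler fits depends on whether you want the junk to stay visible in your logs or to disappear entirely.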

Alternatively: Use the open method from the codecs module to read in the file:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    data = fdata.read()  # undecodable bytes are dropped silently
Answered By: transilvlad

>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
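For anyone on Python 3, roughly the same check can be written with a bytes literal. This is only a sketch of why the error appears: the byte 0x9c is œ in Windows-1252, but on its own it is not valid UTF-8.

```python
raw = b"\x9c"

# In Windows-1252, byte 0x9c maps to U+0153 (LATIN SMALL LIGATURE OE).
as_cp1252 = raw.decode("cp1252")

# As UTF-8, a lone 0x9c is a continuation byte with no lead byte, so it fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same character encoded as valid UTF-8 takes two bytes.
as_utf8_bytes = as_cp1252.encode("utf-8")

print(as_cp1252, utf8_ok, as_utf8_bytes)
```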

In case someone has the same problem: I’m using vim with YouCompleteMe, and ycmd failed to start with this error message. Running export LC_CTYPE="en_US.UTF-8" made the problem go away.

Answered By: http8086

This type of issue crops up for me now that I’ve moved to Python 3. I had no idea Python 2 was simply steamrolling over any issues with file encoding.

I found this nice explanation of the differences and how to find a solution after none of the above worked for me.

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

In short, to make Python 3 behave as similarly as possible to Python 2 use:

with open(filename, encoding="latin-1") as datafile:
    # work on datafile here

However, read the article; there is no one-size-fits-all solution.
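The reason latin-1 never raises is that it maps each of the 256 possible byte values to the code point with the same number. A small sketch of what that buys you and what it costs:

```python
# Every possible byte value decodes under latin-1 without error.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")

# The mapping is one-to-one, so encoding again restores the original bytes.
round_tripped = text.encode("latin-1")

# The catch: UTF-8 data read as latin-1 turns into mojibake.
mojibake = "ñ".encode("utf-8").decode("latin-1")

print(round_tripped == all_bytes, repr(mojibake))
```

So the data survives intact, but non-ASCII text may display wrongly unless the file really was latin-1.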

Answered By: James McCormac

I had the same problem with UnicodeDecodeError and I solved it with this line.
I don’t know if it is the best way, but it worked for me.

str = str.decode('unicode_escape').encode('utf-8')
Answered By: maiky_forrester

Changing the engine from C to Python did the trick for me.

Engine is C:

pd.read_csv(gdp_path, sep='\t', engine='c')

‘utf-8’ codec can’t decode byte 0x92 in position 18: invalid start byte

Engine is Python:

pd.read_csv(gdp_path, sep='\t', engine='python')

No errors for me.

Answered By: Doğuş

What can you do if you need to make a change to a file, but don’t know the file’s encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()
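To show how that works end to end, here is a sketch that writes a throwaway file, edits only its ASCII part, and writes it back. The file name and contents are made up for illustration. With surrogateescape each undecodable byte becomes a lone surrogate on read and is turned back into the original byte on write:

```python
import os
import tempfile

# Hypothetical file: ASCII text with one byte (0x9c) of unknown encoding.
original = b"MAIL FROM: sender\x9c\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(original)

# Undecodable bytes become lone surrogates (0x9c -> U+DC9C) instead of errors.
with open(path, "r", encoding="ascii", errors="surrogateescape", newline="") as f:
    data = f.read()

# Edit only the ASCII part.
data = data.replace("sender", "user")

# Writing with the same handler turns the surrogates back into the raw bytes.
with open(path, "w", encoding="ascii", errors="surrogateescape", newline="") as f:
    f.write(data)

with open(path, "rb") as f:
    result = f.read()

print(result)
```

Note that the string still contains lone surrogates in between, so printing it or encoding it as strict UTF-8 may fail.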

First, use get_encoding_type to detect the file's encoding:

from chardet import detect

# Detect the file's encoding from its raw bytes
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

Second, open the file with the detected encoding:

open(current_file, 'r', encoding=get_encoding_type(current_file), errors='ignore')
Answered By: Ivan Lee

This solution works nicely for text with Latin American accented characters, such as ‘ñ’.

I solved this problem just by adding the encoding argument:

df = pd.read_csv(fileName,encoding='latin1')
Answered By: Talha Rasool
Answered By: Dhinesh Kumar

If, as you say, you simply want to permit pure 7-bit ASCII, just discard any input which is not. There is no straightforward way to guess what the remote end intended those bytes to represent anyway, without an explicitly specified encoding.

while bytes := socket.read_line_bytes():
    try:
        string = bytes.decode('us-ascii')
    except UnicodeDecodeError as exc:
        # Log the offending bytes as \xNN escapes so the log itself stays ASCII
        logger.warning('[%s] - rejected non-ASCII input %s' % (client, bytes.decode('us-ascii', errors='backslashreplace')))
        socket.write(b'421 communication error - non-ASCII content rejected\r\n')
        continue
    ...
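
Since the question mentions logging everything as JSON, the same idea can be sketched in a form you can run without a socket. The helper names is_pure_ascii and to_loggable, and the client address, are hypothetical and not part of any library:

```python
import json

def is_pure_ascii(raw: bytes) -> bool:
    """Return True only if every byte is 7-bit ASCII."""
    try:
        raw.decode("us-ascii")
    except UnicodeDecodeError:
        return False
    return True

def to_loggable(raw: bytes) -> str:
    """Return an ASCII-only string; bad bytes show up as \\xNN escapes."""
    return raw.decode("us-ascii", errors="backslashreplace")

# Hypothetical log entry for a junk line from a client.
junk = b"EHLO \x9c\xff"
entry = json.dumps({"client": "192.0.2.1", "data": to_loggable(junk)})
print(is_pure_ascii(junk), entry)
```

This keeps the junk visible in the log for later analysis while guaranteeing the logged string never triggers a decode error.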
Answered By: tripleee

I had the same error.

For me, Python complained about the byte "0x87". I looked it up on https://bytetool.web.app/en/ascii/code/0x87/ where it told me that this byte belongs to the codec Windows-1252.

I then only added this line to the beginning of my Python file:

# -*- encoding: Windows-1252 -*-

And all errors were gone. Before I had added this line, I had tried Pandas to import the file like this:

Df = pd.read_csv(data, sep=",", engine='python', header=0, encoding='Windows-1252')

but this returned me an error. So I changed it back to this:

Df = pd.read_csv(data, sep=",", engine='python', header=0)
Answered By: Kai