UnicodeEncodeError when reading a file

Question:

I am trying to read from rockyou wordlist and write all words that are >= 8 chars to a new file.

Here is the code –

def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            print(line, file = out_file, end = '')
        print("done")

if __name__ == '__main__':
    main()

Some words are not utf-8.

Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 6, in main
print(line, file = out_file, end = '')
File "C:Pythonlibencodingscp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character 'u0e45' in position
0: character maps to <undefined>

Update

def main():
with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
    for line in in_file:
        if len(line.rstrip()) < 8:
            continue
        out_file.write(line)
    print("done")

if __name__ == '__main__':
    main()```

Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 3, in main
for line in in_file:
File "C:Pythonlibcodecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invali
d continuation byte

Asked By: Mark Evans

||

Answers:

Your UnicodeEncodeError: 'charmap' error occurs during writing to out_file (in print()).

By default, open() uses locale.getpreferredencoding() that is ANSI codepage on Windows (such as cp1252) that can’t represent all Unicode characters and 'u0e45' character in particular. cp1252 is a one-byte encoding that can represent at most 256 different characters but there are a million (1114111) Unicode characters. It can’t represent them all.

Pass encoding that can represent all the desired data e.g., encoding='utf-8' must work (as @robyschek suggested)—if your code reads utf-8 data without any errors then the code should be able to write the data using utf-8 too.


Your UnicodeDecodeError: 'utf-8' error occurs during reading in_file (for line in in_file). Not all byte sequences are valid utf-8 e.g., os.urandom(100).decode('utf-8') may fail. What to do depends on the application.

If you expect the file to be encoded as utf-8; you could pass errors="ignore" open() parameter, to ignore occasional invalid byte sequences. Or you could use some other error handlers depending on your application.

If the actual character encoding used in the file is different then you should pass the actual character encoding. bytes by themselves do not have any encoding—that metadata should come from another source (though some encodings are more likely than others: chardet can guess) e.g., if the file content is an http body then see A good way to get the charset/encoding of an HTTP response in Python

Sometimes a broken software can generate mostly utf-8 byte sequences with some bytes in a different encoding. bs4.BeautifulSoup can handle some special cases. You could also try ftfy utility/library and see if it helps in your case e.g., ftfy may fix some utf-8 variations.

Answered By: jfs

Hey I was having a similar issue, in the case of rockyou.txt wordlist, I tried a number of encodings that Python had to offer and I found that encoding = 'kio8_u' worked to read the file.

Answered By: James Hensman

None of the answers worked for me, but I found that adding errors='replace' to open() fixed it.

Answered By: user14695391