file.exe on Windows cannot identify the file encoding and the file seems to be corrupted. What can be done?

Question:

For some files, chardet.detect(f.read())['encoding'] from Python's chardet library returns None.

import chardet

path = r"C:A chinese novel.TXT"
with open(path, 'rb') as f:  # binary mode: read raw bytes, no decoding
    result = chardet.detect(f.read())
    print(result)
# Output: {'encoding': None, 'confidence': 0.0, 'language': None}

I also checked the encoding with os.popen("file -bi \"%s\" | gawk -F'[ =]' '{print $3}'" % f).read(). It reports the file encoding as unknown-8bit.

Running 'file xxx.txt' outputs: xxx.txt: Non-ISO extended-ASCII text, with very long lines (560), with CRLF line terminators

Here is a GIF that shows the situation: https://i.imgur.com/5kvmnRL.gif

However, Notepad++ opens the file normally. It shows the file as GB2312-encoded, and the characters display mostly correctly.

Could the file have been corrupted into a mixed-encoding file that the chardet library cannot recognize?

ChatGPT suggested that I use iconv to re-encode the bad file, but I cannot confirm which encoding the file uses without first opening it in a text editor (Notepad++). Is there a more reliable way to identify file encodings with Python on Windows 10?

Asked By: all sky

Answers:

  • chardet: A very popular Python package for detecting text encodings.

  • cchardet: A Python module backed by C++ code, with a detect() interface similar to chardet's.

  • file-magic: A Python wrapper around libmagic, which recognizes file types and encodings.

import chardet
import cchardet
import magic

# chardet
with open('your_file_path', 'rb') as f:
    rawdata = f.read()
    result = chardet.detect(rawdata)
    encoding = result['encoding']
    print(encoding)

# cchardet
with open('your_file_path', 'rb') as f:
    rawdata = f.read()
    result = cchardet.detect(rawdata)
    encoding = result['encoding']
    print(encoding)

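# file-magic (prints a file-type description; the Magic()/id_filename
# interface used here appears to match the one from the filemagic package)
with magic.Magic() as m:
    file_type = m.id_filename('your_file_path')
    print(file_type)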

In my testing, cchardet worked well: it successfully reported the correct encoding for the file.
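
If a detector still returns None, one possible fallback is to try strict decoding against a short list of likely encodings, using only the standard library. This is a minimal sketch and not part of any of the packages above: the helper names guess_by_decoding and first_bad_offset and the candidate list are assumptions chosen for Chinese text. A successful decode does not prove the encoding is right, because several legacy encodings accept the same bytes, so the order of the candidates matters.

```python
# Hypothetical helpers; the candidate list is an assumption for Chinese text.
CANDIDATES = ("utf-8", "gb18030", "big5")


def guess_by_decoding(data, candidates=CANDIDATES):
    """Return the first candidate that decodes `data` without errors, else None."""
    for name in candidates:
        try:
            data.decode(name)  # strict mode raises on any invalid byte sequence
        except UnicodeDecodeError:
            continue
        return name
    return None


def first_bad_offset(data, encoding):
    """Return the byte offset of the first undecodable sequence, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start  # index of the first byte that failed to decode
    return None


if __name__ == "__main__":
    sample = "中文小说".encode("gb2312")  # stand-in for bytes read with open(path, 'rb')
    print(guess_by_decoding(sample))
    print(first_bad_offset(sample, "utf-8"))
```

For a file that might be partly corrupted, first_bad_offset can show where strict decoding fails under the encoding you suspect, which helps tell a damaged file from a wrongly guessed encoding.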

Answered By: all sky