How to detect string byte encoding?

Question:

I’ve got about 1000 filenames read by os.listdir(); some of them are encoded in UTF-8 and some in CP1252.

I want to decode all of them to Unicode for further processing in my script. Is there a way to detect the source encoding so that each name can be decoded correctly into Unicode?

Example:

for item in os.listdir(rootPath):
    # Convert to Unicode
    if isinstance(item, str):
        item = item.decode('cp1252')  # or item = item.decode('utf-8')
    print item
Asked By: Philipp


Answers:

If your filenames are in either CP1252 or UTF-8, there is an easy way:

import logging
import os

def force_decode(string, codecs=['utf8', 'cp1252']):
    for codec in codecs:
        try:
            return string.decode(codec)
        except UnicodeDecodeError:
            pass

    logging.warning("cannot decode filename %r", string)

for item in os.listdir(rootPath):
    # Convert to Unicode
    if isinstance(item, str):
        item = force_decode(item)
    print item

Otherwise, there is a charset-detection library:

Python – detect charset and convert to utf-8

https://pypi.python.org/pypi/chardet
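
For the filename case, a minimal sketch of using chardet as the decoder of last resort might look like this (decode_name is a hypothetical helper, not part of either answer):

import chardet

def decode_name(raw):
    # detect() inspects the raw bytes and returns a dict with its best
    # guess, e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    guess = chardet.detect(raw)['encoding']
    # Fall back to cp1252 if chardet cannot make a guess at all
    return raw.decode(guess or 'cp1252')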

Answered By: lucemia

Use the chardet library. It is super easy:

import chardet

the_encoding = chardet.detect('your string')['encoding']

and that’s it!

In Python 3 you need to pass bytes or a bytearray, so:

import chardet
the_encoding = chardet.detect(b'your string')['encoding']
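
Note that detect() returns a dict with a confidence score next to the guessed encoding, so you can feed the result straight into decode(). A small usage sketch, not from the original answer:

import chardet

result = chardet.detect(b'your string')
# e.g. {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
text = b'your string'.decode(result['encoding'])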
Answered By: george

You can also use the json package to detect the encoding:

import json

json.detect_encoding(b"Hello")
Answered By: Suyog Shimpi

The encoding detected by chardet can usually be used to decode a bytearray without raising an exception, but the output string may not be correct.

The try ... except ... approach works perfectly for known encodings, but it does not cover every scenario.

We can use try ... except ... first and then chardet as plan B:

import chardet
from typing import List

def decode(byte_array: bytearray, preferred_encodings: List[str] = None):
    if preferred_encodings is None:
        preferred_encodings = [
            'utf8',       # Works for most cases
            'cp1252'      # Other encodings may appear in your project
        ]

    # Try preferred encodings first
    for encoding in preferred_encodings:
        try:
            return byte_array.decode(encoding)
        except UnicodeDecodeError:
            pass
    else:
        # None of the preferred encodings worked; use the detected encoding
        encoding = chardet.detect(byte_array)['encoding']
        return byte_array.decode(encoding)
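
For example (hypothetical input, assuming chardet is installed):

raw = bytearray(b'\xa9 2023')  # not valid UTF-8
print(decode(raw))             # falls through to cp1252 and yields '© 2023'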

Answered By: Shawn Hu

I tried with both json and chardet, and I got these results:

import json
import chardet

data = b'\xa9 2023'
json.detect_encoding(data)  # 'utf-8'
data.decode('utf-8')  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte

chardet.detect(data)  # {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
data.decode("ISO-8859-1")  # '© 2023'
Answered By: jakobdo

charset_normalizer is a drop-in replacement for chardet.

It works better on natural language and has a permissive MIT licence: https://github.com/Ousret/charset_normalizer/

from charset_normalizer import detect
encoding = detect(byte_string)['encoding']
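
If you want more than the chardet-style dict, the library's own API can hand you the decoded text directly (a sketch based on charset_normalizer's from_bytes/best API):

from charset_normalizer import from_bytes

match = from_bytes(b'\xa9 2023').best()  # best CharsetMatch, or None
if match is not None:
    print(match.encoding)  # name of the detected encoding
    print(str(match))      # the decoded text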

PS: This is not strictly related to the original question, but this page comes up in Google a lot.

Answered By: Dawars