Cannot convert all elements of a CSV file to Python objects

Question:

I’m trying to convert all the elements of a CSV file to Python objects using the following script, but not all characters in the file are UTF-8, and I have to convert them to a readable format, i.e. UTF-8. How can I achieve this?

I’ve tried converting the CSV file to UTF-8 with a simple text editor, as described in How to convert csv files encoding to utf-8, but that did not help.

I’m using the following Python file:

import csv

filename = "file.csv"

rows = []

with open(filename, 'r') as csvfile:
    csvreader = csv.reader(csvfile)

    for row in csvreader:
        rows.append(row)

    print("Total no. of rows: %d" % csvreader.line_num)

print('\nFirst 5 rows are:\n')
for row in rows[:5]:
    for col in row:
        # end=' ' keeps the columns of one row on the same line (Python 3)
        print("%10s" % col, end=' ')
    print('\n')

Python produces the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 4942: invalid start byte.

Asked By: Jishan Shaikh


Answers:

UTF-8 is now a de facto standard because it can represent any Unicode character, but many systems (mostly Windows) still use other encodings for compatibility reasons. For example, for Western European languages, Windows uses cp1252, which is a Latin1 variant.

Latin1 is an interesting encoding because any byte is valid in Latin1 and represents the Unicode character with the same code point. Because of that, it is the encoding to use when you want bulletproof decoding and are unsure of the actual encoding. If the actual encoding is different, you will simply read weird characters. For example, the UTF-8 encoded string “fête” (French for feast) will read 'fÃªte' when decoded as Latin1.
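The behaviour described above can be checked with a few lines of Python. This is only a small illustration using the built-in str.encode and bytes.decode; the variable names are made up for the example.

```python
# "fête" encoded as UTF-8: the "ê" becomes the two bytes 0xC3 0xAA.
utf8_bytes = "fête".encode("utf-8")

# Decoding those bytes as Latin1 never fails, but each byte becomes
# its own character, so the text comes out garbled.
misread = utf8_bytes.decode("latin-1")
assert misread == "fÃªte"

# Every possible byte value decodes under Latin1 to the Unicode
# character whose code point equals the byte value.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
same_code_points = [ord(ch) for ch in decoded] == list(range(256))
assert same_code_points

# Because nothing is lost, the original bytes can be recovered and
# decoded again with the right encoding once it is known.
recovered = misread.encode("latin-1").decode("utf-8")
assert recovered == "fête"
```

Since the Latin1 round trip is lossless, reading with the wrong guess does not destroy data; it only displays it incorrectly until the text is re-encoded and decoded properly.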

So this will not break (but could give incorrect characters):

...
with open(filename, 'r', encoding='Latin1') as csvfile: 
    csvreader = csv.reader(csvfile)
...
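If the file might be in one of several encodings, one possible refinement is to try the stricter encodings first and keep Latin1 as the last resort. The sketch below is an assumption about how that could look, not part of the original answer: read_csv_rows is a hypothetical helper name, and the candidate list (utf-8, then cp1252, then latin-1) is a guess that suits Western European data.

```python
import csv


def read_csv_rows(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (rows, encoding) for the first encoding that decodes the whole file.

    Hypothetical helper: the default candidate list is an assumption and
    should be adapted to where the file comes from.
    """
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding, newline="") as csvfile:
                # list() forces the whole file to be read, so any decoding
                # error surfaces here rather than later.
                return list(csv.reader(csvfile)), encoding
        except UnicodeDecodeError:
            continue  # this candidate did not fit, try the next one
    # Only reachable if the caller passes a list without latin-1,
    # because latin-1 accepts every byte.
    raise ValueError("none of the candidate encodings could decode %r" % path)
```

It would be used as rows, used_encoding = read_csv_rows("file.csv"). A successful decode only means the bytes were valid in that encoding, not that the guess was right, so it is still worth looking at a few rows containing non-ASCII characters.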
Answered By: Serge Ballesta