Converting Broken Strings with Python
Question:
My Python code:
cursor = conn.cursor()
cursor.execute("select * from %s" % table_name)
row = cursor.fetchall()
data = ( [tuple(el.encode('latin1').decode('euc-kr') for el in t) for t in row] )
# Open CSV file for writing.
csvFile = csv.writer(open(filePath + fileName, 'w', newline='', encoding='utf-8'),
                     delimiter=',', lineterminator='\r\n',
                     quoting=csv.QUOTE_ALL, escapechar='\\')
csvFile.writerows(data)
I am converting EUC-KR data to UTF-8 to create a CSV file.
Normal data is converted correctly.
Broken characters cannot be converted. How should I deal with them?
Example of broken characters: 뚦 딺똚
Error message when executing the code:
Traceback (most recent call last):
File "test.py", line 42, in <module>
batch_extrat('test_table')
File "test.py", line 30, in batch_extrat
data = ( [tuple(el.encode('latin1').decode('euc-kr') for el in t) for t in row] )
File "test.py", line 30, in <listcomp>
data = ( [tuple(el.encode('latin1').decode('euc-kr') for el in t) for t in row] )
File "test.py", line 30, in <genexpr>
data = ( [tuple(el.encode('latin1').decode('euc-kr') for el in t) for t in row] )
UnicodeDecodeError: 'euc_kr' codec can't decode byte 0x8c in position 0: illegal multibyte sequence
If the broken characters can't be converted, I would like to replace them with "?".
Answers:
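You can use the replace error handler when decoding:
# Decode with the 'replace' error handler so invalid bytes no longer raise.
# Decoding substitutes U+FFFD for them, so map that character to '?' afterwards.
decoded_t = tuple(el.encode('latin1').decode('euc-kr', errors='replace').replace('\ufffd', '?') for el in t)
With errors='replace', decode() does not raise UnicodeDecodeError on invalid byte sequences; it inserts the Unicode replacement character U+FFFD instead. A literal "?" is what this handler produces when encoding, not decoding, so the extra .replace('\ufffd', '?') is needed to end up with "?" in the decoded text.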
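As a rough sketch of how this could be applied to every fetched row, the helper below wraps the latin1-to-EUC-KR round trip from the question. The names fix_euckr and fix_rows are hypothetical and not part of the original code. It assumes that non-string columns (None, numbers) should pass through unchanged, and that every string contains only code points up to U+00FF, as the question's own encode('latin1') call already requires. Note that when decoding, Python's 'replace' handler emits U+FFFD rather than "?", so the sketch swaps it for the requested placeholder.

```python
def fix_euckr(value, placeholder="?"):
    """Re-decode a latin1-mangled string as EUC-KR (hypothetical helper).

    Bytes that are not valid EUC-KR become `placeholder` instead of
    raising UnicodeDecodeError.
    """
    # Assumption: non-string columns (None, int, ...) are left untouched.
    if not isinstance(value, str):
        return value
    raw = value.encode("latin1")                     # recover the original bytes
    text = raw.decode("euc-kr", errors="replace")    # invalid bytes -> U+FFFD
    return text.replace("\ufffd", placeholder)       # U+FFFD -> "?"


def fix_rows(rows, placeholder="?"):
    """Apply fix_euckr to every column of every row (hypothetical helper)."""
    return [tuple(fix_euckr(el, placeholder) for el in t) for t in rows]


# Simulated cursor.fetchall() result: valid EUC-KR text read as latin1,
# plus one value starting with the invalid byte 0x8c from the traceback.
good = "한글".encode("euc-kr").decode("latin1")
rows = [(good, 1, None), ("\x8cabc", 2, None)]
data = fix_rows(rows)
```

The resulting data list can then be passed to csvFile.writerows(data) as in the question.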