Text to CSV file for Japanese characters (errors in arrangement)
Question:
I wanted to convert my text file into a CSV file; however, my output is very different from what I expected. Below are the examples:
text.txt (Encoding is “UTF-8”)
text =
-0.00010712468871868001 gram_0:Coll:0::ん
-0.00010712468871868001 gram-1:Coll:-1::止まる
-0.00010712468871868001 gram-3:Coll:-3::帰る
-0.00010712468871868001 gram1:Coll:0::ん
-0.00010712468871868001 gram2:Coll:2::いく
-0.00010712468871868001 gram3:Coll:3::く
My code:
import csv

with open('text.txt', 'r', encoding="utf-8") as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    with open('log.csv', 'w', encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro'))
        writer.writerows(lines)
Output:
My expected output:
It seems like I am getting quite a lot of ……. for the Japanese characters. Could anyone please assist me with this?
Answers:
Windows uses the BOM to determine the encoding of a text file, but Python does not generate a BOM automatically, so Windows may interpret the output file as ANSI. Try adding out_file.write('\ufeff')
immediately after the inner with
statement.
Source: Adding BOM (unicode signature) while saving file in python
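As an alternative to writing the BOM by hand, Python's built-in utf-8-sig codec prepends the BOM automatically when a file is opened for writing. A minimal sketch of the asker's script with that change (the sample input line is taken from the question's text.txt, and newline='' is added as the csv module's documentation recommends for output files):

```python
import csv

# Recreate a one-line sample of the question's input file.
sample = '-0.00010712468871868001 gram_0:Coll:0::ん\n'
with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(sample)

# 'utf-8-sig' writes the BOM (U+FEFF) automatically, so Windows tools
# such as Excel and Notepad detect the output as UTF-8 instead of ANSI.
with open('text.txt', 'r', encoding='utf-8') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(',') for line in stripped if line)
    with open('log.csv', 'w', encoding='utf-8-sig', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro'))
        writer.writerows(lines)
```

With this change the CSV begins with the bytes EF BB BF, and no manual out_file.write call is needed.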