How can I replace Unicode characters with Turkish characters in a text file with Python

Question

I am working with Twitter data. I collected tweets with the Streaming API, and the app's output is JSON. I wrote the tweet data to a text file, and now I see Unicode escape sequences instead of Turkish characters. I don’t want to find and replace them by hand in Notepad++. Is there an automatic way in Python to open the text file, read all of its data, and replace the Unicode escape sequences with Turkish characters?

Here are the Turkish characters and the Unicode escape sequences I want to replace:

ğ – \u011f
Ğ – \u011e
ı – \u0131
İ – \u0130
ö – \u00f6
Ö – \u00d6
ü – \u00fc
Ü – \u00dc
ş – \u015f
Ş – \u015e
ç – \u00e7
Ç – \u00c7

I tried two different approaches:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

dosya = open('veri.txt', 'r')

for line in dosya:
    match = re.search(line, "\u011f")
    if (match):
        replace("\u011f", "ğ")

dosya.close()

and:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

f1 = open('veri.txt', 'r')
f2 = open('veri2.txt', 'w')

for line in f1:
    f2.write=(line.replace('\u011f', 'ğ'))
    f2.write=(line.replace('\u011e', 'Ğ'))
    f2.write=(line.replace('\u0131', 'ı'))
    f2.write=(line.replace('\u0130', 'İ'))
    f2.write=(line.replace('\u00f6', 'ö'))
    f2.write=(line.replace('\u00d6', 'Ö'))
    f2.write=(line.replace('\u00fc', 'ü'))
    f2.write=(line.replace('\u00dc', 'Ü'))
    f2.write=(line.replace('\u015f', 'ş'))
    f2.write=(line.replace('\u015e', 'Ş'))
    f2.write=(line.replace('\u00e7', 'ç'))
    f2.write=(line.replace('\u00c7', 'Ç'))

f1.close()
f2.close()

Neither of them worked.
How can I make it work?

Asked By: S.SavaS

Answer 1

JSON allows both “escaped” and “unescaped” characters. The Twitter API returns only escaped characters so that its output is plain ASCII, which increases interoperability. To store Turkish characters unescaped, you need an encoding that can represent them. By default, the open function uses your current locale's encoding, which is probably what your editor expects. If you want the output file to use a specific encoding, e.g. ISO-8859-9, you can pass encoding='ISO-8859-9' as an additional parameter to the open function.
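As a rough illustration of the encoding parameter, the sketch below writes a Turkish sentence with an explicit ISO-8859-9 encoding and reads it back. The scratch file path and the variable names are made up for this example; the sentence is borrowed from the other answers on this page.

```python
import os
import tempfile

text = "Pijamalı Hasta Yağız Şoföre Çabucak Güvendi"

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "turkish.txt")

# Write with an explicit encoding instead of relying on the locale default.
with open(path, "w", encoding="ISO-8859-9") as f:
    f.write(text)

# Reading back with the same encoding recovers the original text.
with open(path, "r", encoding="ISO-8859-9") as f:
    round_trip = f.read()

# ISO-8859-9 uses one byte per character here; UTF-8 needs two bytes
# for each of the Turkish letters, so its encoded form is longer.
latin5_size = len(text.encode("ISO-8859-9"))
utf8_size = len(text.encode("utf-8"))
```

Whichever encoding you choose, the file has to be read with the same one, otherwise the Turkish letters will come out garbled.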

You can read a file containing a JSON object with the json.load function. This returns a Python object with the escaped characters decoded. Writing it again with json.dump and passing ensure_ascii=False as an argument writes the object back to a file without encoding Turkish characters as escape sequences. An example:

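import json

# Read the escaped JSON, then write it back without escape sequences.
# The with statement closes both files when the block ends.
with open('input.txt', 'r') as inp, open('output.txt', 'w') as out:
    in_as_obj = json.load(inp)
    json.dump(in_as_obj, out, ensure_ascii=False)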
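To see what ensure_ascii changes without touching any files, here is a small in-memory sketch. The sample string is made up for illustration and is not taken from the asker's data.

```python
import json

# A raw string, so the backslashes stay literal, as they would in the file.
escaped = r'{"text": "Ya\u011f\u0131z \u015eof\u00f6r"}'

# json.loads decodes the escape sequences into real characters.
obj = json.loads(escaped)

# By default json.dumps escapes every non-ASCII character again.
default_dump = json.dumps(obj)

# With ensure_ascii=False the Turkish characters are written as-is.
readable_dump = json.dumps(obj, ensure_ascii=False)

print(default_dump)
print(readable_dump)
```

The first print shows the escaped form again, the second shows the readable Turkish text.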

Your file isn’t really a JSON file, but instead a file containing multiple JSON objects. If each JSON object is on its own line, you can try the following:

import json

with open('input.txt', 'r') as inp, open('output.txt', 'w') as out:
    for line in inp:
        # Copy blank lines through unchanged.
        if not line.strip():
            out.write(line)
            continue
        # Decode one JSON object per line and write it back unescaped.
        in_as_obj = json.loads(line)
        json.dump(in_as_obj, out, ensure_ascii=False)
        out.write('\n')
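If some lines turn out not to be valid JSON, json.loads will raise an error. A possible fallback, which is a different technique from the JSON round trip above, is to replace the literal escape sequences with a regular expression. This is only a sketch: the function name is made up, it assumes the text contains plain backslash-u sequences with four hex digits, and it will raise an error on an unpaired surrogate.

```python
import re

# Matches a literal backslash, the letter u, and four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text):
    """Replace literal backslash-u escape sequences with their characters."""
    replaced = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Characters outside the basic plane (e.g. emoji) arrive as two escapes,
    # a surrogate pair. A UTF-16 round trip joins each pair into one character.
    return replaced.encode("utf-16", "surrogatepass").decode("utf-16")


# Raw string so the backslashes are literal, as they would be in the file.
sample = r"Ya\u011f\u0131z \u015eof\u00f6r \ud83d\ude00"
decoded = decode_unicode_escapes(sample)
print(decoded)
```

Unlike json.loads, this does not understand JSON structure, so it would also rewrite an escaped backslash followed by u. Prefer the JSON approach whenever the lines parse.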

But in your case it’s probably better to write unescaped JSON to the file in the first place. Try replacing your on_data method with the following (untested):

def on_data(self, raw_data):
    data = json.loads(raw_data)
    print(json.dumps(data, ensure_ascii=False))
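If you would rather write straight to a file than print, something along these lines might work. The helper name append_tweet and its path argument are hypothetical and not part of any Twitter library; the sketch only shows the json and open calls, assuming each raw_data value is one JSON object.

```python
import json


def append_tweet(raw_data, path):
    """Append one tweet to path as a line of unescaped JSON (hypothetical helper)."""
    data = json.loads(raw_data)
    # An explicit UTF-8 encoding keeps the result independent of the locale.
    with open(path, "a", encoding="utf-8") as out:
        out.write(json.dumps(data, ensure_ascii=False) + "\n")
```

Calling append_tweet(raw_data, 'veri.txt') from on_data would then produce a file with one readable JSON object per line.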

Answered By: Manuel Jacob

Answer 2

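You can use this method. Note that it maps Turkish characters to their closest ASCII letters rather than decoding escape sequences: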

# For Turkish Character
translationTable = str.maketrans("ğĞıİöÖüÜşŞçÇ", "gGiIoOuUsScC")

yourText = "Pijamalı Hasta Yağız Şoföre Çabucak Güvendi"
yourText = yourText.translate(translationTable)

print(yourText)

Answered By: Hasan Eren Keskin

Answer 3

The zip() function is enough. It takes iterables and returns an iterator of tuples that pair up their elements, so each Turkish character can be replaced with its ASCII counterpart in a loop:

cumle = "Pijamalı Hasta Yağız Şoföre Çabucak Güvendi"

tr_array = list("ğĞıİöÖüÜşŞçÇ")
en_array = list("gGiIoOuUsScC")

for turkce, ingilizce in zip(tr_array, en_array):
    cumle = cumle.replace(turkce, ingilizce)

print(cumle)

Answered By: Umut D.

Answer 4

import json


def convert_turkish_to_english(text):
    translation_table = str.maketrans("ğĞıİöÖüÜşŞçÇ", "gGiIoOuUsScC")
    return text.translate(translation_table)

with open('data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

modified_data = {}

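for key, value in data.items():
    # JSON keys are always strings; values may be numbers, lists, etc.
    converted_key = convert_turkish_to_english(key)
    if isinstance(value, str):
        converted_value = convert_turkish_to_english(value)
    else:
        # Leave non-string values untouched instead of failing on them.
        converted_value = value

    modified_data[converted_key] = converted_value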

with open('data_v2.json', 'w', encoding='utf-8') as file:
    json.dump(modified_data, file, ensure_ascii=False, indent=4)

Answered By: Zerzavot
