Editing UTF-8 text file on Windows

Question:

I’m trying to manipulate a text file of song names. I want to clean up the data by changing all the spaces and tabs into +.

This is the code:

input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
    new_line = line.replace(" ", "+")
    new_line2 = new_line.replace("\t", "+")
    out.write(new_line2)
    #print(new_line2)
input.close()
out.close()

It gives me an error:

Traceback (most recent call last):
  File "music.py", line 3, in <module>
    for line in input:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>

As music.txt is saved in UTF-8, I changed the first line to:

input = open('music.txt', 'r', encoding="utf8")

This gives another error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>

I tried other things with the out.write() but it didn’t work.

This is the raw data of music.txt.
https://pastebin.com/FVsVinqW

I saved it in the Windows editor as a UTF-8 .txt file.

Asked By: Loewe8

||

Answers:

If your system’s default encoding is not UTF-8, you need to specify the encoding explicitly for both of the file handles you open. This is the case on legacy versions of Python 3 on Windows.

with open('music.txt', 'r', encoding='utf-8') as infh, \
        open("out.txt", "w", encoding='utf-8') as outfh:
    for line in infh:
        line = line.replace(" ", "+").replace("\t", "+")
        outfh.write(line)

This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
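To see why both handles need the encoding, here is a minimal sketch that first reproduces the two errors from the question in memory and then does the round trip on a temporary file. The sample text, the helper name plus_separated, and the temporary paths are assumptions made up for illustration; they are not taken from the real music.txt.

```python
import os
import tempfile


def plus_separated(line):
    """Replace every space and tab in line with '+'."""
    return line.replace(" ", "+").replace("\t", "+")


# The first error: byte 0x81 has no character assigned in cp1252,
# so decoding UTF-8 data as cp1252 can fail on it.
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError as exc:
    decode_error = exc

# The second error: cp1252 has no byte for U+039B (Greek capital lambda),
# so writing it through a cp1252 file handle fails.
try:
    "\u039b".encode("cp1252")
except UnicodeEncodeError as exc:
    encode_error = exc

# Hypothetical sample data standing in for music.txt.
sample = "Artist One\tFirst Song\n\u039b Band\tSecond Song\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "music.txt")
    dst = os.path.join(tmp, "out.txt")

    # Create the input file as UTF-8.
    with open(src, "w", encoding="utf-8") as fh:
        fh.write(sample)

    # Read and write with an explicit encoding on both handles.
    with open(src, "r", encoding="utf-8") as infh, \
            open(dst, "w", encoding="utf-8") as outfh:
        for line in infh:
            outfh.write(plus_separated(line))

    # Read the result back to check it.
    with open(dst, "r", encoding="utf-8") as fh:
        result = fh.read()

print(result)
```

With encoding='utf-8' on only the input handle, the read succeeds but the write can still fail on a system whose default is cp1252, which matches the second error in the question.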

Going forward, a better solution may be to upgrade your Python version. Python 3.7 and later offer an opt-in UTF-8 mode (the -X utf8 option or the PYTHONUTF8=1 environment variable), and PEP 686 makes UTF-8 mode the default starting with Python 3.15, on Windows too.
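To check which default applies on a given installation, something like the following should work. It is a small sketch that assumes Python 3.7 or later, where sys.flags.utf8_mode exists; the variable names are only illustrative.

```python
import locale
import sys

# Encoding that open() uses in text mode when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)

# 1 when UTF-8 mode is active (for example via -X utf8 or PYTHONUTF8=1),
# otherwise 0. Assumes Python 3.7+, where this flag was added.
utf8_mode = sys.flags.utf8_mode

print("default text encoding:", default_encoding)
print("UTF-8 mode enabled:", bool(utf8_mode))
```

If the first line prints something like cp1252, any open() call without an explicit encoding will behave as in the question.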

Answered By: tripleee