Python opening files with utf-8 file names

Question:

In my code I used something like file = open(path + '/' + filename, 'wb') to write the file,
but in my attempt to support non-ASCII filenames, I now encode the path like this:

naming = path+'/'+filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
write binary data...

so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling the same directory using:

for file in path:
    data = open(file.as_posix(), 'rb')
    ...

I keep getting this error: 'ascii' codec can't encode characters in position...
I tried converting the string to bytes with data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb'), but then I get: 'utf-8' codec can't encode characters in position...

I also tried file.as_posix().encode('utf-8', 'surrogateescape'). Both encode and print work just fine, but with open() I still get the error: 'utf-8' codec can't encode characters in position...

How can I open a file with a utf-8 filename?

I’m using Python 3.9 on ubuntu linux

Any help is greatly appreciated.

EDIT

I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to UTF-8, it writes fine.
But when I find the file again by crawling the directory, str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt, so I get an error when I try to encode it with any codec.
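
One plausible reading of the ???????? symptom (an assumption, since the thread does not show the actual strings) is that Python decoded the file name with an ASCII filesystem encoding and the surrogateescape error handler, which maps each undecodable byte to a lone surrogate. The sketch below reproduces that with plain string operations; the file name is a hypothetical stand-in for the uploaded one.

```python
# Hypothetical file name: the UTF-8 bytes of four Arabic letters plus ".txt".
raw_name = "\u0639\u0631\u0628\u064a.txt".encode("utf-8")

# What a directory listing would hand back if the filesystem encoding were ASCII:
# each non-ASCII byte becomes a lone surrogate in the range U+DC80..U+DCFF.
listed = raw_name.decode("ascii", "surrogateescape")

# Strict UTF-8 encoding rejects those lone surrogates.
try:
    listed.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The same error handler that produced the surrogates turns them back into bytes.
restored = listed.encode("ascii", "surrogateescape")
recovered_name = restored.decode("utf-8")

print(strict_encode_ok)
print(restored == raw_name)
print(recovered_name == "\u0639\u0631\u0628\u064a.txt")
```

On POSIX systems, os.fsencode() and os.fsdecode() apply the configured filesystem encoding with this same error handler, so they are usually the safer way to move between str and bytes paths.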

I’m currently investigating whether the issue is related to my Linux locale. It was set to POSIX and I changed it to C.UTF-8, but still no luck so far.

More context: this is a file system where the file is uploaded through a site, so I receive the filename string in UTF-8 format.

Asked By: spospider


Answers:

I don’t understand why you feel you need to recode filepaths.

Linux (Unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There’s no need to break astral characters into surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there’s no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding for a surrogate codepoint, you’re likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
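
As a minimal sketch of that point (using only built-in string methods, with U+1D510 and U+D835 picked purely for illustration), an astral character encodes to an ordinary four-byte UTF-8 sequence, while a lone surrogate codepoint is rejected by the strict UTF-8 codec:

```python
# U+1D510 lies outside the Basic Multilingual Plane (an "astral" character).
astral = "\U0001D510"

# It has a regular four-byte UTF-8 encoding, which is fine in a Linux file name.
encoded = astral.encode("utf-8")
print(encoded)  # b'\xf0\x9d\x94\x90'

# A lone surrogate codepoint has no UTF-8 encoding, so strict encoding fails.
lone_surrogate = "\ud835"
try:
    lone_surrogate.encode("utf-8")
    surrogate_encodable = True
except UnicodeEncodeError:
    surrogate_encodable = False
print(surrogate_encodable)  # False
```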

Anyway, there’s no need to go to all that trouble. Before running this session, I created a directory called 'ñ' with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.

>>> from pathlib import Path
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path('ñ2').mkdir()
>>> for path in Path('ñ').iterdir():
...   open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']

Note:

In a comment, OP says that they had previously tried:

file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')

and received the error

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)

Without more details, it’s hard to know how to respond to that. It’s possible that open will raise that error for a filesystem which doesn’t allow non-ascii characters, but that wouldn’t be normal on Linux.

However, it’s worth noting that the string literal

'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'

is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".

To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
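
The difference is easy to check interactively. The following sketch compares the str literal, the bytes literal, and the Unicode escape; the variable names are only illustrative, and the last step shows one way a string that was built from \x escapes by mistake could be repaired:

```python
# In a str literal, each \x escape is one codepoint in the range 0..255.
as_text = "\xd8\xb9"
print(len(as_text))                     # 2
print(as_text == "\u00d8\u00b9")        # True: Ø followed by ¹

# In a bytes literal, the same escapes are raw byte values,
# and these two bytes are the UTF-8 encoding of U+0639.
as_bytes = b"\xd8\xb9"
ain = as_bytes.decode("utf-8")
print(ain == "\u0639")                  # True: the Arabic letter ain

# Repairing a mis-built string: map each codepoint back to a byte
# with Latin-1, then decode those bytes as UTF-8.
repaired = as_text.encode("latin-1").decode("utf-8")
print(repaired == ain)                  # True
```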

If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:

file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')

But that’s not recommended.

Answered By: rici

So after being down a rabbit hole for the past few days, I figured out that the issue isn’t with Python itself but with the locale that my web framework was using. Debugging this, I saw that

import sys
print(sys.getfilesystemencoding())

returned ‘ASCII’, which was weird considering I had set the Linux locale to C.UTF-8. I then discovered that, since I was running WSGI on Apache2, I had to add the locale to my WSGI setup, as in WSGIDaemonProcess my_app locale='C.UTF-8' in the Apache configuration file, thanks to this post.
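
For anyone debugging something similar, a slightly fuller diagnostic than the two lines above might look like the sketch below. The helper name describe_filename_encoding is made up for this example; the calls it wraps are standard library functions available in Python 3.7 and later. Running it inside the web application rather than in a shell matters, because the two environments can have different locales.

```python
import locale
import sys


def describe_filename_encoding():
    """Collect the settings Python uses to convert str paths to bytes."""
    return {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


info = describe_filename_encoding()
for key, value in info.items():
    print(f"{key}: {value}")
```

If filesystem_encoding reports an ASCII variant inside the application but utf-8 in a terminal, the process is most likely starting with a different locale than the login shell.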

Answered By: spospider