Python – how to convert unicode filename to CP437?

Question:

I have a file that has a Unicode name, say 'קובץ.txt'. I want to pack it, and I'm using Python's zipfile module.

I can zip the files and open them later on without a problem, except that the file names are messed up when I view them in Windows 7 File Explorer (7-Zip works great).

According to the docs, this is a common problem, and there are instructions on how to deal with that:

From ZipFile.write

Note

There is no official file name encoding for ZIP files. If you have
unicode file names, you must convert them to byte strings in your
desired encoding before passing them to write(). WinZip interprets all
file names as encoded in CP437, also known as DOS Latin.

Sorry, but I can't work out what exactly I am supposed to do with the filename. I've tried .encode('CP437') and .decode('CP437').

Asked By: A-Palgy


Answers:

You’d have to encode your Unicode string to CP437. However, you can’t encode your specific example because the CP437 codec does not support Hebrew:

>>> u'קובץ.txt'.encode('cp437')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>

The above error tells you that the first 4 characters (קובץ) cannot be encoded because there are no such characters in the target character set. CP437 only supports the Western alphabet (A-Z, and accented characters like ç and é), IBM line-drawing characters (such as ╚ and ┤) and a few Greek symbols, mainly for math equations (such as Σ and φ).

You'll either have to generate a different filename that only uses characters supported by the CP437 codec, or live with the fact that WinZip will never be able to show Hebrew filenames properly and simply stick with the character set that did work for you with 7-Zip.
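Whether a given name can be stored as CP437 at all can be checked before writing the archive. A minimal sketch, where the helper name `is_cp437_encodable` is hypothetical and not part of zipfile:

```python
def is_cp437_encodable(name):
    """Return True if every character of name exists in CP437."""
    try:
        name.encode('cp437')
    except UnicodeEncodeError:
        return False
    return True

print(is_cp437_encodable('café.txt'))   # accented Latin letters are in CP437
print(is_cp437_encodable('קובץ.txt'))   # Hebrew letters are not
```

A name that fails this check needs either a replacement name or a different encoding.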

Answered By: Martijn Pieters

Try this:

import zipfile
p=b'\xd7\xa7\xd7\x95\xd7\x91\xd7\xa5.txt'.decode('utf8')
# or just:
# p='קובץ.txt'
z=zipfile.ZipFile('test.zip','w')
f=z.open(p.encode('utf8').decode('cp437'),'w')
f.write(b'hello world')
f.close()
z.close()

I tried this on macOS, where I used utf8 rather than cp437 in the code above, and it works.

I hope this works on Windows.

I've also tested reading Chinese filenames encoded with “gbk” or “gb18030” using similar code, and it works well.

When you have a zip archive from (or need to send one to) Mac/Linux, change cp437 in the code to utf8 and everything works.

When you have a zip archive from (or need to send one to) Windows, leave cp437 unchanged.
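The reading case can be sketched as follows. Python 3's zipfile decodes an entry name as CP437 by default when the entry's UTF-8 flag (bit 11, 0x800) is not set, so the original bytes can be recovered by encoding the name back to CP437 and then decoding it with the encoding the archive was really written in. The helper name `recover_name` and the choice of 'gbk' are assumptions for illustration; the real encoding has to be known or guessed.

```python
UTF8_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8

def recover_name(name_as_read, flag_bits, real_encoding):
    """Undo zipfile's CP437 decoding for names written in another encoding."""
    if flag_bits & UTF8_FLAG:
        return name_as_read  # zipfile already decoded this name as UTF-8
    return name_as_read.encode('cp437').decode(real_encoding)

# Simulate what zipfile would report for a GBK-encoded name without the flag
garbled = '文件.txt'.encode('gbk').decode('cp437')
print(recover_name(garbled, 0, 'gbk'))
```

With a real archive, this would be called as recover_name(info.filename, info.flag_bits, 'gbk') for each info in ZipFile.infolist().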

Answered By: cdarlint

For CP866 (Russian) this works:

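    from zipfile import ZipFile, ZipInfo

    class ZipInf(ZipInfo):
        def __init__(self, filename):
            super().__init__(filename)
            self.create_system = 0  # mark the entry as created on MS-DOS/FAT

        # Overrides a private ZipInfo method, so it may break in other Python versions
        def _encodeFilenameFlags(self):
            # Store the name as raw CP866 bytes and leave the flag bits unchanged
            return self.filename.encode('cp866'), self.flag_bits

    with ZipFile('ex.zip', 'w') as zipf:
        zipf.writestr(ZipInf('Файл'), '123456789'*1024)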

This stores directory names and file names in the zip encoded as CP866 (this example contains only the file ‘Файл’).
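One way to check what ended up in the archive is to write it to memory and read it back. This sketch reuses the subclass idea; it relies on the private method `_encodeFilenameFlags`, which is a CPython implementation detail and an assumption here rather than documented API. The names `Cp866ZipInfo` and `buf` are made up for the example.

```python
import io
from zipfile import ZipFile, ZipInfo

class Cp866ZipInfo(ZipInfo):
    # Private hook (assumption: present in the running CPython version)
    def _encodeFilenameFlags(self):
        return self.filename.encode('cp866'), self.flag_bits

# Write a one-entry archive into memory instead of onto disk
buf = io.BytesIO()
with ZipFile(buf, 'w') as zipf:
    zipf.writestr(Cp866ZipInfo('Файл'), 'data')

# Read it back: without the UTF-8 flag, zipfile decodes the name as CP437
with ZipFile(buf) as zipf:
    info = zipf.infolist()[0]
    stored_as_utf8 = bool(info.flag_bits & 0x800)
    recovered = info.filename.encode('cp437').decode('cp866')
    print(stored_as_utf8, recovered)
```

If the override took effect, the UTF-8 flag is not set and re-decoding the name as CP866 gives back the original file name.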
