Python: Compressed file ended before the end-of-stream marker was reached, but the file is not corrupted

Question:

I wrote a simple requests script that downloads a file from a server:


r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:...index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close


When I now extract the file manually in the directory with 7-Zip, everything is fine and the file decompresses as normal.

I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:

import lzma 
with open('C:...index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)

This one gives me the error "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.

The second way I tried was with 7-Zip (via py7zr), because by hand it worked fine:

with py7zr.SevenZipFile("C:...index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:...Json")

This one gives me the error OSError: [Errno 22] Invalid argument at the "with py7zr…" line.

I really don't understand where the problem is. Why does it work by hand but not in Python?
Thanks

Asked By: Manu Add1


Answers:

You didn’t close your file, so data stuck in user mode buffers isn’t visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
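You can see this buffering effect in isolation with a small self-contained sketch (using a temporary file, not the original download):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w")
f.write("some data")  # sits in Python's user-mode buffer (default ~8 KiB)

# A second handle reads from disk, where nothing has been flushed yet
with open(path) as other:
    print(repr(other.read()))  # '' -- the data is still stuck in the buffer

f.close()  # flushes the buffer to disk

with open(path) as other:
    print(repr(other.read()))  # 'some data'
```

The same thing happens to your download: until the handle is closed (or flushed), 7-Zip or lzma reading the file by path sees a truncated stream.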

The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it’s most important for files you’re writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.

Rewriting your first block of code to be completely safe gets you:

with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:...index_en.txt.lzma','wb') as index_en:
    index_en.write(r.content)

Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn’t get corrupted (e.g. "C:foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
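A quick way to see that escape corruption for yourself (a minimal illustration, unrelated to your actual paths):

```python
# "\f" is the form feed escape, so the backslash and the 'f' vanish
plain = "C:\foo"
raw = r"C:\foo"   # raw string: backslash kept literally

print(list(plain))  # ['C', ':', '\x0c', 'o', 'o'] -- 5 characters, no backslash
print(list(raw))    # ['C', ':', '\\', 'f', 'o', 'o'] -- 6 characters, as intended
```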

You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):

# stream=True means underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:...index_en.txt.lzma','wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))

Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn’t consumed and freed immediately).

Also note that print(compressed.readline) is doing nothing (because you didn’t call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is not such a garbage line, and if you’d called readline properly (with print(compressed.readline())), it would have broken decompression because the file pointer would now have skipped the first few (or many) bytes of the file, landing at some mostly random offset.

Lastly,

with py7zr.SevenZipFile("C:...index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:...Json")

is wrong because you passed it a mode indicating you’re opening it for write, when you’re clearly attempting to read from it; either omit the 'w' or change it to 'r'.

Answered By: ShadowRanger