Why would Python tarfile.extract extract a directory with its contents in one environment but only the empty directory in another?

Question:

I am trying to work with tar files using Python in an application running in a Docker container. Since I don’t use tarfile regularly, I drafted some code and ran it in my local Python environment so I could verify that it worked the way I want before I run it in the container.

For simplicity, say I have a directory a/ that contains a file test_file.py, and I tarred up a/ recursively into a tar file, file.tar.gz. When I check the contents of the tar file, I see both the directory and the text file (a/ and a/test_file.py). When I extract the directory in my local environment, I get the directory and its contents. When I run the same code in my container, I get only the empty directory with no files. After hitting this problem, I did some searching and found posts like "Extract only a single directory from tar (in python)" that recommend explicitly listing every file you want to extract rather than assuming it will come along with its parent directory, so I can do that. But…

What’s bothering me is this inconsistent behavior between my two environments! My local environment is using Python 3.9.15 on Ubuntu 22.04. My Docker container is using Python 3.10.9 on Ubuntu 18.04. So the environments are different, but it seems weird to have this discrepancy still.

Here’s some sample code. I know I mentioned using the extract method earlier. I tried using that first, then switched to extractall with the members kwarg later, and I get the same behavior — the file comes too in my local environment but not in the Docker container.

import logging
import os
from pathlib import Path
import tarfile
import tempfile


def main():
    with tempfile.TemporaryDirectory() as temp_dir:
        # make a tar file with one directory that contains one file
        file_path = Path(temp_dir) / "file.tar.gz"
        os.chdir(temp_dir)
        logging.info("Changed dir to %s", os.getcwd())
        subdir = "a"
        os.mkdir(subdir)
        with open(Path(subdir) / "test_file.py", "w") as file_obj:
            file_obj.write("import this\n")
        with tarfile.open(file_path, "w:gz") as tar:
            tar.add(subdir, recursive=True)

        # unpack tar file
        with tarfile.open(file_path) as tar:
            file_names = tar.getnames()
            logging.info(file_names)
            members = [tar.getmember(subdir)]
            tar.extractall(path=temp_dir, members=members)

        # check contents of extracted directory
        dir_path = Path(temp_dir) / subdir
        output = os.listdir(dir_path)
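        logging.info("%s contents: %s", dir_path, output)


if __name__ == "__main__":
    # show INFO-level messages; the root logger defaults to WARNING
    logging.basicConfig(level=logging.INFO)
    main()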

Logging shows that the directory and the file are in the tar file (from call to tar.getnames()) in both environments, but the output of the last listdir call is the file in one environment and an empty list in the other.

Asked By: Cheryl Danner


Answers:

There are two problems. I assume that in the environment where you got only an empty subdirectory, you ran only the unpacking part.

  1. You never remove the original files, and you extract into the same directory. Thus, when you unpack, it is not possible to distinguish whether the files you see on the disk are the result of unpacking, or the original files. Putting shutil.rmtree(subdir) after you create the tarball should solve this.

  2. Once you solve the first problem, you will see that the result is always just the empty directory. This is because you explicitly request it: your members list contains only the TarInfo for "a", so only "a" is extracted, just as the post you linked warned. Removing the members=members parameter, or using members = [tar.getmember("a"), tar.getmember("a/test_file.py")], will get you the desired result. Even [tar.getmember("a/test_file.py")] alone will be fine: the parent directory is created for you in that case even though it is not listed for extraction.
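
Both points can be checked with a small script. This is only a sketch: the helper names build_archive and extract_listing and the out1/out2/out3 directories are made up for illustration, and each extraction goes into a fresh directory so leftover originals cannot be mistaken for extracted files.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def build_archive(work_dir: Path) -> Path:
    """Create a/test_file.py, tar it up, then delete the originals."""
    subdir = work_dir / "a"
    subdir.mkdir()
    (subdir / "test_file.py").write_text("import this\n")
    archive = work_dir / "file.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(subdir, arcname="a", recursive=True)
    # point 1: remove the originals so they cannot mask the result
    shutil.rmtree(subdir)
    return archive


def extract_listing(archive: Path, dest: Path, select) -> list:
    """Extract the members accepted by select() into dest; list dest/a."""
    dest.mkdir()
    with tarfile.open(archive) as tar:
        members = [m for m in tar.getmembers() if select(m)]
        tar.extractall(path=dest, members=members)
    return sorted(p.name for p in (dest / "a").iterdir())


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    archive = build_archive(work)

    # point 2: only the directory member is requested, nothing inside it
    only_dir = extract_listing(archive, work / "out1", lambda m: m.name == "a")

    # the directory plus everything stored under it
    whole_tree = extract_listing(
        archive,
        work / "out2",
        lambda m: m.name == "a" or m.name.startswith("a/"),
    )

    # only the file; the parent directory is created on the way
    only_file = extract_listing(
        archive, work / "out3", lambda m: m.name == "a/test_file.py"
    )

print(only_dir)    # []
print(whole_tree)  # ['test_file.py']
print(only_file)   # ['test_file.py']
```

Filtering on the "a/" prefix avoids naming every file by hand when the directory holds more than one. On newer Python versions, calling extractall without a filter argument may emit a DeprecationWarning; the sketch leaves it out because the Python versions mentioned in the question predate that parameter.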

Answered By: Amadan