Why does Python compile modules but not the script being run?

Question:

Why does Python compile libraries that are used in a script, but not the script being called itself?

For instance,

If there is main.py and module.py, and Python is run by doing python main.py, there will be a compiled file module.pyc but not one for main. Why?

  1. If the response is potential disk permissions for the directory of main.py, why does Python compile modules? They are just as likely (if not more likely) to appear in a location where the user does not have write access. Python could compile main if it is writable, or alternatively in another directory.

  2. If the reason is that benefits will be minimal, consider the situation when the script will be used a large number of times (such as in a CGI application).

Asked By: Mike


Answers:

Because the script being run may be somewhere where it is inappropriate to generate .pyc files, such as /usr/bin.

Since:

A program doesn’t run any faster when it is read from a .pyc or .pyo file than when it is read from a .py file; the only thing that’s faster about .pyc or .pyo files is the speed with which they are loaded.

So it is unnecessary to generate a .pyc file for the main script; only the libraries, which might be loaded many times, should be compiled.

Edited:

It seems you didn’t get my point. First, know that the whole idea of compiling into a .pyc file is to make the same file execute faster the second time. However, suppose Python did compile the script being run: the interpreter would write bytecode into a .pyc file on the first run, and that takes time, so the first run would actually be a bit slower. You might argue that it would run faster afterwards. Well, that is just a design choice. Plus, as this says:

Explicit is better than implicit.

If one wants a speedup by using .pyc file, one should compile it manually and run the .pyc file explicitly.
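For instance, a script can be byte-compiled explicitly with the standard-library py_compile module. A minimal sketch (the throwaway main.py written here is just an illustration):

```python
import os
import py_compile
import tempfile

# Create a throwaway "main.py" to stand in for a real script.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "main.py")
with open(src, "w") as f:
    f.write("print('hello')\n")

# Compile it explicitly; py_compile.compile() returns the path of the
# bytecode file it wrote (under __pycache__/ on Python 3).
cached = py_compile.compile(src)
print(os.path.exists(cached))  # True
```

The resulting .pyc can then be passed to the interpreter directly instead of the .py file.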

Answered By: Kabie

Files are compiled upon import. It isn’t a security thing; it is simply that if you import a file, Python saves the output. See this post by Fredrik Lundh on Effbot.

>>> import main
# main.pyc is created

When running a script, Python will not use the *.pyc file.
If you have some other reason to want your script pre-compiled, you can use the compileall module.

python -m compileall .

compileall Usage

python -m compileall --help
option --help not recognized
usage: python compileall.py [-l] [-f] [-q] [-d destdir] [-x regexp] [directory ...]
-l: don't recurse down
-f: force rebuild even if timestamps are up-to-date
-q: quiet operation
-d destdir: purported directory name for error messages
   if no directory arguments, -l sys.path is assumed
-x regexp: skip files matching the regular expression regexp
   the regexp is searched for in the full path of the file

If the response is potential disk permissions for the directory of main.py, why does Python compile modules?

Modules and scripts are treated the same. Importing is what triggers the output to be saved.
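That importing is the trigger can be checked directly. A minimal demonstration (the file name module.py is just an illustration; modern Python 3 writes the cache under __pycache__):

```python
import os
import subprocess
import sys
import tempfile

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "module.py"), "w") as f:
    f.write("x = 1\n")
cache_dir = os.path.join(tmp, "__pycache__")

# Run the file directly as a script: no bytecode cache is written for it.
subprocess.run([sys.executable, "module.py"], cwd=tmp, check=True)
ran_direct = os.path.isdir(cache_dir)
print(ran_direct)   # False

# Import the same file instead: the cache appears.
subprocess.run([sys.executable, "-c", "import module"], cwd=tmp, check=True)
ran_import = os.path.isdir(cache_dir)
print(ran_import)   # True
```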

If the reason is that benefits will be minimal, consider the situation when the script will be used a large number of times (such as in a CGI application).

Using compileall does not solve this. Scripts executed by Python will not use the *.pyc unless the .pyc is invoked explicitly, which has the negative side effects well stated by Glenn Maynard in his answer.

The example given of a CGI application should really be addressed by using a technique like FastCGI. If you want to eliminate the overhead of compiling your script, you may also want to eliminate the overhead of starting up Python, not to mention database connection overhead.

A light bootstrap script can be used, or even python -c "import script", though both are of questionable style.
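The bootstrap pattern can be sketched as follows (the file names script.py and bootstrap.py are hypothetical; the demo writes both to a temporary directory so it is self-contained):

```python
import os
import subprocess
import sys
import tempfile

tmp = tempfile.mkdtemp()
# The real logic lives in a module...
with open(os.path.join(tmp, "script.py"), "w") as f:
    f.write("def main():\n    print('work done')\n")
# ...and a one-line bootstrap imports it, so it gets byte-compiled and cached.
with open(os.path.join(tmp, "bootstrap.py"), "w") as f:
    f.write("import script\nscript.main()\n")

out = subprocess.run([sys.executable, "bootstrap.py"], cwd=tmp,
                     check=True, capture_output=True, text=True)
print(out.stdout.strip())  # work done

# script.py was imported, so it is cached; bootstrap.py itself (run as
# __main__) is not.
cache = os.listdir(os.path.join(tmp, "__pycache__"))
script_cached = any(n.startswith("script.") for n in cache)
bootstrap_cached = any(n.startswith("bootstrap.") for n in cache)
print(script_cached, bootstrap_cached)  # True False
```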

Answered By: kevpie

To answer your question, refer to section 6.1.3, “Compiled” Python files, in the official Python documentation.

When a script is run by giving its name on the command line, the bytecode for the script is never written to a .pyc or .pyo file. Thus, the startup time of a script may be reduced by moving most of its code to a module and having a small bootstrap script that imports that module. It is also possible to name a .pyc or .pyo file directly on the command line.

Answered By: Fang-Pen Lin

Nobody seems to want to say this, but I’m pretty sure the answer is simply: there’s no solid reason for this behavior.

All of the reasons given so far are essentially incorrect:

  • There’s nothing special about the main file. It’s loaded as a module, and shows up in sys.modules like any other module. Running a main script is nothing more than importing it with a module name of __main__.
  • There’s no problem with failing to save .pyc files due to read-only directories; Python simply ignores it and moves on.
  • The benefit of caching a script is the same as that of caching any module: not wasting time recompiling the script every time it’s run. The docs acknowledge this explicitly (“Thus, the startup time of a script may be reduced …”).

Another issue to note: if you run python foo.py and foo.pyc exists, it will not be used. You have to explicitly say python foo.pyc. That’s a very bad idea: it means Python won’t automatically recompile the .pyc file when it’s out of sync (due to the .py file changing), so changes to the .py file won’t be used until you manually recompile it. It’ll also fail outright with a RuntimeError if you upgrade Python and the .pyc file format is no longer compatible, which happens regularly. Normally, this is all handled transparently.

You shouldn’t need to move a script to a dummy module and set up a bootstrapping script to trick Python into caching it. That’s a hackish workaround.

The only possible (and very unconvincing) reason I can contrive is to keep your home directory from being cluttered with a bunch of .pyc files. (This isn’t a real reason; if that were an actual concern, then .pyc files would be saved as dotfiles.) It’s certainly no reason not to even have an option to do this.

Python should definitely be able to cache the main module.

Answered By: Glenn Maynard

Pedagogy

I love and hate questions like this on SO, because there’s a complex mixture of emotion, opinion, and educated guessing going on and people start to get snippy, and somehow everybody loses track of the actual facts and eventually loses track of the original question altogether.

Many technical questions on SO have at least one definitive answer (e.g. an answer that can be verified by execution or an answer that cites an authoritative source) but these “why” questions often do not have just a single, definitive answer. In my mind, there are 2 possible ways to definitively answer a “why” question in computer science:

  1. By pointing to the source code that implements the item of concern. This explains “why” in a technical sense: what preconditions are necessary to evoke this behavior?
  2. By pointing to human-readable artifacts (comments, commit messages, email lists, etc.) written by the developers involved in making that decision. This is the real sense of “why” that I assume the OP is interested in: why did Python’s developers make this seemingly arbitrary decision?

The second type of answer is more difficult to corroborate, since it requires getting in the mind of the developers who wrote the code, especially if there’s no easy-to-find, public documentation explaining a particular decision.

To date, this thread has 7 answers that solely focus on reading the intent of Python’s developers and yet there is only one citation in the whole batch. (And it cites a section of the Python manual that does not answer the OP’s question.)

Here’s my attempt at answering both of the sides of the “why” question along with citations.

Source Code

What are the preconditions that trigger compilation of a .pyc? Let’s look at the source code. (Annoyingly, the Python mirror on GitHub doesn’t have any release tags, so I’ll just tell you that I’m looking at 715a6e.)

There is promising code in import.c:989 in the load_source_module() function. I’ve cut out some bits here for brevity.

static PyObject *
load_source_module(char *name, char *pathname, FILE *fp)
{
    // snip...

    if (/* Can we read a .pyc file? */) {
        /* Then use the .pyc file. */
    }
    else {
        co = parse_source_module(pathname, fp);
        if (co == NULL)
            return NULL;
        if (Py_VerboseFlag)
            PySys_WriteStderr("import %s # from %s\n",
                name, pathname);
        if (cpathname) {
            PyObject *ro = PySys_GetObject("dont_write_bytecode");
            if (ro == NULL || !PyObject_IsTrue(ro))
                write_compiled_module(co, cpathname, &st);
        }
    }
    m = PyImport_ExecCodeModuleEx(name, (PyObject *)co, pathname);
    Py_DECREF(co);

    return m;
}

pathname is the path to the module and cpathname is the same path but with a .pyc extension. The only direct logic is the boolean sys.dont_write_bytecode. The rest of the logic is just error handling. So the answer we seek isn’t here, but we can at least see that any code that calls this will result in a .pyc file under most default configurations. The parse_source_module() function has no real relevance to the flow of execution, but I’ll show it here because I’ll come back to it later.

static PyCodeObject *
parse_source_module(const char *pathname, FILE *fp)
{
    PyCodeObject *co = NULL;
    mod_ty mod;
    PyCompilerFlags flags;
    PyArena *arena = PyArena_New();
    if (arena == NULL)
        return NULL;

    flags.cf_flags = 0;

    mod = PyParser_ASTFromFile(fp, pathname, Py_file_input, 0, 0, &flags, 
                   NULL, arena);
    if (mod) {
        co = PyAST_Compile(mod, pathname, NULL, arena);
    }
    PyArena_Free(arena);
    return co;
}

The salient aspect here is that the function parses and compiles a file and returns a pointer to the byte code (if successful).
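As an aside, the sys.dont_write_bytecode check seen in load_source_module() is the same switch that the -B command-line option (and the PYTHONDONTWRITEBYTECODE environment variable) flips. A minimal demonstration that -B suppresses cache writing even on import:

```python
import os
import subprocess
import sys
import tempfile

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "mod.py"), "w") as f:
    f.write("x = 1\n")

# -B sets sys.dont_write_bytecode, so importing writes no cache at all.
subprocess.run([sys.executable, "-B", "-c", "import mod"], cwd=tmp, check=True)
suppressed = os.path.isdir(os.path.join(tmp, "__pycache__"))
print(suppressed)  # False
```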

Now we’re still at a dead end, so let’s approach this from a new angle. How does Python load its argument and execute it? In pythonrun.c there are a few functions for loading code from a file and executing it. PyRun_AnyFileExFlags() can handle both interactive and non-interactive file descriptors. For interactive file descriptors, it delegates to PyRun_InteractiveLoopFlags() (this is the REPL), and for non-interactive file descriptors, it delegates to PyRun_SimpleFileExFlags(). PyRun_SimpleFileExFlags() checks whether the filename ends in .pyc; if it does, it calls run_pyc_file(), which loads compiled byte code directly from a file descriptor and then runs it.

In the more common case (i.e. .py file as an argument), PyRun_SimpleFileExFlags() calls PyRun_FileExFlags(). This is where we start to find our answer.

PyObject *
PyRun_FileExFlags(FILE *fp, const char *filename, int start, PyObject *globals,
          PyObject *locals, int closeit, PyCompilerFlags *flags)
{
    PyObject *ret;
    mod_ty mod;
    PyArena *arena = PyArena_New();
    if (arena == NULL)
        return NULL;

    mod = PyParser_ASTFromFile(fp, filename, start, 0, 0,
                   flags, NULL, arena);
    if (closeit)
        fclose(fp);
    if (mod == NULL) {
        PyArena_Free(arena);
        return NULL;
    }
    ret = run_mod(mod, filename, globals, locals, flags, arena);
    PyArena_Free(arena);
    return ret;
}

static PyObject *
run_mod(mod_ty mod, const char *filename, PyObject *globals, PyObject *locals,
     PyCompilerFlags *flags, PyArena *arena)
{
    PyCodeObject *co;
    PyObject *v;
    co = PyAST_Compile(mod, filename, flags, arena);
    if (co == NULL)
        return NULL;
    v = PyEval_EvalCode(co, globals, locals);
    Py_DECREF(co);
    return v;
}

The salient point here is that these two functions serve essentially the same purpose as the importer’s load_source_module() and parse_source_module(): they call the parser to create an AST from Python source code and then call the compiler to create byte code.

So are these blocks of code redundant or do they serve different purposes? The difference is that one block loads a module from a file, while the other block takes a module as an argument. That module argument is — in this case — the __main__ module, which is created earlier in the initialization process using a low-level C function. The __main__ module doesn’t go through most of the normal module import code paths because it is so unique, and as a side effect, it doesn’t go through the code that produces .pyc files.

To summarize: the reason why the __main__ module isn’t compiled to .pyc is that it isn’t “imported”. Yes, it appears in sys.modules, but it gets there via a very different code path than real module imports take.
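You can see one half of this from any running script: the __main__ entry is there in sys.modules, even though the import machinery never touched it. A tiny check:

```python
import sys

# The running script is registered in sys.modules under the name __main__,
# just like an imported module, even though it arrived via a different
# code path than a real import.
print("__main__" in sys.modules)         # True
print(sys.modules["__main__"].__name__)  # __main__
```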

Developer Intent

Okay, so we can now see that the behavior has more to do with the design of Python than with any clearly expressed rationale in the source code, but that doesn’t answer the question of whether this is an intentional decision or just a side effect that doesn’t bother anybody enough to be worth changing. One of the benefits of open source is that once we’ve found the source code that interests us, we can use the VCS to help trace back to the decisions that led to the present implementation.

One of the pivotal lines of code here (m = PyImport_AddModule("__main__");) dates back to 1990 and was written by the BDFL himself, Guido. It has been modified in intervening years, but the modifications are superficial. When it was first written, the main module for a script argument was initialized like this:

int
run_script(fp, filename)
    FILE *fp;
    char *filename;
{
    object *m, *d, *v;
    m = add_module("__main__");
    if (m == NULL)
        return -1;
    d = getmoduledict(m);
    v = run_file(fp, filename, file_input, d, d);
    flushline();
    if (v == NULL) {
        print_error();
        return -1;
    }
    DECREF(v);
    return 0;
}

This existed before .pyc files were even introduced into Python! Small wonder that the design at that time didn’t take compilation into account for script arguments. The commit message enigmatically says:

“Compiling” version

This was one of several dozen commits over a 3 day period… it appears that Guido was deep into some hacking/refactoring and this was the first version that got back to being stable. This commit even predates the creation of the Python-Dev mailing list by about five years!

Saving the compiled bytecode was introduced 6 months later, in 1991.

This still predates the mailing list, so we have no real record of what Guido was thinking. It appears that he simply thought the importer was the best place to hook in for the purpose of caching bytecode. Whether he considered doing the same for __main__ is unclear: either it didn’t occur to him, or he thought it was more trouble than it was worth.

I can’t find any bugs on bugs.python.org that are related to caching the bytecodes for the main module, nor can I find any messages on the mailing list about it, so apparently nobody else thinks it’s worth the trouble to try adding it.

To summarize: the reason why all modules are compiled to .pyc except __main__ is that it’s a quirk of history. The design and implementation for how __main__ works was baked into the code before .pyc files even existed. If you want to know more than that, you’ll need to e-mail Guido and ask.

Glenn Maynard’s answer says:

Nobody seems to want to say this, but I’m pretty sure the answer is simply: there’s no solid reason for this behavior.

I agree 100%. There’s circumstantial evidence to support this theory and nobody else in this thread has provided a single shred of evidence to support any other theory. I upvoted Glenn’s answer.

Answered By: Mark E. Haase

Because different versions of Python (3.6, 3.7 …) have different bytecode representations, and trying to design a compile system for that was deemed too complicated. PEP 3147 discusses the rationale.
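The PEP 3147 scheme can be inspected from the standard library: each cached file is tagged with the interpreter that produced it, so caches from several Python versions can coexist under __pycache__. (The exact tag shown depends on your interpreter.)

```python
import importlib.util
import sys

# The version-specific tag embedded in cached bytecode file names.
print(sys.implementation.cache_tag)  # e.g. cpython-311

# Where the cache for a given source file would live.
cached = importlib.util.cache_from_source("main.py")
print(cached)  # e.g. __pycache__/main.cpython-311.pyc
```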

Answered By: gerardw