Excluding directories in os.walk

Question:

I’m writing a script that descends into a directory tree (using os.walk()) and then visits each file matching a certain file extension. However, since some of the directory trees that my tool will be used on also contain sub directories that in turn contain a LOT of useless (for the purpose of this script) stuff, I figured I’d add an option for the user to specify a list of directories to exclude from the traversal.

This is easy enough with os.walk(). After all, it’s up to me to decide whether I actually want to visit the respective files / dirs yielded by os.walk() or just skip them. The problem is that if I have, for example, a directory tree like this:

root--
     |
     --- dirA
     |
     --- dirB
     |
     --- uselessStuff --
                       |
                       --- moreJunk
                       |
                       --- yetMoreJunk

and I want to exclude uselessStuff and all its children, os.walk() will still descend into all the (potentially thousands of) sub directories of uselessStuff, which, needless to say, slows things down a lot. In an ideal world, I could tell os.walk() to not even bother yielding any more children of uselessStuff, but to my knowledge there is no way of doing that (is there?).

Does anyone have an idea? Maybe there’s a third-party library that provides something like that?

Asked By: antred

||

Answers:

Modifying dirs in-place will prune the (subsequent) files and directories visited by os.walk:

# exclude = set(['New folder', 'Windows', 'Desktop'])
for root, dirs, files in os.walk(top, topdown=True):
    dirs[:] = [d for d in dirs if d not in exclude]

From help(os.walk):

When topdown is true, the caller can modify the dirnames list in-place
(e.g., via del or slice assignment), and walk will only recurse into
the subdirectories whose names remain in dirnames; this can be used to
prune the search…

Answered By: unutbu

… an alternative form of @unutbu’s excellent answer that reads a little more directly, given that the intent is to exclude directories, at the cost of O(n**2) vs O(n) time.

(Making a copy of the dirs list with list(dirs) is required for correct execution)

# exclude = set([...])
for root, dirs, files in os.walk(top, topdown=True):
    [dirs.remove(d) for d in list(dirs) if d in exclude]
Answered By: Dmitri
Categories: questions Tags:
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.