Do I understand os.walk right?

Question:

The loop for root, dir, file in os.walk(startdir) works through these steps?

for root in os.walk(startdir) 
    for dir in root 
        for files in dir
  1. get root of start dir : C:\dir1\dir2\startdir

  2. get folders in C:\dir1\dir2\startdir and return list of folders "dirlist"

  3. get files in the first dirlist item and return the list of files "filelist" as the first item of a list of filelists.

  4. move to the second item in dirlist and return the list of files in this folder "filelist2" as the second item of a list of filelists. etc.

  5. move to the next root in the folder tree and start from 2. etc.

Right? Or does it just get all roots first, then all dirs second, and all files third?

Asked By: Baf


Answers:

os.walk works a little differently than above. Basically, it returns tuples of (path, directories, files). To see this, try the following:

import pprint
import os
pp=pprint.PrettyPrinter(indent=4)
for dir_tuple in os.walk("/root"):
    pp.pprint(dir_tuple)

…you’ll see that each iteration of the loop prints a directory name, a list of the names of any directories immediately within that directory, and another list of all files within that directory. os.walk will then enter each directory in the list of subdirectories and do the same thing, until all subdirectories of the original root have been traversed. It may help to learn a little about recursion to understand how this works.
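To make the recursion concrete, here is a minimal sketch of how a top-down walk could be written by hand. The name simple_walk is hypothetical and this is not the actual implementation of os.walk; it only mirrors the behaviour described above and assumes every directory is readable.

```python
import os

def simple_walk(top):
    """Hypothetical, simplified stand-in for a top-down os.walk(top)."""
    dirs, files = [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # Sort each entry into the "directories" or the "files" list.
            if entry.is_dir():
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    # Report this directory first ...
    yield top, dirs, files
    # ... then do the same for each subdirectory, one after the other.
    for name in dirs:
        path = os.path.join(top, name)
        if not os.path.islink(path):  # do not follow symlinked directories
            yield from simple_walk(path)
```

For an ordinary tree, looping over simple_walk(some_dir) should produce the same tuples as os.walk(some_dir). Unlike os.walk, this sketch raises an exception if a directory cannot be listed.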

Answered By: Alex Westholm

Here’s a short example of how os.walk() works along with some explanation using a few os functions.

First note that each iteration of os.walk() yields three items: the current root directory, a list of the directories (dirs) immediately below that root, and a list of the files found directly in that root. The documentation will give you more information.

dirs will contain a list of directories just below root, and files will contain a list of the files found directly in root (not the files inside those subdirectories). In later iterations, each directory from the previous dirs list takes on the role of root in turn. By default the walk is depth-first: it descends through a subdirectory and everything below it before moving on to that subdirectory's next sibling.
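If you want to check the visiting order yourself, a small sketch like the following may help. The function name visit_order and the directory layout (a, a/b, c) are made up for illustration; dirs is sorted in place only so that the order is the same on every system.

```python
import os
import tempfile

def visit_order(top):
    """Return the directories in the order os.walk visits them, relative to top."""
    order = []
    for root, dirs, files in os.walk(top):
        dirs.sort()  # in-place sort, so os.walk descends in alphabetical order
        order.append(os.path.relpath(root, top))
    return order

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: a/, a/b/ and c/
    os.makedirs(os.path.join(tmp, "a", "b"))
    os.mkdir(os.path.join(tmp, "c"))
    print(visit_order(tmp))
```

The nested directory a/b shows up before c: the walk finishes everything below a before it moves on to a's sibling.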

A code example: This will search for, count and print the names of .jpg and .gif files below the specified search directory (your root). It also makes use of the os.path.splitext() function to separate the base of the file from its extension and the os.path.join() function to give you the full name including path of the image files found.

import os

searchdir = r'C:\your_root_dir'  # your search starts in this directory (your root)

count = 0
for root, dirs, files in os.walk(searchdir):
    for name in files:
        (base, ext) = os.path.splitext(name) # split base and extension
        if ext in ('.jpg', '.gif'):          # check the extension
            count += 1
            full_name = os.path.join(root, name) # create full path
            print(full_name)

print('\ntotal number of .jpg and .gif files found: %d' % count)
Answered By: Levon

os.walk returns a generator that yields a tuple of values (current_path, directories in current_path, files in current_path) for each directory it visits.

Each time the generator is advanced it yields the tuple for the next directory, descending recursively until no further sub-directories remain below the initial directory that walk was called upon.

As such,

next(os.walk(r'C:\dir1\dir2\startdir'))[0] # returns r'C:\dir1\dir2\startdir'
next(os.walk(r'C:\dir1\dir2\startdir'))[1] # returns all the dirs in r'C:\dir1\dir2\startdir'
next(os.walk(r'C:\dir1\dir2\startdir'))[2] # returns all the files in r'C:\dir1\dir2\startdir'

So

import os
....
for path, directories, files in os.walk(r'C:\dir1\dir2\startdir'):
    if file in files:  # file holds the name you are looking for
        print('found %s' % os.path.join(path, file))

or this

def search_file(directory = None, file = None):
    assert os.path.isdir(directory)
    for cur_path, directories, files in os.walk(directory):
        if file in files:
            # cur_path already starts with directory, so don't join it again
            return os.path.join(cur_path, file)
    return None

or, if you want to write the recursion yourself, you can do this:

import os
def search_file(directory = None, file = None):
    assert os.path.isdir(directory)
    current_path, directories, files = next(os.walk(directory))
    if file in files:
        return os.path.join(directory, file)
    elif not directories:  # directories is a list, so test for emptiness
        return None
    else:
        for new_directory in directories:
            result = search_file(directory = os.path.join(directory, new_directory), file = file)
            if result:
                return result
        return None
Answered By: Samy Vilar

In simple words, os.walk() generates a tuple of (path, folders, files) for the given path and then keeps traversing the subfolders, generating one such tuple for each.

import os
startpath = input(" enter the path\n")
for path, subdir, files in os.walk(startpath):
    for name in subdir:
        print(os.path.join(path, name)) # will print path of directories
    for name in files:
        print(os.path.join(path, name)) # will print path of files

This will print the paths of all subdirectories and files, including files in subdirectories.

Answered By: shadow0359

Minimal runnable example

This is how I like to learn stuff:

mkdir root
cd root
mkdir \
  d0 \
  d1 \
  d0/d0_d1
touch \
  f0 \
  d0/d0_f0 \
  d0/d0_f1 \
  d0/d0_d1/d0_d1_f0
tree

Output:

.
├── d0
│   ├── d0_d1
│   │   └── d0_d1_f0
│   ├── d0_f0
│   └── d0_f1
├── d1
└── f0

main.py

#!/usr/bin/env python3
import os
for path, dirnames, filenames in os.walk('root'):
    print('{} {} {}'.format(repr(path), repr(dirnames), repr(filenames)))

Output:

'root' ['d0', 'd1'] ['f0']
'root/d0' ['d0_d1'] ['d0_f0', 'd0_f1']
'root/d0/d0_d1' [] ['d0_d1_f0']
'root/d1' [] []

This makes everything clear:

  • path is the root directory of each step
  • dirnames is a list of directory basenames in each path
  • filenames is a list of file basenames in each path

Tested on Ubuntu 16.04, Python 3.5.2.

Modifying dirnames changes the tree recursion

This is basically the only other thing you have to keep in mind.

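E.g., if you modify dirnames in place (sort it, or remove entries from it) while walking top-down, it affects the traversal: os.walk only descends into the names still in the list, in the order they appear there.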
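As one illustration, here is a sketch that prunes the walk by editing dirnames in place. The helper name walk_skipping and the choice to skip d0_d1 are assumptions made for this example, reusing the tree created above.

```python
#!/usr/bin/env python3
import os

def walk_skipping(top, skip_names):
    """Yield (path, dirnames, filenames) like os.walk, but do not descend
    into directories whose basename is in skip_names."""
    for path, dirnames, filenames in os.walk(top):
        # Slice assignment edits the very list os.walk holds on to; a plain
        # "dirnames = [...]" would rebind the name and change nothing.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_names)
        yield path, dirnames, filenames

for path, dirnames, filenames in walk_skipping('root', {'d0_d1'}):
    print(path, dirnames, filenames)
```

Run against the example tree, root/d0/d0_d1 is never visited, and the remaining directories come out in sorted order. This only works with the default topdown=True, because with topdown=False the subdirectories have already been walked by the time dirnames is handed to you.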

Walk file or directory

If the input to traverse is either a file or directory, you can handle it like this:

#!/usr/bin/env python3

import os
import sys

def walk_file_or_dir(root):
    if os.path.isfile(root):
        dirname, basename = os.path.split(root)
        yield dirname, [], [basename]
    else:
        for path, dirnames, filenames in os.walk(root):
            yield path, dirnames, filenames

for path, dirnames, filenames in walk_file_or_dir(sys.argv[1]):
    print(path, dirnames, filenames)

My answer is very basic and plain. I am a beginner myself and found out my answers searching the web (see esp. the good documentation at docs.python.org) and trying some test code, such as this one:

for root, dirs, files in os.walk(startdir):
    print("__________________")
    print(root)
    for file in files:
        print("---", file)

This prints out the directory tree, where each dir—the starting dir and the included subdirs—is preceded by a line and followed by the files contained in it.

I think you have to keep in mind two things:

(1) os.walk generates a 3-tuple (a triple) <root,dirs,filenames> where

  • root is a string containing the name of the root dir;

  • dirs is a list of strings: the directory names directly contained in root, that is, at the first level, without the subdirs possibly included in
    them;

  • filenames is a list of strings: the filenames directly contained in root.

(2) a for loop such as

for root, subdirs, files in os.walk(YourStartDir):

loops through root dir and all of its subdirs. It doesn’t take a step for each file; it just scans the directory tree and at each step (for each dir in the tree) it fills up the list of the file names contained in it and the list of subdirs directly contained in it. If you have n dirs (including root and its subdirs), the for loop loops n times, i.e. it takes n steps. You can write a short bit of test code to check this, e.g. using a counter.
At each step, it generates a 3-tuple: a string plus two (possibly empty) lists of strings.
In this example the elements of the 3-tuple are called: "root", "subdirs", "files", but these names are up to you; if your code is

for a, b, c in os.walk(startdir):

the elements of the 3-tuple will be called "a", "b", "c".
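The claim that the loop takes one step per directory, not one per file, can be checked with a counter. This is only a sketch: count_steps is a made-up name, and the little directory tree is created in a temporary folder just for the test.

```python
import os
import tempfile

def count_steps(startdir):
    """Return (loop iterations, subdirectory names seen, file names seen)."""
    steps = n_dirs = n_files = 0
    for root, dirs, files in os.walk(startdir):
        steps += 1             # one iteration per directory visited
        n_dirs += len(dirs)    # subdirectories listed at this step
        n_files += len(files)  # files listed at this step
    return steps, n_dirs, n_files

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: sub1/ and sub1/sub2/, with one file at each level.
    os.makedirs(os.path.join(tmp, "sub1", "sub2"))
    for rel in ("a.txt",
                os.path.join("sub1", "b.txt"),
                os.path.join("sub1", "sub2", "c.txt")):
        open(os.path.join(tmp, rel), "w").close()
    print(count_steps(tmp))  # → (3, 2, 3)
```

Three directories (the start dir plus two subdirs) give three steps, even though there are also three files. In a tree without symlinked or unreadable directories, the number of steps is the total number of subdirectories plus one for the start dir.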

Let’s go back to the test code:

for root, dirs, files in os.walk(startdir):
    print("__________________")
    print(root)
    for file in files:
        print("---", file)

First loop: root is the dir you have given as input (the starting path, the starting dir: a string), dirs is the list of the names of the subdirectories directly inside it (but not the names of the dirs nested inside them), files is the list of the files directly inside it. In the test code we are not using the list "dirs".

Second loop: root is now the first subdir, dirs is a list of the subdirs included in it, files is a list of the files included in it.

and so on, until you reach the last subdir in the tree.

There are three optional arguments to os.walk (topdown, onerror and followlinks): you can find lots of info about them and their use on the web, but I think that your question is about the basics of os.walk.
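Just to give a taste of one of those optional arguments, here is a small sketch comparing the default order with topdown=False. The function name walk_orders and the a/b layout are invented for the example; the other two arguments (onerror and followlinks) are not shown.

```python
import os
import tempfile

def walk_orders(top):
    """Return the visiting order for topdown=True and for topdown=False."""
    top_down = [os.path.relpath(p, top)
                for p, dirs, files in os.walk(top, topdown=True)]
    bottom_up = [os.path.relpath(p, top)
                 for p, dirs, files in os.walk(top, topdown=False)]
    return top_down, bottom_up

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: a single chain a/b below the start dir.
    os.makedirs(os.path.join(tmp, "a", "b"))
    top_down, bottom_up = walk_orders(tmp)
    print(top_down)   # start dir first, then a, then a/b
    print(bottom_up)  # a/b first, then a, then the start dir
```

With topdown=False each directory is reported only after everything below it, which is handy when you want to do something to the children before the parent, such as removing emptied folders.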

Answered By: archie