How can I traverse a file system with a generator?
Question:
I’m trying to create a utility class for traversing all the files in a directory, including those within subdirectories and sub-subdirectories. I tried to use a generator because generators are cool; however, I hit a snag.
def grab_files(directory):
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            yield grab_files(full_path)
        elif os.path.isfile(full_path):
            yield full_path
        else:
            print('Unidentified name %s. It could be a symbolic link' % full_path)
When the generator reaches a directory, it simply yields the memory location of the new generator; it doesn’t give me the contents of the directory.
How can I make the generator yield the contents of the directory instead of a new generator?
If there’s already a simple library function to recursively list all the files in a directory structure, tell me about it. I don’t intend to replicate a library function.
Answers:
Why reinvent the wheel when you can use os.walk?
import os
# path is the directory to start walking from
for root, dirs, files in os.walk(path):
    for name in files:
        print(os.path.join(root, name))
os.walk is a generator that walks the directory tree either top-down or bottom-up and, for each directory it visits, yields a 3-tuple of (dirpath, dirnames, filenames).
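Since the question asks for a generator of file paths, the loop above can be wrapped in one. This is only a minimal sketch; the name iter_files is made up for illustration and is not part of the os module:

```python
import os

def iter_files(top):
    """Yield the full path of every file below top, one at a time."""
    # os.walk visits each directory and reports its files by name only,
    # so each name is joined back onto the directory it was found in.
    for root, dirs, files in os.walk(top):
        for name in files:
            yield os.path.join(root, name)
```

It can then be consumed lazily, for example with `for file_path in iter_files('.'): print(file_path)`.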
You can use path.py. Unfortunately the author’s website is no longer around, but you can still download the code from PyPI. This library is a wrapper around path functions in the os module.
path.py provides a walkfiles() method which returns a generator iterating recursively over all files in the directory:
>>> from path import path
>>> print path.walkfiles.__doc__
D.walkfiles() -> iterator over files in D, recursively.
The optional argument, pattern, limits the results to files
with names that match the pattern. For example,
mydir.walkfiles('*.tmp') yields only files with the .tmp
extension.
>>> p = path('/tmp')
>>> p.walkfiles()
<generator object walkfiles at 0x8ca75a4>
>>>
I agree with the os.walk solution.
For purely pedantic purposes, try iterating over the nested generator object instead of yielding it directly:
def grab_files(directory):
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            for entry in grab_files(full_path):
                yield entry
        elif os.path.isfile(full_path):
            yield full_path
        else:
            print('Unidentified name %s. It could be a symbolic link' % full_path)
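On Python 3.3 and later, the inner for loop can be replaced with yield from, which delegates to the nested generator. The following is a sketch of the same function under that assumption; it drops the fallback print branch for brevity, so entries that are neither directories nor regular files are silently skipped:

```python
import os

def grab_files(directory):
    """Recursively yield the path of every regular file under directory."""
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            # Delegate to the nested generator instead of yielding the
            # generator object itself.
            yield from grab_files(full_path)
        elif os.path.isfile(full_path):
            yield full_path
```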
Starting with Python 3.4, you can use the pathlib module:
In [48]: def alliter(p):
   ....:     yield p
   ....:     for sub in p.iterdir():
   ....:         if sub.is_dir():
   ....:             yield from alliter(sub)
   ....:         else:
   ....:             yield sub
   ....:
In [49]: g = alliter(pathlib.Path("."))
In [50]: [next(g) for _ in range(10)]
Out[50]:
[PosixPath('.'),
PosixPath('.pypirc'),
PosixPath('.python_history'),
PosixPath('lshw'),
PosixPath('.gstreamer-0.10'),
PosixPath('.gstreamer-0.10/registry.x86_64.bin'),
PosixPath('.gconf'),
PosixPath('.gconf/apps'),
PosixPath('.gconf/apps/gnome-terminal'),
PosixPath('.gconf/apps/gnome-terminal/%gconf.xml')]
This is essentially the object-oriented version of sjthebat's answer.
Note that the Path.glob pattern ** returns only directories!
An addendum to gerrit's answer: I wanted to make something more flexible. This lists all files in pth matching a given pattern, and can also list directories if only_file is False.
from pathlib import Path
def walk(pth=Path('.'), pattern='*', only_file=True):
    """ list all files in pth matching a given pattern, can also list dirs if only_file is False """
    if pth.match(pattern) and not (only_file and pth.is_dir()):
        yield pth
    for sub in pth.iterdir():
        if sub.is_dir():
            yield from walk(sub, pattern, only_file)
        else:
            if sub.match(pattern):
                yield sub
As of Python 3.4, you can use the glob() method from the built-in pathlib module:
import pathlib
p = pathlib.Path('.')
list(p.glob('**/*')) # lists all files and directories recursively
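Because the '**/*' pattern matches directories as well as files, a files-only listing needs a filter. A minimal sketch, where glob_files is a made-up helper name rather than anything pathlib provides:

```python
import pathlib

def glob_files(top="."):
    """Yield only regular files below top, skipping directories."""
    for entry in pathlib.Path(top).glob("**/*"):
        # glob('**/*') also returns subdirectories, so filter them out.
        if entry.is_file():
            yield entry
```

Path.glob() already returns a lazy iterator, so this stays a generator all the way down and nothing is collected into a list unless the caller asks for one.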
os.scandir() "returns directory entries along with file attribute information, giving better performance [than os.listdir()] for many common use cases." It’s an iterator that does not use os.listdir() internally.
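As a rough sketch of how the original grab_files could be built on os.scandir() instead of os.listdir(): the name scan_files is hypothetical, the with statement assumes Python 3.6 or later, and not following symbolic links to directories is a design choice made here to avoid loops, not something the question requires.

```python
import os

def scan_files(directory):
    """Recursively yield file paths using os.scandir()."""
    with os.scandir(directory) as entries:
        for entry in entries:
            # DirEntry caches file-type information, which often avoids
            # an extra stat call per entry.
            if entry.is_dir(follow_symlinks=False):
                yield from scan_files(entry.path)
            elif entry.is_file():
                yield entry.path
```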