What's the fastest way to recursively search for files in python?
Question:
I need to generate a list of files with paths that contain a certain string by recursively searching. I’m doing this currently like this:
from glob import iglob

filelist = []
for i in iglob(starting_directory + '/**/*', recursive=True):
    if filemask in i.split('\\')[-1]:  # ignore directories that contain the filemask
        filelist.append(i)
This works, but when crawling a large directory tree, it’s woefully slow (~10 minutes). We’re on Windows, so doing an external call to the unix find command isn’t an option. My understanding is that glob is faster than os.walk.
Is there a faster way of doing this?
Answers:
Maybe not the answer you were hoping for, but I think these timings are useful. Run on a directory with 15,424 directories totalling 102,799 files (of which 3059 are .py files).
Python 3.6:
import os
import glob

def walk():
    pys = []
    for p, d, f in os.walk('.'):
        for file in f:
            if file.endswith('.py'):
                pys.append(file)
    return pys

def iglob():
    pys = []
    for file in glob.iglob('**/*', recursive=True):
        if file.endswith('.py'):
            pys.append(file)
    return pys

def iglob2():
    pys = []
    for file in glob.iglob('**/*.py', recursive=True):
        pys.append(file)
    return pys
# I also tried pathlib.Path.glob but it was slow and error prone, sadly
%timeit walk()
3.95 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit iglob()
5.01 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit iglob2()
4.36 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using GNU find (4.6.0) on Cygwin (4.6.0-1). Edit: the numbers below are on Windows; on Linux I found find to be about 25% faster.
$ time find . -name '*.py' > /dev/null
real 0m8.827s
user 0m1.482s
sys 0m7.284s
It seems os.walk is about as good as you can get on Windows.
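For the original use case, matching a filemask rather than a fixed extension, the os.walk approach from the timings above can be combined with fnmatch. This is a sketch, not a drop-in for the asker's code; find_files and its parameters are names introduced here for illustration:

```python
import fnmatch
import os

def find_files(root, pattern):
    """Recursively collect paths of files under root whose names match pattern."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # fnmatch.filter applies a glob-style pattern (e.g. '*.py') to the names only,
        # so directories whose names happen to match are never included.
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return matches

# Example: all .py files under the current directory
print(find_files('.', '*.py'))
```

Because os.walk already separates directory names from file names, no `split('\\')` trick is needed and the same code works on both Windows and Linux.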
os.walk() is built on os.scandir(), which is the fastest option, and using scandir() directly also gives you DirEntry objects that can be reused for other purposes; below I am getting the modified time. The following code implements a recursive search using os.scandir():
import os
import time

def scantree(path):
    """Recursively yield DirEntry objects for given directory."""
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

for entry in scantree('/home/'):
    if entry.is_file():
        print(entry.path, time.ctime(entry.stat().st_mtime))
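Tying this back to the question, the same generator can drive the substring filter directly, since each DirEntry carries its name and full path. This is a sketch under the question's assumptions; find_matching is a hypothetical helper name, and filemask stands in for whatever string is being searched for:

```python
import os

def scantree(path):
    """Recursively yield DirEntry objects for the given directory."""
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

def find_matching(root, filemask):
    """Return full paths of files whose names contain filemask.

    entry.name is just the final path component, so directories whose
    names contain filemask are never matched.
    """
    return [e.path for e in scantree(root)
            if e.is_file() and filemask in e.name]
```

One stat-related note: DirEntry caches the results of is_dir()/is_file()/stat() from the directory scan itself on Windows, which is a large part of why scandir-based walks are fast there.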