How can I search sub-folders using glob.glob module?
Question:
I want to open a series of subfolders in a folder and find some text files and print some lines of the text files. I am using this:
configfiles = glob.glob('C:/Users/sam/Desktop/file1/*.txt')
But this cannot access the subfolders as well. Does anyone know how I can use the same command to access subfolders as well?
Answers:
In Python 3.5 and newer, use the new recursive **/ functionality:
configfiles = glob.glob('C:/Users/sam/Desktop/file1/**/*.txt', recursive=True)
When recursive is set, ** followed by a path separator matches 0 or more subdirectories.
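As a quick check of the "0 or more subdirectories" rule, here is a minimal sketch that builds a throwaway tree in a temporary directory (the file names top.txt and deep.txt are made up for illustration) and globs it recursively:

```python
import glob
import os
import tempfile

# Build a throwaway tree: one .txt at the top level, one nested two levels down.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("top.txt", os.path.join("a", "b", "deep.txt")):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("hello\n")

    # "**" with recursive=True matches zero directories (top.txt)
    # as well as several nested ones (a/b/deep.txt).
    pattern = os.path.join(root, "**", "*.txt")
    found = sorted(os.path.relpath(p, root)
                   for p in glob.glob(pattern, recursive=True))

print(found)
```

Both files are listed, so the top-level file is not skipped by the `**/` part of the pattern.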
In earlier Python versions, glob.glob() cannot list files in subdirectories recursively.
In that case I’d use os.walk() combined with fnmatch.filter() instead:
import os
import fnmatch
path = 'C:/Users/sam/Desktop/file1'
configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in fnmatch.filter(files, '*.txt')]
This’ll walk your directories recursively and return the full pathnames of all matching .txt files (absolute here, since path is absolute). In this specific case fnmatch.filter() may be overkill; you could also use a .endswith() test:
import os
path = 'C:/Users/sam/Desktop/file1'
configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in files if f.endswith('.txt')]
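Since the question also asks to print some lines of each text file, here is a minimal sketch that combines the os.walk() approach with reading the first few lines. The function name print_head and its parameters n_lines and suffix are hypothetical, chosen only for this example:

```python
import os
from itertools import islice

def print_head(path, n_lines=3, suffix=".txt"):
    """Print the first n_lines of every file under path whose name ends with suffix."""
    shown = []
    for dirpath, dirnames, files in os.walk(path):
        for name in files:
            if name.endswith(suffix):
                full = os.path.join(dirpath, name)
                # errors="replace" keeps the loop going on oddly encoded files.
                with open(full, encoding="utf-8", errors="replace") as fh:
                    for line in islice(fh, n_lines):
                        print(f"{full}: {line.rstrip()}")
                shown.append(full)
    return shown
```

Calling print_head('C:/Users/sam/Desktop/file1') would then print up to three lines per matching file and return the list of files it visited.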
To find files in immediate subdirectories:
configfiles = glob.glob(r'C:\Users\sam\Desktop\*\*.txt')
For a recursive version that traverses all subdirectories, you can use ** and pass recursive=True (available since Python 3.5):
configfiles = glob.glob(r'C:\Users\sam\Desktop\**\*.txt', recursive=True)
Both function calls return lists. You could use glob.iglob() to return paths one by one. Or use pathlib:
from pathlib import Path
path = Path(r'C:\Users\sam\Desktop')
txt_files_only_subdirs = path.glob('*/*.txt')
txt_files_all_recursively = path.rglob('*.txt') # including the current dir
Both methods return iterators (you can get paths one by one).
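To make the difference between the two pathlib calls concrete, here is a small sketch on a temporary tree (the directory and file names are invented for the example):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub" / "deeper").mkdir(parents=True)
    for rel in ("top.txt", "sub/one.txt", "sub/deeper/two.txt"):
        (root / rel).write_text("x\n")

    # Both calls return lazy iterators; sorted() consumes them.
    only_subdirs = sorted(p.relative_to(root).as_posix()
                          for p in root.glob("*/*.txt"))
    recursive = sorted(p.relative_to(root).as_posix()
                       for p in root.rglob("*.txt"))

print(only_subdirs)  # ['sub/one.txt']
print(recursive)     # ['sub/deeper/two.txt', 'sub/one.txt', 'top.txt']
```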
You can use Formic with Python 2.6
import formic
fileset = formic.FileSet(include="**/*.txt", directory="C:/Users/sam/Desktop/")
Disclosure – I am the author of this package.
The glob2 package supports wildcards and is reasonably fast:
import timeit
code = '''
import glob2
glob2.glob("files/*/**")
'''
timeit.timeit(code, number=1)
On my laptop it takes approximately 2 seconds to match >60,000 file paths.
Here is an adapted version that enables glob.glob-like functionality without using glob2.
import fnmatch
import os

def find_files(directory, pattern='*'):
    # Return every file below directory whose full path matches pattern.
    if not os.path.exists(directory):
        raise ValueError("Directory not found {}".format(directory))
    matches = []
    for root, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            full_path = os.path.join(root, filename)
            # Match against the whole path, not just the filename.
            if fnmatch.fnmatch(full_path, pattern):
                matches.append(full_path)
    return matches
So if you have the following dir structure
tests/files
├── a0
│   ├── a0.txt
│   ├── a0.yaml
│   └── b0
│       ├── b0.yaml
│       └── b00.yaml
└── a1
You can do something like this
files = utils.find_files('tests/files','**/b0/b*.yaml')
> ['tests/files/a0/b0/b0.yaml', 'tests/files/a0/b0/b00.yaml']
This is essentially an fnmatch pattern match on the whole path, rather than on the filename only.
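One consequence of matching with fnmatch on the whole path is worth seeing in isolation: fnmatch's * is not path-aware, so it also matches across directory separators. A minimal sketch (fnmatchcase is used here only to keep the result independent of the platform's case rules):

```python
import fnmatch

path = "tests/files/a0/b0/b0.yaml"

# fnmatch's "*" matches any characters, including "/".
whole_path = fnmatch.fnmatchcase(path, "**/b0/b*.yaml")
single_star = fnmatch.fnmatchcase(path, "*/b0/b*.yaml")

# The bare filename does not match a pattern that contains directories.
name_only = fnmatch.fnmatchcase("b0.yaml", "**/b0/b*.yaml")

# Unlike glob, a single "*" here crosses directory levels.
crosses_dirs = fnmatch.fnmatchcase("tests/files/a0/a0.yaml", "tests/*.yaml")

print(whole_path, single_star, name_only, crosses_dirs)
# True True False True
```

So with this approach ** and * behave the same, and a pattern can match deeper files than the equivalent glob pattern would.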
As pointed out by Martijn, glob can only do this through the ** operator introduced in Python 3.5. Since the OP explicitly asked for the glob module, the following will return a lazily evaluated iterator that behaves similarly:
import os, glob, itertools
configfiles = itertools.chain.from_iterable(glob.iglob(os.path.join(root, '*.txt'))
                                            for root, dirs, files in os.walk('C:/Users/sam/Desktop/file1/'))
Note that you can only iterate once over configfiles in this approach though. If you require a real list of configfiles that can be used in multiple operations, you would have to create it explicitly with list(configfiles).
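The one-pass behaviour can be demonstrated with a small sketch on a temporary directory (the file names a.txt and b.txt are placeholders): the second attempt to read the iterator yields nothing.

```python
import glob
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        open(os.path.join(root, rel), "w").close()

    configfiles = itertools.chain.from_iterable(
        glob.iglob(os.path.join(dirpath, "*.txt"))
        for dirpath, dirs, files in os.walk(root))

    first_pass = sorted(os.path.basename(p) for p in configfiles)
    second_pass = list(configfiles)  # the iterator is already exhausted

print(first_pass)   # ['a.txt', 'b.txt']
print(second_pass)  # []
```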
configfiles = glob.glob('C:/Users/sam/Desktop/**/*.txt')
This doesn’t work in all cases; use glob2 instead:
configfiles = glob2.glob('C:/Users/sam/Desktop/**/*.txt')
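If you can install the glob2 package (forward slashes are used below because backslashes such as \t would be read as escape sequences, and a string cannot end in a lone backslash):
import glob2
filenames = glob2.glob("C:/top_directory/**/*.ext")  # Where ext is a specific file extension
folders = glob2.glob("C:/top_directory/**/")
All filenames and folders:
all_ff = glob2.glob("C:/top_directory/**/**")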
If you’re running Python 3.4+, you can use the pathlib module. The Path.glob() method supports the ** pattern, which means “this directory and all subdirectories, recursively”. It returns a generator yielding Path objects for all matching files.
from pathlib import Path
configfiles = Path("C:/Users/sam/Desktop/file1/").glob("**/*.txt")
There’s a lot of confusion on this topic. Let me see if I can clarify it (Python 3.7):
1. glob.glob('*.txt'): matches all files ending in ‘.txt’ in the current directory
2. glob.glob('*/*.txt'): matches all files ending in ‘.txt’ in the immediate subdirectories only, but not in the current directory
3. glob.glob('**/*.txt'): same as 2 (without recursive=True, ** behaves like *)
4. glob.glob('*.txt', recursive=True): same as 1
5. glob.glob('*/*.txt', recursive=True): same as 2
6. glob.glob('**/*.txt', recursive=True): matches all files ending in ‘.txt’ in the current directory and in all subdirectories
So it’s best to always specify recursive=True.
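The six combinations are easy to check for yourself. The following is a minimal sketch that runs each pattern against a temporary tree; the helper relative_matches and the file names are made up for this example, and patterns are joined onto the temporary root instead of relying on the current directory:

```python
import glob
import os
import tempfile

def relative_matches(root, pattern, recursive=False):
    """Glob pattern below root; return sorted POSIX-style paths relative to root."""
    hits = glob.glob(os.path.join(root, pattern), recursive=recursive)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub", "deep"))
    for rel in ("top.txt", "sub/one.txt", "sub/deep/two.txt"):
        open(os.path.join(root, rel), "w").close()

    results = {
        (pattern, recursive): relative_matches(root, pattern, recursive)
        for pattern in ("*.txt", "*/*.txt", "**/*.txt")
        for recursive in (False, True)
    }

for key, value in results.items():
    print(key, value)
```

On this tree, '*.txt' finds only top.txt, '*/*.txt' finds only sub/one.txt regardless of the recursive flag, and '**/*.txt' finds all three files only when recursive=True.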
Path.rglob recurses all the way down to the deepest level of your directory structure. If you only want to go one level deep, do not use it.
I realize the OP was talking about using glob.glob. I believe this answers the intent, however, which is to search all subfolders recursively.
rglob recently produced a 100x increase in speed for a data processing algorithm that had been relying on the folder structure as a fixed assumption for the order of data reading. With rglob we were able to scan once through all files at or below a specified parent directory, save their names to a list (over a million files), and then use that list to determine which files to open at any later point based on the file naming conventions alone, rather than on which folder they were in.
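The scan-once-then-filter idea might look roughly like the sketch below. This is not the original algorithm, only an illustration: the function names build_index and select_by_name, and the idea of selecting by a filename prefix, are assumptions made for the example.

```python
from pathlib import Path

def build_index(parent):
    """Scan once and return every file at or below parent as a list of Paths."""
    return [p for p in Path(parent).rglob("*") if p.is_file()]

def select_by_name(index, prefix):
    """Pick files from the saved index by naming convention alone."""
    return [p for p in index if p.name.startswith(prefix)]
```

The index is built a single time, for example index = build_index('C:/Users/sam/Desktop/file1'), and every later lookup such as select_by_name(index, 'run1_') works on the in-memory list without touching the file system again.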
You can use glob.glob() or glob.iglob() from the glob module to retrieve paths, including (with recursive=True) paths inside subdirectories.
Syntax:
glob.glob(pathname, *, recursive=False) # pathname = '/path/to/the/directory' or subdirectory
glob.iglob(pathname, *, recursive=False)
In your example, it is possible to write like this:
import glob
import os
configfiles = glob.glob("C:/Users/sam/Desktop/*.txt")
for f in configfiles:
    print(f'Filename with path: {f}')
    print(f'Only filename: {os.path.basename(f)}')
    print(f'Filename without extensions: {os.path.splitext(os.path.basename(f))[0]}')
Output:
Filename with path: C:/Users/sam/Desktop/test_file.txt
Only filename: test_file.txt
Filename without extensions: test_file
Help:
Documentation for os.path.splitext and documentation for os.path.basename.
(The first options are of course mentioned in other answers; the goal here is to show that glob uses os.scandir internally, and to provide a direct answer with it.)
Using glob
As explained before, with Python 3.5+, it’s easy:
import glob
for f in glob.glob('d:/temp/**/*', recursive=True):
    print(f)
# d:\temp\New folder
# d:\temp\New Text Document - Copy.txt
# d:\temp\New folder\New Text Document - Copy.txt
# d:\temp\New folder\New Text Document.txt
Using pathlib
from pathlib import Path
for f in Path('d:/temp').glob('**/*'):
    print(f)
Using os.scandir
os.scandir is what glob does internally. So here is how to do it directly, using yield:
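import os

def listpath(path):
    # Recursively yield every entry (files and directories) below path.
    with os.scandir(path) as entries:
        for entry in entries:
            # entry.path already includes the parent directory.
            yield entry.path
            if entry.is_dir():
                yield from listpath(entry.path)

for f in listpath(r'd:\temp'):
    print(f)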
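To get back to the original goal of finding only text files, the same os.scandir recursion can filter as it goes. This is a sketch under assumptions: the name iter_txt_files and the suffix parameter are hypothetical, and symlinked directories are deliberately not followed to avoid loops.

```python
import os

def iter_txt_files(path, suffix=".txt"):
    """Yield paths of regular files under path whose names end with suffix."""
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Descend into real subdirectories only.
                yield from iter_txt_files(entry.path, suffix)
            elif entry.is_file() and entry.name.endswith(suffix):
                yield entry.path
```

Because it is a generator, paths are produced one by one, much like glob.iglob().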