pathlib.Path().glob() and multiple file extensions

Question:

I need to specify multiple file extensions, something like pathlib.Path(temp_folder).glob('*.xls', '*.txt').

How can I do it?

https://docs.python.org/dev/library/pathlib.html#pathlib.Path.glob

Asked By: Dmitry Bubnenkov


Answers:

If you need to use pathlib.Path.glob(), note that it takes a single pattern, so call it once per extension and combine the results:

from pathlib import Path


def get_files(extensions):
    # glob() accepts a single pattern, so run it once per pattern
    all_files = []
    for ext in extensions:
        all_files.extend(Path('.').glob(ext))
    return all_files


files = get_files(('*.txt', '*.py', '*.cfg'))
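
A variation on the loop above, as a minimal sketch: it takes the folder as a parameter and collects matches in a set, so a file matched by two overlapping patterns is reported only once. The name glob_many and its parameters are hypothetical, not part of pathlib.

```python
from pathlib import Path


def glob_many(folder, patterns):
    """Return the sorted, de-duplicated paths in `folder` matching any pattern."""
    folder = Path(folder)
    matches = set()
    for pattern in patterns:
        # Path.glob() accepts one pattern per call, so call it once per pattern.
        matches.update(folder.glob(pattern))
    return sorted(matches)
```

For example, glob_many('.', ('*.txt', '*.xls')) would return the .txt and .xls files of the current directory as Path objects.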
Answered By: mattjvincent

You can also use the ** syntax from pathlib, which lets you collect nested paths recursively.

from pathlib import Path
import re


BASE_DIR = Path('.')
EXTENSIONS = {'.xls', '.txt'}

for path in BASE_DIR.glob(r'**/*'):
    if path.suffix in EXTENSIONS:
        print(path)
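
Note that comparing path.suffix against the set is case-sensitive, so the loop above would skip a file named REPORT.TXT. A minimal sketch that normalizes the suffix and skips directories; find_by_suffix is a hypothetical helper name, not a pathlib function.

```python
from pathlib import Path


def find_by_suffix(base_dir, extensions):
    """Recursively yield files under `base_dir` whose suffix is in `extensions`."""
    # Compare in lower case so '.TXT' and '.txt' are treated alike.
    wanted = {ext.lower() for ext in extensions}
    # rglob('*') behaves like glob('**/*'): it walks all subdirectories.
    for path in Path(base_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path
```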

If you want to express more logic in your search, you can also use a regex as follows:

pattern_sample = re.compile(r'/(([^/]+/)+)(S(\d+)_\d+)\.(tif|JPG)')

This pattern looks for all images (tif and JPG) named like S327_008(_flipped)?.tif in my case. Specifically, it captures the file name and the sample id.

Collecting into a set prevents storing duplicates. I found this useful when adding more logic, for example to ignore different versions of the same file (_flipped).

matched_images = set()

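for item in BASE_DIR.glob(r'**/*'):
    # the pattern starts with '/', so match against the absolute POSIX-style path
    match = re.match(pattern=pattern_sample, string=item.resolve().as_posix())
    if match:
        # retrieve the groups of interest: file name (group 3) and sample id (group 4)
        filename, sample_id = match.group(3, 4)
        matched_images.add((filename, int(sample_id)))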
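
Matching only the file name avoids depending on how the path itself is written (relative or absolute, which separator). A minimal sketch, assuming file names such as S327_008.tif or S327_008_flipped.tif; the pattern NAME_PATTERN and the helper parse_sample are illustrative assumptions, not the regex used above.

```python
import re
from pathlib import Path

# Assumed naming scheme: S<sample id>_<index>, an optional "_flipped",
# then the extension .tif or .JPG.
NAME_PATTERN = re.compile(r"(S(\d+)_\d+)(?:_flipped)?\.(?:tif|JPG)")


def parse_sample(path):
    """Return (base name, sample id) for a matching file name, else None."""
    # fullmatch() requires the whole file name to fit the pattern.
    match = NAME_PATTERN.fullmatch(Path(path).name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

Because the optional _flipped part is left out of the captured base name, the plain and flipped versions of a file produce the same tuple and collapse to one entry when added to a set.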
Answered By: leoburgy

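A four-liner (plus imports) based on "Check if string ends with one of the strings from a list":

import os
from glob import glob

folder = '.'
# include the dot so that a name merely ending in "xls" or "txt" is not matched
suffixes = ('.xls', '.txt')
filter_function = lambda x: x.endswith(suffixes)  # str.endswith() accepts a tuple
list(filter(filter_function, glob(os.path.join(folder, '*'))))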
Answered By: itamar kanter

Suppose that the following folder structure is prepared.

folder
├── test1.png
├── test1.txt
├── test1.xls
├── test2.png
├── test2.txt
└── test2.xls

The simple answer using pathlib.Path is as follows.

from pathlib import Path

ext = ['.txt', '.xls']
folder = Path('./folder')

# Get a list of pathlib.PosixPath
path_list = sorted(filter(lambda path: path.suffix in ext, folder.glob('*')))
print(path_list)
# [PosixPath('folder/test1.txt'), PosixPath('folder/test1.xls'), PosixPath('folder/test2.txt'), PosixPath('folder/test2.xls')]

If you want the paths as a list of strings, convert each one with .as_posix().

# Get a list of string paths
path_list = sorted([path.as_posix() for path in filter(lambda path: path.suffix in ext, folder.glob('*'))])
print(path_list)
# ['folder/test1.txt', 'folder/test1.xls', 'folder/test2.txt', 'folder/test2.xls']
Answered By: Keiku

A bit late to the party, but here are a couple of single-line suggestions that require neither a custom function nor a loop, and that work on Linux:

pathlib.Path.glob() accepts character classes in brackets, each of which matches exactly one character from the listed set. For the case of ".txt" and ".xls" suffixes, one could write

files = pathlib.Path('temp_dir').glob('*.[tx][xl][ts]')

If you need to search for ".xlsx" as well, just append the wildcard "*" after the last closing bracket.

files = pathlib.Path('temp_dir').glob('*.[tx][xl][ts]*')

Keep in mind that the wildcard at the end catches not only the "x" but any trailing characters after the last "t" or "s".

Prepending "**/" to the search pattern makes the search recursive, as discussed in previous answers.
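
To see which names the bracket patterns accept, they can be tried on plain strings without touching the filesystem. A small sketch using PurePosixPath.match(), which applies glob-style matching to a path; the helper matching_names and the sample file names are made up for illustration.

```python
from pathlib import PurePosixPath

STRICT = "*.[tx][xl][ts]"    # exactly three characters after the dot
LOOSE = "*.[tx][xl][ts]*"    # same, followed by anything


def matching_names(names, pattern):
    """Return the names that match the glob-style pattern."""
    return [name for name in names if PurePosixPath(name).match(pattern)]


names = ["a.txt", "b.xls", "c.xlt", "d.tls", "e.png", "f.xlsx", "g.txt.bak"]
print(matching_names(names, STRICT))
print(matching_names(names, LOOSE))
```

Each bracket is matched independently, so the strict pattern also accepts mixed suffixes such as ".xlt" and ".tls", and the loose pattern additionally accepts "f.xlsx" and "g.txt.bak". If only ".txt" and ".xls" are wanted, filtering on path.suffix is the safer choice.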

Answered By: dnt2s