Get directories only with glob pattern using pathlib

Question:

I want to use pathlib's Path.glob() to find directories matching a specific name pattern (*data) in the current working directory. I don’t want to check each result explicitly via .is_dir() or something similar.

Input data

This is the relevant listing: three folders that are the expected result, and one file that matches the same pattern but should not be part of the result.

ls -ld *data
drwxr-xr-x 2 user user 4,0K  9. Sep 10:22 2021-02-11_68923_data/
drwxr-xr-x 2 user user 4,0K  9. Sep 10:22 2021-04-03_38923_data/
drwxr-xr-x 2 user user 4,0K  9. Sep 10:22 2022-01-03_38923_data/
-rw-r--r-- 1 user user    0  9. Sep 10:24 2011-12-43_3423_data

Expected result

[
    '2021-02-11_68923_data/', 
    '2021-04-03_38923_data/',
    '2022-01-03_38923_data/'
]

Minimal working example

from pathlib import Path
cwd = Path.cwd()

result = cwd.glob('*_data/')
result = list(result)

That gives me the 3 folders but also the file.

Also tried the variant cwd.glob('**/*_data/').

Asked By: buhtz

||

Answers:

glob is insufficient here. From the filesystem’s perspective, the directory’s name really is "2021-02-11_68923_data", not "2021-02-11_68923_data/". Since glob only looks at names, it cannot differentiate between "regular" files and directories, and you’d have to add some additional check, such as isdir that you mentioned.
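
The additional check mentioned above can be sketched roughly as follows. The helper name dirs_matching is made up for this illustration, and the demo builds the question's folder layout in a temporary directory so that it does not depend on the real working directory.

```python
import tempfile
from pathlib import Path


def dirs_matching(base: Path, pattern: str) -> list[Path]:
    """Return the entries under base that match pattern and are directories.

    dirs_matching is a hypothetical helper, not part of pathlib.
    """
    # glob() matches on names only, so filter the results with is_dir().
    return sorted(p for p in base.glob(pattern) if p.is_dir())


# Recreate the listing from the question in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for name in ("2021-02-11_68923_data",
                 "2021-04-03_38923_data",
                 "2022-01-03_38923_data"):
        (base / name).mkdir()
    (base / "2011-12-43_3423_data").touch()  # regular file, same pattern

    found = [p.name for p in dirs_matching(base, "*_data")]
    print(found)
```

This is exactly the explicit check the question wanted to avoid, but it behaves the same on every Python version.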

Answered By: Mureinik

The trailing path separator certainly should be respected in pathlib.glob patterns. This is the expected behaviour in shells on all platforms, and is also how the glob module works:

If the pattern is followed by an os.sep or os.altsep then files will not match.

However, there is a bug in pathlib that was fixed in bpo-22276, and merged in Python-3.11.0rc1 (see what’s new: pathlib).
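
For code that has to run on both sides of that fix, one possible approach is to keep the trailing separator in the pattern and add the is_dir() filter only on older interpreters. This is only a sketch: the function name data_dirs and the version cut-off at 3.11 are assumptions based on the release mentioned above.

```python
import sys
import tempfile
from pathlib import Path


def data_dirs(base: Path) -> list[Path]:
    """Return directories under base whose names end in _data.

    data_dirs is a hypothetical helper written for this example.
    """
    matches = base.glob("*_data/")
    if sys.version_info < (3, 11):
        # Assumption: before 3.11 the trailing separator is ignored,
        # so regular files have to be filtered out by hand.
        matches = (p for p in matches if p.is_dir())
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "2021-02-11_68923_data").mkdir()
    (base / "2011-12-43_3423_data").touch()  # regular file, same pattern
    print([p.name for p in data_dirs(base)])
```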

In the meantime, as a work-around you can use the glob module to get the behaviour you want:

$ ls -ld *data
drwxr-xr-x 2 user user 4096 Sep  9 22:45 2022-01-03_38923_data
drwxr-xr-x 2 user user 4096 Sep  9 22:44 2021-04-03_38923_data
drwxr-xr-x 2 user user 4096 Sep  9 22:44 2021-02-11_68923_data
-rw-r--r-- 1 user user    0 Sep  9 22:45 2011-12-43_3423_data
>>> import glob
>>> res = glob.glob('*_data')
>>> print('\n'.join(res))
2022-01-03_38923_data
2011-12-43_3423_data
2021-02-11_68923_data
2021-04-03_38923_data
>>> res = glob.glob('*_data/')
>>> print('\n'.join(res))
2022-01-03_38923_data/
2021-02-11_68923_data/
2021-04-03_38923_data/
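
If Path objects are wanted rather than strings, the same work-around can be wrapped in a small function. This is a minimal sketch; glob_dirs is a made-up name, and the demo directory is created only for illustration.

```python
import glob
import os
import tempfile
from pathlib import Path


def glob_dirs(base, pattern):
    """Return directories under base matching pattern, as Path objects.

    glob_dirs is a hypothetical helper, not part of the standard library.
    """
    # Joining with an empty last component appends the platform's path
    # separator, which makes the glob module skip regular files.
    # glob.escape() keeps special characters in base from acting as wildcards.
    full_pattern = os.path.join(glob.escape(str(base)), pattern, "")
    return sorted(Path(p) for p in glob.glob(full_pattern))


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "2021-02-11_68923_data"))
    Path(tmp, "2011-12-43_3423_data").touch()  # regular file, same pattern
    print([p.name for p in glob_dirs(tmp, "*_data")])
```

Path drops the trailing separator when it parses each result, so the returned names come back without the final slash.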

Answered By: ekhumoro