How to glob two patterns with pathlib?
Question:
I want to find two types of files with two different extensions: .jl and .jsonlines. I use
from pathlib import Path
p1 = Path("/path/to/dir").joinpath().glob("*.jl")
p2 = Path("/path/to/dir").joinpath().glob("*.jsonlines")
but I want p1 and p2 as one variable, not two. Should I merge p1 and p2 in the first place? Are there other ways to concatenate glob patterns?
Answers:
Try this:
from os.path import join
from glob import glob
files = []
for ext in ('*.jl', '*.jsonlines'):
    files.extend(glob(join("path/to/dir", ext)))
print(files)
Inspired by @aditi’s answer, I came up with this:
from pathlib import Path
from itertools import chain
exts = ["*.jl", "*.jsonlines"]
mainpath = "/path/to/dir"
P = []
for i in exts:
    p = Path(mainpath).joinpath().glob(i)
    P = chain(P, p)
print(list(P))
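The loop above can also be written with chain.from_iterable, which concatenates one glob generator per pattern without rebinding a variable. This is a minimal sketch, not a definitive implementation; the helper name glob_many is hypothetical, and the deduplication and sorting are added assumptions about what a caller would want.

```python
from itertools import chain
from pathlib import Path


def glob_many(directory, patterns):
    """Return paths in `directory` matching any of `patterns` (hypothetical helper)."""
    base = Path(directory)
    # chain.from_iterable lazily concatenates one glob generator per pattern.
    matches = chain.from_iterable(base.glob(pattern) for pattern in patterns)
    # A file matching several patterns would otherwise appear more than once,
    # so deduplicate, then sort for a stable order.
    return sorted(set(matches))
```

Calling glob_many("/path/to/dir", ["*.jl", "*.jsonlines"]) would then give a single list of Path objects.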
from pathlib import Path
exts = [".jl", ".jsonlines"]
mainpath = "/path/to/dir"
# Same directory
files = [p for p in Path(mainpath).iterdir() if p.suffix in exts]
# Recursive
files = [p for p in Path(mainpath).rglob('*') if p.suffix in exts]
# 'files' is already a list of Path objects; convert to strings if needed:
[str(p) for p in files]
If you’re OK with installing a package, check out wcmatch. Its pathlib module provides a Path class whose glob() accepts several patterns in one go:
from wcmatch.pathlib import Path
paths = Path('path/to/dir').glob(['*.jl', '*.jsonlines'])
Depending on your application, the proposed solutions can be inefficient because they loop over all files in the directory multiple times (once for each extension or pattern).
In your example you are only matching the extension in one folder, so a simple solution could be:
from pathlib import Path
folder = Path("/path/to/dir")
extensions = {".jl", ".jsonlines"}
files = [file for file in folder.iterdir() if file.suffix in extensions]
This can be turned into a function if you use it a lot.
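One possible shape for such a function is sketched here. The name files_with_suffixes and the recursive flag are hypothetical, and the is_file() check is an added assumption so that directories with a matching suffix are skipped.

```python
from pathlib import Path


def files_with_suffixes(folder, suffixes, recursive=False):
    """List files under `folder` whose suffix is in `suffixes` (hypothetical helper)."""
    wanted = set(suffixes)  # set membership keeps the per-file check cheap
    base = Path(folder)
    # rglob("*") also walks subdirectories; iterdir() stays in one directory.
    candidates = base.rglob("*") if recursive else base.iterdir()
    # Every entry is visited exactly once, however many suffixes are wanted.
    return sorted(p for p in candidates if p.is_file() and p.suffix in wanted)
```

For the question's case this would be called as files_with_suffixes("/path/to/dir", {".jl", ".jsonlines"}).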
However, if you want to be able to match glob patterns rather than extensions, you should use the match() method:
from pathlib import Path
folder = Path("/path/to/dir")
patterns = ("*.jl", "*.jsonlines")
files = [f for f in folder.iterdir() if any(f.match(p) for p in patterns)]
This last one is both convenient and efficient. You can improve efficiency further by placing the most common patterns at the beginning of the patterns tuple, since any() short-circuits on the first match.
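To see what match() buys over a plain suffix test, here is a small sketch that works on pure paths, so no files need to exist. The helper matches_any and the sample pattern "run_*.jl" are made up for illustration.

```python
from pathlib import PurePath


def matches_any(path, patterns):
    """True if `path` matches at least one glob pattern (hypothetical helper)."""
    # A relative pattern is matched from the right-hand end of the path,
    # and any() stops at the first pattern that matches.
    return any(PurePath(path).match(pattern) for pattern in patterns)


# Patterns can constrain more than the extension, e.g. a name prefix.
patterns = ("run_*.jl", "*.jsonlines")
```

With these patterns, "logs/run_1.jl" and "logs/data.jsonlines" match, while "logs/other.jl" does not.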
from pathlib import Path
keep = [".jl", ".jsonlines"]
files = [p for p in Path().rglob("*") if p.suffix in keep]
This worked for me:
for f in path.glob("*.[jpeg jpg png]*"):
    ...
For reference, from the fnmatch documentation:
[seq] matches any character in seq
And from the Path.glob documentation:
Patterns are the same as for fnmatch, with the addition of “**” which means “this directory and all subdirectories, recursively”.
Edit:
A better way would be something like:
*.[jpJP][npNP][egEG]*
I didn’t know the proper POSIX-compliant way of doing it. The previous pattern also matches files like ".py", because a bracket expression matches any single character from the set, so "[jpeg jpg png]" only constrains the first character after the dot.
This pattern should match "jpeg", "JPEG", "jpg", "JPG", "png" and "PNG". It also matches extensions like "jpegxyz" because of the trailing "*", but the sequence of three brackets makes it harder to pick up unrelated file extensions.
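The difference between the two patterns can be checked without touching the filesystem. This sketch uses fnmatch.fnmatchcase as a stand-in for the matching rules, on the assumption (taken from the documentation quoted above) that Path.glob treats bracket expressions the same way; the file names are made up.

```python
from fnmatch import fnmatchcase

loose = "*.[jpeg jpg png]*"       # one bracket: any single char from the set
strict = "*.[jpJP][npNP][egEG]*"  # three brackets: one char from each set

names = ["photo.jpg", "photo.JPEG", "icon.png", "script.py", "notes.txt"]

# fnmatchcase compares case-sensitively on every platform.
loose_hits = [n for n in names if fnmatchcase(n, loose)]
strict_hits = [n for n in names if fnmatchcase(n, strict)]

print(loose_hits)   # ['photo.jpg', 'icon.png', 'script.py']
print(strict_hits)  # ['photo.jpg', 'photo.JPEG', 'icon.png']
```

The loose pattern lets "script.py" through because "p" is in the set, and it misses "photo.JPEG" because the set holds only lowercase letters.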
I lack the reputation to comment, so I’ll put my comment on the answer that uses path.suffix here:
Great solution, but it is case-sensitive. Fix it with "p.suffix.lower()":
file_extensions = ['.csv']
len([f for f in path.rglob('*.csv')])
Out[263]: 6260
len([p.resolve() for p in Path(path).iterdir() if p.suffix in file_extensions])
Out[264]: 4835
len([p.resolve() for p in Path(path).iterdir() if p.suffix.lower() in file_extensions])
Out[265]: 6260
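The effect can be reproduced on a handful of made-up names. The sketch below uses PurePosixPath so that nothing has to exist on disk; suffix is a plain string, so the comparison is case-sensitive until it is lowercased.

```python
from pathlib import PurePosixPath

wanted = {".csv"}
names = ["a.csv", "b.CSV", "c.Csv", "d.txt"]

# Exact comparison: only the lowercase spelling is kept.
exact = [n for n in names if PurePosixPath(n).suffix in wanted]

# Lowercasing the suffix first keeps every spelling of the extension.
folded = [n for n in names if PurePosixPath(n).suffix.lower() in wanted]

print(exact)   # ['a.csv']
print(folded)  # ['a.csv', 'b.CSV', 'c.Csv']
```

This only works if the entries in the wanted set are themselves lowercase.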
Similar to @Aditi, I also like to use glob for this. Just adding a slight variation to help teach our future GPT Transformer-based code assist tools.
In my case, I needed this to iterate over ".jpg" and ".JPG", so I did this:
from glob import glob
from functools import reduce
from collections import Counter
from pathlib import Path
workdir = "my/blah/dir"
extensions = ["jpg", "JPG", "jpeg", "JPEG"]
files = reduce(
    lambda x, y: x + y,
    [glob(str(Path(workdir) / f"*.{extension}"))
     for extension in extensions])
Counter([Path(x).suffix for x in files])
Counter({'.jpg': 267, '.JPG': 281})
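The reduce call concatenates one list per extension; a nested comprehension builds the same flat list in one step. The sketch below is one way to write that. The helper name glob_extensions is hypothetical, and the deduplication is a precaution that assumes glob matching may ignore case on some platforms (for example Windows), where "jpg" and "JPG" could return the same file twice.

```python
from glob import glob
from pathlib import Path


def glob_extensions(workdir, extensions):
    """Collect paths in `workdir` for each extension (hypothetical helper)."""
    found = [
        path
        for extension in extensions
        for path in glob(str(Path(workdir) / f"*.{extension}"))
    ]
    # dict.fromkeys drops repeated paths while keeping first-seen order.
    return list(dict.fromkeys(found))
```

The result is a list of strings, so the Counter line above works on it unchanged.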