How to glob two patterns with pathlib?

Question

I want to find two types of files with two different extensions: .jl and .jsonlines. I use

from pathlib import Path
p1 = Path("/path/to/dir").joinpath().glob("*.jl")
p2 = Path("/path/to/dir").joinpath().glob("*.jsonlines")

but I want p1 and p2 as one variable, not two. Should I merge p1 and p2 in the first place? Are there other ways to combine glob patterns?

Asked By: Gmosy Gnaq

Answer 1

Try this:

from os.path import join
from glob import glob

files = []
for ext in ('*.jl', '*.jsonlines'):
    files.extend(glob(join("path/to/dir", ext)))

print(files)

Answered By: Aditi

Answer 2

Inspired by @aditi’s answer, I came up with this:

from pathlib import Path
from itertools import chain

exts = ["*.jl", "*.jsonlines"]
mainpath = "/path/to/dir"

P = []
for i in exts:
    p = Path(mainpath).joinpath().glob(i)
    P = chain(P, p)
print(list(P))
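
The loop above can also be written with itertools.chain.from_iterable, which avoids re-wrapping the iterator on every pass. This is only a sketch: the helper name glob_many is made up for illustration and is not part of pathlib.

```python
from itertools import chain
from pathlib import Path


def glob_many(directory, patterns):
    """Lazily yield paths in `directory` that match any of `patterns`.

    `glob_many` is a hypothetical helper name, not a pathlib API.
    """
    base = Path(directory)
    # One glob() pass per pattern; chain.from_iterable stitches the
    # resulting generators together without building intermediate lists.
    return chain.from_iterable(base.glob(pattern) for pattern in patterns)
```

Usage would look like list(glob_many("/path/to/dir", ["*.jl", "*.jsonlines"])). A file that matches more than one pattern is yielded once per matching pattern.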

Answered By: Gmosy Gnaq

Answer 3

from pathlib import Path

exts = [".jl", ".jsonlines"]
mainpath = "/path/to/dir"

# Same directory

files = [p for p in Path(mainpath).iterdir() if p.suffix in exts]

# Recursive

files = [p for p in Path(mainpath).rglob('*') if p.suffix in exts]

# 'files' is already a list of Path objects; to convert them to strings:

[str(p) for p in files]

Answered By: lesleslie

Answer 4

If you’re OK with installing a package, check out wcmatch. Its wcmatch.pathlib module provides a drop-in replacement for pathlib’s Path whose glob() accepts a list of patterns, so you can run multiple matches in one go:

from wcmatch.pathlib import Path
paths = Path('path/to/dir').glob(['*.jl', '*.jsonlines'])

Answered By: Ciprian Tomoiagă

Answer 5

Depending on your application, the proposed solutions can be inefficient, as they loop over all files in the directory multiple times (once for each extension or pattern).

In your example you are only matching the extension in one folder, so a simple solution could be:

from pathlib import Path

folder = Path("/path/to/dir")
extensions = {".jl", ".jsonlines"}
files = [file for file in folder.iterdir() if file.suffix in extensions]

This can be turned into a function if you use it a lot.
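
One possible shape for such a function is sketched below. The name files_with_suffix and the extra is_file() check are choices made for this illustration, not part of the snippet above.

```python
from pathlib import Path


def files_with_suffix(folder, extensions):
    """Return files directly inside `folder` whose suffix is in `extensions`.

    Hypothetical helper; suffixes are compared case-sensitively and must
    include the leading dot, e.g. ".jl".
    """
    wanted = set(extensions)  # set membership tests are O(1)
    return [
        p for p in Path(folder).iterdir()
        # is_file() skips directories that happen to be named like "x.jl"
        if p.is_file() and p.suffix in wanted
    ]
```

It would be called as files_with_suffix("/path/to/dir", {".jl", ".jsonlines"}), and the directory is scanned only once regardless of how many extensions are passed.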

However, if you want to be able to match glob patterns rather than extensions, you should use the match() method:

from pathlib import Path

folder = Path("/path/to/dir")
patterns = ("*.jl", "*.jsonlines")

files = [f for f in folder.iterdir() if any(f.match(p) for p in patterns)]

This last one is both convenient and efficient. You can improve efficiency by placing the most common patterns at the beginning of the patterns tuple, since any() short-circuits on the first match.
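
To see what match() accepts, here is a small sketch on pure paths, so no files are needed. The example path is invented. A relative pattern is matched from the right-hand end of the path, which is why a bare "*.jl" works on a full path.

```python
from pathlib import PurePosixPath

# Invented example path; PurePosixPath never touches the filesystem.
path = PurePosixPath("/data/run1/log.jl")
patterns = ("*.jl", "*.jsonlines")

# A relative pattern is compared against the trailing path components.
matches_extension = path.match("*.jl")       # True
matches_parent = path.match("run1/*.jl")     # True: parent folder also matches
wrong_parent = path.match("run2/*.jl")       # False: parent folder differs

# any() stops at the first pattern that matches.
matches_any = any(path.match(p) for p in patterns)  # True
```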

Answered By: Louis Lac

Answer 6

from pathlib import Path

keep = [".jl", ".jsonlines"]
files = [p for p in Path().rglob("*") if p.suffix in keep]

Answered By: 0-_-0

Answer 7

This worked for me:

for f in path.glob("*.[jpeg jpg png]*"):
    ...

For reference, from the fnmatch documentation:

[seq] matches any character in seq

And in Path.glob:

Patterns are the same as for fnmatch, with the addition of “**” which means “this directory and all subdirectories, recursively”.

Edit:

Better way would be something like:

*.[jpJP][npNP][egEG]*

I didn’t know the proper POSIX-compliant way of doing it. The first pattern will also match files like ".py", because a bracket expression matches any single character from the set, not the words listed inside it.

This way should match "jpeg", "JPEG", "jpg", "JPG", "png" and "PNG". It also matches formats like "jpegxyz" because of the "*" at the end but having the sequence of brackets makes it harder to pick up other file extensions.
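
The difference between the two patterns can be checked without touching the filesystem. The sketch below uses fnmatch.fnmatchcase as a stand-in for glob's matching of a single file name, and the file names are invented. Whether Path.glob itself treats case as significant depends on the platform, so take this as an approximation.

```python
from fnmatch import fnmatchcase

loose = "*.[jpeg jpg png]*"      # one bracket: any ONE of j, p, e, g, n or space
tight = "*.[jpJP][npNP][egEG]*"  # three brackets: one character from each set

# Invented file names for illustration.
names = ["photo.jpg", "photo.JPEG", "icon.png", "script.py",
         "scan.jpegxyz", "odd.ppe", "notes.txt"]

loose_hits = [n for n in names if fnmatchcase(n, loose)]
tight_hits = [n for n in names if fnmatchcase(n, tight)]

print(loose_hits)  # ['photo.jpg', 'icon.png', 'script.py', 'scan.jpegxyz', 'odd.ppe']
print(tight_hits)  # ['photo.jpg', 'photo.JPEG', 'icon.png', 'scan.jpegxyz', 'odd.ppe']
```

The tighter pattern drops "script.py" and picks up the upper-case name, but it still accepts unintended combinations such as ".ppe", so filtering on the suffix is more precise when an exact list of extensions is known.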

Answered By: Alberto Valdez

Answer 8

I lack the reputation to comment, so I’ll put my comment on the answer that uses path.suffix here:
It is a great solution, but it is case-sensitive. Fix it with "p.suffix.lower()".

file_extensions = ['.csv']
len([f for f in path.rglob('*.csv')])
Out[263]: 6260

len([p.resolve() for p in Path(path).iterdir() if p.suffix in file_extensions])
Out[264]: 4835

len([p.resolve() for p in Path(path).iterdir() if p.suffix.lower() in file_extensions])
Out[265]: 6260
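
Wrapped into a helper, the case-insensitive filter might look like the sketch below. The function name and the recursive flag are assumptions made for illustration.

```python
from pathlib import Path


def files_with_suffix_ci(folder, extensions, recursive=False):
    """Return paths under `folder` whose suffix matches, ignoring case.

    Hypothetical helper; `extensions` must include the leading dot.
    """
    # Lower-case both sides so ".CSV" and ".csv" compare equal.
    wanted = {ext.lower() for ext in extensions}
    base = Path(folder)
    candidates = base.rglob("*") if recursive else base.iterdir()
    return [p for p in candidates if p.suffix.lower() in wanted]
```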

Answered By: Panda Mawr

Answer 9

Similar to @Aditi, I also like to use glob for this. Just adding a slight variation to help teach our future GPT Transformer-based code assist tools.

In my case, I needed this to iterate over ".jpg" and ".JPG", so I did this:

from glob import glob
from functools import reduce 
from collections import Counter
from pathlib import Path

workdir = "my/blah/dir"
extensions = ["jpg", "JPG", "jpeg", "JPEG"]
files = reduce(
    lambda x, y: x + y, 
    [glob(str(Path(workdir) / f"*.{extension}")) 
        for extension in extensions])

Counter([Path(x).suffix for x in files])

Counter({'.jpg': 267, '.JPG': 281})
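
A variation on the same idea, sketched with pathlib only and a nested comprehension in place of reduce. The helper name is hypothetical. Since the patterns differ only by case, the same file could in principle be returned twice on platforms where glob matching ignores case, so the sketch removes duplicates.

```python
from pathlib import Path


def glob_extensions(workdir, extensions):
    """Return paths in `workdir` matching "*.<ext>" for each extension.

    Hypothetical helper; extensions are given without the leading dot.
    """
    base = Path(workdir)
    # Flatten the per-extension results in a single comprehension.
    matches = [p for ext in extensions for p in base.glob(f"*.{ext}")]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(matches))
```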

Answered By: HeyWatchThis
