Python glob but against a list of strings rather than the filesystem

Question:

I want to be able to match a pattern in glob format to a list of strings, rather than to actual files in the filesystem. Is there any way to do this, or convert a glob pattern easily to a regex?

Asked By: Jason S


Answers:

Never mind, I found it. I want the fnmatch module.

Answered By: Jason S

While fnmatch.fnmatch can be used directly to check whether a pattern matches a filename, you can also use the fnmatch.translate function to generate a regex from the given fnmatch pattern:

>>> import fnmatch
>>> fnmatch.translate('*.txt')
'.*\\.txt\\Z(?ms)'

From the documentation:

fnmatch.translate(pattern)

Return the shell-style pattern converted to a regular expression.
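
For example, the regex returned by translate can be compiled and used to filter a list of strings (a minimal sketch; the exact text translate returns varies across Python versions, but it always anchors at the end of the string):

```python
import fnmatch
import re

names = ['notes.txt', 'report.txt', 'image.png']
# Compile the glob into a regex once, then test each string against it
regex = re.compile(fnmatch.translate('*.txt'))

matches = [name for name in names if regex.match(name)]
print(matches)  # ['notes.txt', 'report.txt']
```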

Answered By: Anshul Goyal

The glob module uses the fnmatch module for individual path elements.

That means the path is split into the directory name and the filename, and if the directory name contains meta characters (any of the characters [, * or ?) then these are expanded recursively.
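
You can see the check glob performs with glob.has_magic, the (undocumented) helper it uses internally for this; a quick illustration:

```python
import glob

# has_magic reports whether a path element contains glob meta characters
print(glob.has_magic('foo/bar'))   # False: no wildcard characters
print(glob.has_magic('foo/*.py'))  # True: contains *
```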

If you have a list of strings that are simple filenames, then just using the fnmatch.filter() function is enough:

import fnmatch

matching = fnmatch.filter(filenames, pattern)

but if they contain full paths, you need to do more work as the regular expression generated doesn’t take path segments into account (wildcards don’t exclude the separators nor are they adjusted for cross-platform path matching).
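
A quick demonstration of the problem (illustrative only): the regex fnmatch generates lets * run straight across / characters:

```python
import fnmatch

# `*` becomes `.*`, which does not stop at path separators,
# so a pattern meant for one path segment matches nested paths too:
print(fnmatch.fnmatch('foo/bar', 'foo/*'))      # True
print(fnmatch.fnmatch('foo/bar/baz', 'foo/*'))  # True
```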

You can construct a simple trie from the paths, then match your pattern against that:

import fnmatch
import glob
import os.path
from itertools import product


# Cross-Python dictionary views on the keys 
if hasattr(dict, 'viewkeys'):
    # Python 2
    def _viewkeys(d):
        return d.viewkeys()
else:
    # Python 3
    def _viewkeys(d):
        return d.keys()


def _in_trie(trie, path):
    """Determine if path is completely in trie"""
    current = trie
    for elem in path:
        try:
            current = current[elem]
        except KeyError:
            return False
    return None in current


def find_matching_paths(paths, pattern):
    """Produce a list of paths that match the pattern.

    * paths is a list of strings representing filesystem paths
    * pattern is a glob pattern as supported by the fnmatch module

    """
    if os.altsep:  # normalise
        pattern = pattern.replace(os.altsep, os.sep)
    pattern = pattern.split(os.sep)

    # build a trie out of path elements; efficiently search on prefixes
    path_trie = {}
    for path in paths:
        if os.altsep:  # normalise
            path = path.replace(os.altsep, os.sep)
        _, path = os.path.splitdrive(path)
        elems = path.split(os.sep)
        current = path_trie
        for elem in elems:
            current = current.setdefault(elem, {})
        current.setdefault(None, None)  # sentinel

    matching = []

    current_level = [path_trie]
    for subpattern in pattern:
        if not glob.has_magic(subpattern):
            # plain element, element must be in the trie or there are
            # 0 matches
            if not any(subpattern in d for d in current_level):
                return []
            matching.append([subpattern])
            current_level = [d[subpattern] for d in current_level if subpattern in d]
        else:
            # match all next levels in the trie that match the pattern
            matched_names = fnmatch.filter({k for d in current_level for k in d}, subpattern)
            if not matched_names:
                # nothing found
                return []
            matching.append(matched_names)
            current_level = [d[n] for d in current_level for n in _viewkeys(d) & set(matched_names)]

    return [os.sep.join(p) for p in product(*matching)
            if _in_trie(path_trie, p)]

This mouthful can quickly find matches using globs anywhere along the path:

>>> paths = ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/foo/bar/*')
['/foo/bar/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/*/bar/b*')
['/foo/bar/baz', '/foo/bar/bar']
>>> find_matching_paths(paths, '/*/[be]*/b*')
['/foo/bar/baz', '/foo/bar/bar', '/spam/eggs/baz']

Answered By: Martijn Pieters

On Python 3.4+ you can just use PurePath.match.

pathlib.PurePath(path_string).match(pattern)

On Python 3.3 or earlier (including 2.x), get pathlib from PyPI.

Note that to get platform-independent results (whether you want them depends on why you’re matching) you’d want to explicitly use PurePosixPath or PureWindowsPath.
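
For example (a small sketch; PurePath.match matches from the right, and the Windows flavour is case-insensitive):

```python
from pathlib import PurePosixPath, PureWindowsPath

# Matching is purely lexical, no filesystem access is needed
print(PurePosixPath('/foo/bar/baz.txt').match('bar/*.txt'))  # True: matched from the right
print(PurePosixPath('/foo/bar/baz.txt').match('*.TXT'))      # False: POSIX flavour is case-sensitive
print(PureWindowsPath('C:/foo/bar/baz.txt').match('*.TXT'))  # True: Windows flavour ignores case
```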

Answered By: Veedrac

Good artists copy; great artists steal.

I stole 😉

fnmatch.translate translates the glob wildcards ? and * to the regex . and .* respectively, which also match path separators. I tweaked it so they don’t.

import re

def glob2re(pat):
    """Translate a shell PATTERN to a regular expression.

    There is no way to quote meta-characters.
    """

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i+1
        if c == '*':
            #res = res + '.*'
            res = res + '[^/]*'
        elif c == '?':
            #res = res + '.'
            res = res + '[^/]'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j].replace('\\','\\\\')
                i = j+1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] == '^':
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    return res + '\\Z(?ms)'

This one filters à la fnmatch.filter; both re.match and re.search work with it.

def glob_filter(names,pat):
    return (name for name in names if re.match(glob2re(pat),name))

The glob patterns and strings found on this page pass the test:

pat_dict = {
    'a/b/*/f.txt': ['a/b/c/f.txt', 'a/b/q/f.txt', 'a/b/c/d/f.txt', 'a/b/c/d/e/f.txt'],
    '/foo/bar/*': ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar'],
    '/*/bar/b*': ['/foo/bar/baz', '/foo/bar/bar'],
    '/*/[be]*/b*': ['/foo/bar/baz', '/foo/bar/bar'],
    '/foo*/bar': ['/foolicious/spamfantastic/bar', '/foolicious/bar'],
}
for pat in pat_dict:
    print('pattern :\t{}\nstrings :\t{}'.format(pat, pat_dict[pat]))
    print('matched :\t{}\n'.format(list(glob_filter(pat_dict[pat], pat))))

Answered By: Nizam Mohamed

An extension to @Veedrac’s PurePath.match answer that can be applied to a list of strings:

# Python 3.4+
from pathlib import Path

path_list = ["foo/bar.txt", "spam/bar.txt", "foo/eggs.txt"]
# convert string to pathlib.PosixPath / .WindowsPath, then apply PurePath.match to list
print([p for p in path_list if Path(p).match("ba*")])  # "*ba*" also works
# output: ['foo/bar.txt', 'spam/bar.txt']

print([p for p in path_list if Path(p).match("*o/ba*")])
# output: ['foo/bar.txt']

It is preferable to use pathlib.Path() over pathlib.PurePath(), because then you don’t have to worry about the underlying filesystem.

Answered By: NumesSanguis

I wanted to add support for recursive glob patterns, i.e. things/**/*.py and have relative path matching so example*.py doesn’t match with folder/example_stuff.py.

Here is my approach:


from os import path
from fnmatch import translate as fnmatch_translate
import re

def recursive_glob_filter(files, glob):
    # Convert to regex and add start of line match
    pattern_re = '^' + fnmatch_translate(glob)

    # fnmatch does not escape path separators so escape them
    if path.sep in pattern_re and not r'\{}'.format(path.sep) in pattern_re:
        pattern_re = pattern_re.replace('/', r'\/')

    # Replace `*` with one that ignores path separators
    sep_respecting_wildcard = '[^{}]*'.format(path.sep)
    pattern_re = pattern_re.replace('.*', sep_respecting_wildcard)

    # And now for `**` we have `[^/]*[^/]*`, so replace that with `.*`
    # to match all patterns in-between
    pattern_re = pattern_re.replace(2 * sep_respecting_wildcard, '.*')
    compiled_re = re.compile(pattern_re)
    return filter(compiled_re.search, files)

Answered By: Carson Gee

Here is a glob that can deal with escaped punctuation. It does not stop on path separators. I’m posting it here because it matches the title of the question.

To use on a list:

rex = glob_to_re(glob_pattern)
rex = r'(?s:%s)\Z' % rex # Can match newline; match whole string.
rex = re.compile(rex)
matches = [name for name in names if rex.match(name)]

Here’s the code:

import re as _re

class GlobSyntaxError(SyntaxError):
    pass

def glob_to_re(pattern):
    r"""
    Given pattern, a unicode string, return the equivalent regular expression.
    Any special character * ? [ ! - ] can be escaped by preceding it with
    backslash ('\') in the pattern.  Forward-slashes ('/') and escaped
    backslashes ('\\') are treated as ordinary characters, not boundaries.

    Here is the language glob_to_re understands.
    Earlier alternatives within rules have precedence.  
        pattern = item*
        item    = '*'  |  '?'  |  '[!' set ']'  |  '[' set ']'  |  literal
        set     = element element*
        element = literal '-' literal  |  literal
        literal = '\' char  |  char other than \ [ ] and sometimes -
    glob_to_re does not understand "{a,b...}".
    """
    # (Note: the docstring above is r""" ... """ to preserve backslashes.)
    def expect_char(i, context):
        if i >= len(pattern):
            s = "Unfinished %s: %r, position %d." % (context, pattern, i)
            raise GlobSyntaxError(s)
    
    def literal_to_re(i, context="pattern", bad="[]"):
        if pattern[i] == '\\':
            i += 1
            expect_char(i, "backslashed literal")
        else:
            if pattern[i] in bad:
                s = "Unexpected %r in %s: %r, position %d." \
                    % (pattern[i], context, pattern, i)
                raise GlobSyntaxError(s)
        return _re.escape(pattern[i]), i + 1

    def set_to_re(i):
        assert pattern[i] == '['
        set_re = "["
        i += 1
        try:
            if pattern[i] == '!':
                set_re += '^'
                i += 1
            while True:
                lit_re, i = literal_to_re(i, "character set", bad="[-]")
                set_re += lit_re
                if pattern[i] == '-':
                    set_re += '-'
                    i += 1
                    expect_char(i, "character set range")
                    lit_re, i = literal_to_re(i, "character set range", bad="[-]")
                    set_re += lit_re
                if pattern[i] == ']':
                    return set_re + ']', i + 1
                
        except IndexError:
            expect_char(i, "character set")  # Trigger "unfinished" error.

    i = 0
    re_pat = ""
    while i < len(pattern):
        if pattern[i] == '*':
            re_pat += ".*"
            i += 1
        elif pattern[i] == '?':
            re_pat += "."
            i += 1
        elif pattern[i] == '[':
            set_re, i = set_to_re(i)
            re_pat += set_re
        else:
            lit_re, i = literal_to_re(i)
            re_pat += lit_re
    return re_pat

Can’t say how efficient it is, but it is much less verbose, much less complicated, more complete, and possibly more secure/reliable than other solutions.

Supported syntax:

  • * — matches zero or more characters.
  • ** (actually, it’s either **/ or /**) — matches zero or more subdirectories.
  • ? — matches one character.
  • [] — matches one character within brackets.
  • [!] — matches one character not within brackets.
  • Due to escaping with '\', only / can be used as a path separator.

Order of operation:

  1. Escape special RE chars in glob.
  2. Generate RE for tokenization of escaped glob.
  3. Replace escaped glob tokens by equivalent RE.

import re
from sys import hexversion, implementation
# Support for insertion-preserving/ordered dicts became language feature in Python 3.7, but works in CPython since 3.6.
if hexversion >= 0x03070000 or (implementation.name == 'cpython' and hexversion >= 0x03060000):
    ordered_dict = dict
else:
    from collections import OrderedDict as ordered_dict

escaped_glob_tokens_to_re = ordered_dict((
    # Order of ``**/`` and ``/**`` in RE tokenization pattern doesn't matter because ``**/`` will be caught first no matter what, making ``/**`` the only option later on.
    # W/o leading or trailing ``/`` two consecutive asterisks will be treated as literals.
    ('/\\*\\*', '(?:/.+?)*'), # Edge-case #1. Catches recursive globs in the middle of path. Requires edge case #2 handled after this case.
    ('\\*\\*/', '(?:^.+?/)*'), # Edge-case #2. Catches recursive globs at the start of path. Requires edge case #1 handled before this case. ``^`` is used to ensure proper location for ``**/``.
    ('\\*', '[^/]*'), # ``[^/]*`` is used to ensure that ``*`` won't match subdirs, as with naive ``.*?`` solution.
    ('\\?', '.'),
    ('\\[\\*\\]', '\\*'), # Escaped special glob character.
    ('\\[\\?\\]', '\\?'), # Escaped special glob character.
    ('\\[!', '[^'), # Requires ordered dict, so that ``[!`` precedes ``[`` in RE pattern. Needed mostly to differentiate between ``!`` used within character class ``[]`` and outside of it, to avoid faulty conversion.
    ('\\[', '['),
    ('\\]', ']'),
))

escaped_glob_replacement = re.compile('(%s)' % '|'.join(map(re.escape, escaped_glob_tokens_to_re)))

def glob_to_re(pattern):
    return escaped_glob_replacement.sub(lambda match: escaped_glob_tokens_to_re[match.group(0)], re.escape(pattern))

if __name__ == '__main__':
    validity_paths_globs = (
        (True, 'foo.py', 'foo.py'),
        (True, 'foo.py', 'fo[o].py'),
        (True, 'fob.py', 'fo[!o].py'),
        (True, '*foo.py', '[*]foo.py'),
        (True, 'foo.py', '**/foo.py'),
        (True, 'baz/duck/bar/bam/quack/foo.py', '**/bar/**/foo.py'),
        (True, 'bar/foo.py', '**/foo.py'),
        (True, 'bar/baz/foo.py', 'bar/**'),
        (False, 'bar/baz/foo.py', 'bar/*'),
        (False, 'bar/baz/foo.py', 'bar**/foo.py'),
        (True, 'bar/baz/foo.py', 'bar/**/foo.py'),
        (True, 'bar/baz/wut/foo.py', 'bar/**/foo.py'),
    )
    results = []
    for seg in validity_paths_globs:
        valid, path, glob_pat = seg
        print('valid:', valid)
        print('path:', path)
        print('glob pattern:', glob_pat)
        re_pat = glob_to_re(glob_pat)
        print('RE pattern:', re_pat)
        match = re.fullmatch(re_pat, path)
        print('match:', match)
        result = bool(match) == valid
        results.append(result)
        print('result was expected:', result)
        print('-'*79)
    print('all results were expected:', all(results))
    print('='*79)

Answered By: Pugsley

My solution is similar to Nizam’s but with a few changes:

  1. Support for ** wildcards
  2. Prevents patterns like [^abc] from matching /
  3. Updated to use fnmatch.translate() from Python 3.8.13 as a base

WARNING:

There are some slight differences to glob.glob() which this solution suffers from (along with most of the other solutions), feel free to suggest changes in the comments if you know how to fix them:

  1. * and ? should not match file names starting with .
  2. ** should also match 0 folders when used like /**/

Code:

import re

def glob_to_re(pat: str) -> str:
    """Translate a shell PATTERN to a regular expression.

    Derived from `fnmatch.translate()` of Python version 3.8.13
    SOURCE: https://github.com/python/cpython/blob/v3.8.13/Lib/fnmatch.py#L74-L128
    """

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i+1
        if c == '*':
            # -------- CHANGE START --------
            # prevent '*' matching directory boundaries, but allow '**' to match them
            j = i
            if j < n and pat[j] == '*':
                res = res + '.*'
                i = j+1
            else:
                res = res + '[^/]*'
            # -------- CHANGE END ----------
        elif c == '?':
            # -------- CHANGE START --------
            # prevent '?' matching directory boundaries
            res = res + '[^/]'
            # -------- CHANGE END ----------
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j]
                if '--' not in stuff:
                    stuff = stuff.replace('\\', r'\\\\')
                else:
                    chunks = []
                    k = i+2 if pat[i] == '!' else i+1
                    while True:
                        k = pat.find('-', k, j)
                        if k < 0:
                            break
                        chunks.append(pat[i:k])
                        i = k+1
                        k = k+3
                    chunks.append(pat[i:j])
                    # Escape backslashes and hyphens for set difference (--).
                    # Hyphens that create ranges shouldn't be escaped.
                    stuff = '-'.join(s.replace('\\', r'\\\\').replace('-', r'\-')
                                     for s in chunks)
                # Escape set operations (&&, ~~ and ||).
                stuff = re.sub(r'([&~|])', r'\\\1', stuff)
                i = j+1
                if stuff[0] == '!':
                    # -------- CHANGE START --------
                    # ensure sequence negations don't match directory boundaries
                    stuff = '^/' + stuff[1:]
                    # -------- CHANGE END ----------
                elif stuff[0] in ('^', '['):
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    return r'(?s:%s)\Z' % res

Test Cases:

Here are some test cases comparing the built-in fnmatch.translate() to the above glob_to_re().

import fnmatch
import re

test_cases = [
    # path, pattern, old_should_match, new_should_match
    ("/path/to/foo", "*", True, False),
    ("/path/to/foo", "**", True, True),
    ("/path/to/foo", "/path/*", True, False),
    ("/path/to/foo", "/path/**", True, True),
    ("/path/to/foo", "/path/to/*", True, True),
    ("/path/to", "/path?to", True, False),
    ("/path/to", "/path[!abc]to", True, False),
]

for path, pattern, old_should_match, new_should_match in test_cases:

    old_re = re.compile(fnmatch.translate(pattern))
    old_match = bool(old_re.match(path))
    if old_match is not old_should_match:
        raise AssertionError(
            f"regex from `fnmatch.translate()` should match path "
            f"'{path}' when given pattern: {pattern}"
        )

    new_re = re.compile(glob_to_re(pattern))
    new_match = bool(new_re.match(path))
    if new_match is not new_should_match:
        raise AssertionError(
            f"regex from `glob_to_re()` should match path "
            f"'{path}' when given pattern: {pattern}"
        )

Example:

Here is an example that uses glob_to_re() with a list of strings.

glob_pattern = "/path/to/*.txt"
glob_re = re.compile(glob_to_re(glob_pattern))

input_paths = [
    "/path/to/file_1.txt",
    "/path/to/file_2.txt",
    "/path/to/folder/file_3.txt",
    "/path/to/folder/file_4.txt",
]

filtered_paths = [path for path in input_paths if glob_re.match(path)]
# filtered_paths = ["/path/to/file_1.txt", "/path/to/file_2.txt"]

Answered By: Mathew Wicks