Python glob but against a list of strings rather than the filesystem


The glob module uses the fnmatch module for individual path elements.

That means the path is split into the directory name and the filename, and if the directory name contains metacharacters (any of [, * or ?), these are expanded recursively.

If you have a list of strings that are simple filenames, then just using the fnmatch.filter() function is enough:

    import fnmatch

    matching = fnmatch.filter(filenames, pattern)

but if they contain full paths, you need to do more work, as the regular expression that fnmatch generates doesn't take path segments into account (wildcards don't exclude the separators, nor are they adjusted for cross-platform path matching).
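To make the difference concrete, here is a small sketch using made-up sample data (the names in `filenames` and `paths` are assumptions, not from the question). It shows `fnmatch.filter()` working as expected on bare filenames, and then letting `*` run across a path separator on full paths:

```python
import fnmatch

# Hypothetical sample data, for illustration only.
filenames = ["report.txt", "notes.md", "readme.txt"]
paths = ["a/b/c/f.txt", "a/b/c/d/f.txt"]

# Bare filenames: fnmatch.filter() does the whole job.
simple_matches = fnmatch.filter(filenames, "*.txt")
print(simple_matches)

# Full paths: '*' also matches the separator, so the path with the
# extra directory level matches too.
path_matches = fnmatch.filter(paths, "a/b/*/f.txt")
print(path_matches)
```

The second result includes the deeper path, which is usually not what a shell-style glob would give you.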

You can construct a simple trie from the paths, then match your pattern against that:

    import fnmatch
    import glob
    import os.path
    from itertools import product


    # Cross-Python dictionary views on the keys
    if hasattr(dict, 'viewkeys'):
        # Python 2
        def _viewkeys(d):
            return d.viewkeys()
    else:
        # Python 3
        def _viewkeys(d):
            return d.keys()


    def _in_trie(trie, path):
        """Determine if path is completely in trie"""
        current = trie
        for elem in path:
            try:
                current = current[elem]
            except KeyError:
                return False
        return None in current


    def find_matching_paths(paths, pattern):
        """Produce a list of paths that match the pattern.

        * paths is a list of strings representing filesystem paths
        * pattern is a glob pattern as supported by the fnmatch module

        """
        if os.altsep:  # normalise
            pattern = pattern.replace(os.altsep, os.sep)
        pattern = pattern.split(os.sep)

        # build a trie out of path elements; efficiently search on prefixes
        path_trie = {}
        for path in paths:
            if os.altsep:  # normalise
                path = path.replace(os.altsep, os.sep)
            _, path = os.path.splitdrive(path)
            elems = path.split(os.sep)
            current = path_trie
            for elem in elems:
                current = current.setdefault(elem, {})
            current.setdefault(None, None)  # sentinel

        matching = []

        current_level = [path_trie]
        for subpattern in pattern:
            if not glob.has_magic(subpattern):
                # plain element, element must be in the trie or there are
                # 0 matches
                if not any(subpattern in d for d in current_level):
                    return []
                matching.append([subpattern])
                current_level = [d[subpattern] for d in current_level if subpattern in d]
            else:
                # match all next levels in the trie that match the pattern;
                # skip the None sentinel, which fnmatch cannot handle
                matched_names = fnmatch.filter(
                    {k for d in current_level for k in d if k is not None},
                    subpattern)
                if not matched_names:
                    # nothing found
                    return []
                matching.append(matched_names)
                current_level = [d[n] for d in current_level for n in _viewkeys(d) & set(matched_names)]

        return [os.sep.join(p) for p in product(*matching)
                if _in_trie(path_trie, p)]

This mouthful can quickly find matches using globs anywhere along the path:

    >>> paths = ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar']
    >>> find_matching_paths(paths, '/foo/bar/*')
    ['/foo/bar/baz', '/foo/bar/bar']
    >>> find_matching_paths(paths, '/*/bar/b*')
    ['/foo/bar/baz', '/foo/bar/bar']
    >>> find_matching_paths(paths, '/*/[be]*/b*')
    ['/foo/bar/baz', '/foo/bar/bar', '/spam/eggs/baz']
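For a sense of the data structure that find_matching_paths walks, here is a minimal sketch of just the trie-building step, pulled out into a hypothetical helper `build_trie`. The helper name and the fixed `'/'` separator are assumptions for illustration; the function above uses `os.sep` and also strips drive letters.

```python
def build_trie(paths, sep="/"):
    """Hypothetical helper: nest one dict per path element."""
    trie = {}
    for path in paths:
        current = trie
        for elem in path.split(sep):
            current = current.setdefault(elem, {})
        current[None] = None  # sentinel: a complete path ends here
    return trie


trie = build_trie(['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar'])

# The leading '/' produces an empty first element, so the root has one key: ''.
print(sorted(trie['']))                # top-level directories
print(sorted(trie['']['foo']['bar']))  # names under /foo/bar
```

Each glob segment is then matched only against the keys at the current depth, which is why a wildcard cannot spill over into the next path element.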


Good artists copy; great artists steal.

I stole ;)

fnmatch.translate translates the globs ? and * to the regexes . and .* respectively. I tweaked it to emit [^/] and [^/]* instead, so that wildcards never match across a path separator.

    import re

    def glob2re(pat):
        """Translate a shell PATTERN to a regular expression.

        There is no way to quote meta-characters.
        """

        i, n = 0, len(pat)
        res = ''
        while i < n:
            c = pat[i]
            i = i+1
            if c == '*':
                #res = res + '.*'
                res = res + '[^/]*'
            elif c == '?':
                #res = res + '.'
                res = res + '[^/]'
            elif c == '[':
                j = i
                if j < n and pat[j] == '!':
                    j = j+1
                if j < n and pat[j] == ']':
                    j = j+1
                while j < n and pat[j] != ']':
                    j = j+1
                if j >= n:
                    res = res + '\\['
                else:
                    stuff = pat[i:j].replace('\\','\\\\')
                    i = j+1
                    if stuff[0] == '!':
                        stuff = '^' + stuff[1:]
                    elif stuff[0] == '^':
                        stuff = '\\' + stuff
                    res = '%s[%s]' % (res, stuff)
            else:
                res = res + re.escape(c)
        # Inline flags must lead the pattern; recent Python versions
        # reject them anywhere else.
        return '(?ms)' + res + r'\Z'
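The effect of that tweak is easiest to see with two hand-written regexes standing in for the two translations of the glob `a/b/*/f.txt`. These are illustrative patterns written for this sketch, not the literal output of fnmatch.translate or glob2re:

```python
import re

path_ok = "a/b/c/f.txt"
path_deep = "a/b/c/d/f.txt"

# '*' translated as '.*' (the fnmatch style): may cross '/'.
greedy = re.compile(r"a/b/.*/f\.txt\Z")
# '*' translated as '[^/]*' (the tweaked style): stays inside one segment.
segment = re.compile(r"a/b/[^/]*/f\.txt\Z")

greedy_hits = [p for p in (path_ok, path_deep) if greedy.match(p)]
segment_hits = [p for p in (path_ok, path_deep) if segment.match(p)]
print(greedy_hits)
print(segment_hits)
```

Only the segment-aware pattern rejects the path with the extra directory level.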

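This filter works à la fnmatch.filter. It uses re.match, which anchors the pattern at the start of the string; re.search also passes the tests below, but it can match a pattern against just the tail of a path.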

    def glob_filter(names,pat):
        return (name for name in names if re.match(glob2re(pat),name))

The glob patterns and strings found on this page pass this test:

    pat_dict = {
                'a/b/*/f.txt': ['a/b/c/f.txt', 'a/b/q/f.txt', 'a/b/c/d/f.txt','a/b/c/d/e/f.txt'],
                '/foo/bar/*': ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar'],
                '/*/bar/b*': ['/foo/bar/baz', '/foo/bar/bar'],
                '/*/[be]*/b*': ['/foo/bar/baz', '/foo/bar/bar'],
                '/foo*/bar': ['/foolicious/spamfantastic/bar', '/foolicious/bar']
            }

    for pat in pat_dict:
        print('pattern :\t{}\nstrings :\t{}'.format(pat,pat_dict[pat]))
        print('matched :\t{}\n'.format(list(glob_filter(pat_dict[pat],pat))))
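If you would rather not maintain a custom regex translator, a different route to the same segment-aware behaviour is to split both the pattern and each name on the separator and run fnmatch on each pair of segments. This is an alternative sketch, not the code from this answer; `segment_glob_filter` is a hypothetical name and the `'/'` separator is an assumption:

```python
import fnmatch


def segment_glob_filter(names, pattern, sep="/"):
    """Hypothetical helper: match path segments one by one."""
    pat_parts = pattern.split(sep)
    result = []
    for name in names:
        parts = name.split(sep)
        # Same number of segments, and every segment matches its sub-pattern.
        if len(parts) == len(pat_parts) and all(
            fnmatch.fnmatchcase(part, pat)
            for part, pat in zip(parts, pat_parts)
        ):
            result.append(name)
    return result


names = ['a/b/c/f.txt', 'a/b/q/f.txt', 'a/b/c/d/f.txt', 'a/b/c/d/e/f.txt']
matched = segment_glob_filter(names, 'a/b/*/f.txt')
print(matched)
```

Because each wildcard only ever sees a single segment, it cannot match a separator, at the cost of splitting every name on each call.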


On Python 3.4+ you can just use PurePath.match.

pathlib.PurePath(path_string).match(pattern)

On Python 3.3 or earlier (including 2.x), get pathlib from PyPI.

Note that to get platform-independent results (which will depend on why you're running this) you'd want to explicitly state PurePosixPath or PureWindowsPath.
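As a rough sketch of how that looks against a list of strings (using the sample paths from elsewhere on this page), with the path flavour chosen explicitly:

```python
from pathlib import PurePosixPath, PureWindowsPath

paths = ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar']

# An absolute pattern has to match the whole path.
matched = [p for p in paths if PurePosixPath(p).match('/foo/bar/*')]
print(matched)

# '*' stays within one path segment, so the deeper path is rejected.
deep = PurePosixPath('a/b/c/d/f.txt').match('a/b/*/f.txt')
print(deep)

# The flavour decides case sensitivity, whatever OS you run on.
posix_case = PurePosixPath('b.py').match('*.PY')
windows_case = PureWindowsPath('b.py').match('*.PY')
print(posix_case, windows_case)
```

Note that a relative pattern is matched from the right-hand end of the path, so `PurePosixPath('/foo/bar/baz').match('bar/b*')` is also true.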