Inverse glob – reverse engineer a wildcard string from file names
Question:
I want to generate a wildcard string from a pair of file names. Kind of an inverse-glob. Example:
file1 = 'some foo file.txt'
file2 = 'some bar file.txt'
assert 'some * file.txt' == inverse_glob(file1, file2)
Use difflib perhaps? Has this been solved already?
The application is a large set of data files with similar names. I want to compare each pair of file names and then present a comparison of pairs of files with “similar” names. I figure if I can do a reverse-glob on each pair, then those pairs with “good” wildcards (e.g. not lots*of*stars*.txt nor *) are good candidates for comparison. So I might take the output of this putative inverse_glob() and reject wildcards that have more than one * or for which glob() doesn’t produce exactly two files.
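A minimal sketch of that filtering step, assuming a hypothetical helper named is_good_wildcard (the name and the directory argument are illustrative, not part of any library). It rejects a bare star, rejects patterns with more than one star, and then checks that glob.glob finds exactly two files:

```python
import glob
import os
import tempfile


def is_good_wildcard(pattern, directory):
    """Hypothetical filter: accept a wildcard only if it has at most one
    star, is not a bare '*', and matches exactly two files in `directory`."""
    if pattern == '*' or pattern.count('*') > 1:
        return False
    # glob.escape protects special characters in the directory name itself.
    matches = glob.glob(os.path.join(glob.escape(directory), pattern))
    return len(matches) == 2


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('some foo file.txt', 'some bar file.txt', 'other.txt'):
        open(os.path.join(tmp, name), 'w').close()

    print(is_good_wildcard('some * file.txt', tmp))     # True: one star, two files
    print(is_good_wildcard('*.txt', tmp))               # False: matches three files
    print(is_good_wildcard('lots*of*stars*.txt', tmp))  # False: too many stars
```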
Answers:
For instance:
Filenames:
names = [('some foo file.txt', 'some bar file.txt', 'some * file.txt'),
         ("filename.txt", "filename2.txt", "filenam*.txt"),
         ("1filename.txt", "filename2.txt", "*.txt"),
         ("inverse_glob", "inverse_glob2", "inverse_glo*"),
         ("the 24MHz run new.sr", "the 16MHz run old.sr", "the *MHz run *.sr")]
def inverse_glob(…):
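import re

def inverse_glob(f1, f2, force_single_asterisk=None):
    def adjust_name(pp, diff):
        # Trim `diff` characters from the stem and append `diff + 1` '?'
        # placeholders, re-attaching the extension when the name has exactly
        # one dot. The result is one character longer than the input, so the
        # two names line up only when their lengths differ by one.
        if len(pp) == 2:
            return pp[0][:-diff] + '?'*(diff+1) + '.' + pp[1]
        else:
            return pp[0][:-diff] + '?' * (diff + 1)

    l1 = len(f1); l2 = len(f2)
    if l1 > l2:
        f2 = adjust_name(f2.split('.'), l1-l2)
    elif l2 > l1:
        f1 = adjust_name(f1.split('.'), l2-l1)

    # Keep characters that match position by position; mark the rest with '?'.
    result = ['?' for n in range(len(f1))]
    for i, c in enumerate(f1):
        if c == f2[i]:
            result[i] = c

    result = ''.join(result)
    # Collapse runs of two or more '?' into a single '*'.
    result = re.sub(r'\?{2,}', '*', result)
    if force_single_asterisk:
        # Merge everything from the first '*' through the last '*' into one '*'.
        result = re.sub(r'\*.+\*', '*', result)
    return result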
Usage:
for name in names:
    result = inverse_glob(name[0], name[1])
    print('{:20} <=> {:20} = {}'.format(name[0], name[1], result))
    assert name[2] == result
Output:
some foo file.txt    <=> some bar file.txt    = some * file.txt
filename.txt         <=> filename2.txt        = filenam*.txt
1filename.txt        <=> filename2.txt        = *.txt
inverse_glob         <=> inverse_glob2        = inverse_glo*
the 24MHz run new.sr <=> the 16MHz run old.sr = the *MHz run *.sr
Tested with Python 3.4.2.
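The usage above never exercises force_single_asterisk. As a small illustration of what that option's substitution does on its own, here is the same regular expression applied directly to one of the multi-star results (the variable names are only for this sketch):

```python
import re

pattern = 'the *MHz run *.sr'
# Greedy match from the first '*' through the last '*', replaced by one '*'.
single_star = re.sub(r'\*.+\*', '*', pattern)
print(single_star)  # the *.sr
```

The result is broader than the two-star pattern, so it may match more files than the original pair.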
Here’s what I use. It handles more than two files, and handles path separators appropriately, producing '**' where a recursive glob would be necessary:
import os
import re
import difflib

def bolg(filepaths, minOrphanCharacters=2):
    """
    Approximate inverse of `glob.glob`: take a sequence of `filepaths`
    and compute a glob pattern that matches them all. Only the star
    character will be used (no question marks or square brackets).

    Define an "orphan" substring as a sequence of characters, not
    including a file separator, that is sandwiched between two stars.
    Orphan substrings shorter than `minOrphanCharacters` will be
    reduced to a star. If you don't mind having short orphan
    substrings in your result, set `minOrphanCharacters=1` or 0.
    Then you might get ugly results like '*0*2*.txt' (which contains
    two orphan substrings, both of length 1).
    """
    if os.path.sep == '\\':
        # On Windows, convert to forward-slashes (Python can handle
        # it, and Windows doesn't permit them in filenames anyway):
        filepaths = [filepath.replace('\\', '/') for filepath in filepaths]
    out = ''
    for filepath in filepaths:
        if not out: out = filepath; continue
        # Replace differing characters with stars:
        out = ''.join(x[-1] if x[0] == ' ' or x[-1] == '/' else '*' for x in difflib.ndiff(out, filepath))
        # Collapse multiple consecutive stars into one:
        out = re.sub(r'\*+', '*', out)
    # Deal with short orphan substrings:
    if minOrphanCharacters > 1:
        pattern = r'\*+[^/]{0,%d}\*+' % (minOrphanCharacters - 1)
        while True:
            reduced = re.sub(pattern, '*', out)
            if reduced == out: break
            out = reduced
    # Collapse any intermediate-directory globbing into a double-star:
    out = re.sub(r'(^|/).*\*.*/', r'\1**/', out)
    return out
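To make the individual steps easier to follow, here is a small stand-alone sketch that runs the diff, star-collapsing, and orphan-reduction steps on one pair of names. The file names and variable names are made up for illustration, and the directory handling ('**') is left out:

```python
import difflib
import re

a = 'run01_a.txt'
b = 'run02_b.txt'

# Keep characters ndiff reports as common (entries starting with a space)
# and any '/'; replace every added or removed character with a star.
starred = ''.join(
    x[-1] if x[0] == ' ' or x[-1] == '/' else '*'
    for x in difflib.ndiff(a, b)
)

# Collapse runs of stars into one.
collapsed = re.sub(r'\*+', '*', starred)
print(collapsed)  # run0*_*.txt

# Equivalent of minOrphanCharacters=2: a run of fewer than two
# non-separator characters between stars is absorbed into a single star,
# repeating until the pattern stops changing.
orphan = r'\*+[^/]{0,1}\*+'
reduced = collapsed
while True:
    step = re.sub(orphan, '*', reduced)
    if step == reduced:
        break
    reduced = step
print(reduced)  # run0*.txt
```

The lone '_' between the two stars is the kind of short orphan the docstring describes. The loop matters because one pass can leave a new orphan behind: '*0*2*.txt' first becomes '*2*.txt' and only then '*.txt'.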