Python – Check for exact string in file name
Question:
I have a folder where each file is named after a number (e.g. img 1, img 2, img-3, 4-img, etc.). I want to get files by exact number: if I enter '4' as input, it should return only files containing '4', and not files containing '14' or '40', for example. My problem is that the program returns every file whose name contains the string anywhere. Note that the number isn't always in the same position (for some files it's at the end, for others it's in the middle).
For instance, if my folder has the files ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'] and I want only files with the exact number 4 in them, then I would only want to return ['ep 4', 'img4', '4xxx', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'].
Here is what I have (in this case I only want to return files of type mp4):
for (root, dirs, file) in os.walk(source_folder):
    for f in file:
        if '.mp4' and ('4') in f:
            print(f)
I also tried == instead of in.
Answers:
We can use re.search along with a list comprehension for a regex option:
import re

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
num = 4
# (?<!\d) and (?!\d) require that no digit sits directly before or after the number
regex = r'(?<!\d)' + str(num) + r'(?!\d)'
output = [f for f in files if re.search(regex, f)]
print(output)  # ['ep 4', 'img4', '4xxx', 'file.mp4', 'file 4.mp4']
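For context on why the loop in the question prints too much: `'.mp4' and ('4') in f` is parsed as `'.mp4' and ('4' in f)`, and a non-empty string is always truthy, so only the substring test `'4' in f` has any effect. The sketch below is one possible way to combine the lookaround regex with an extension check so that the `4` in `.mp4` is never considered. The helper name `has_exact_number`, its `extension` parameter, and the sample names are assumptions made for illustration, not part of the original answer.

```python
import os
import re


def has_exact_number(file_name, number, extension=".mp4"):
    """Return True if file_name has the given extension and its stem
    contains `number` with no digit directly before or after it."""
    stem, ext = os.path.splitext(file_name)
    if ext.lower() != extension:
        return False
    # re.escape is a precaution in case `number` is ever passed as arbitrary text
    pattern = r"(?<!\d)" + re.escape(str(number)) + r"(?!\d)"
    return re.search(pattern, stem) is not None


names = ["ep 4.mp4", "ep-40.mp4", "file.mp4", "img4.mp4", "14.mp4", "ep 4.txt"]
print([n for n in names if has_exact_number(n, 4)])  # ['ep 4.mp4', 'img4.mp4']
```

Because the regex only sees the stem returned by `os.path.splitext`, `file.mp4` is rejected even though the lookarounds alone would accept it.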
This can be accomplished with the following function:
import os

files = ["ep 4", "xxx 3 ", "img4", "4xxx", "ep-40", "file.mp4", "file 4.mp4"]
desired_output = ["ep 4", "img4", "4xxx", "file 4.mp4"]


def number_filter(files, number):
    filtered_files = []
    for file_name in files:
        # if the number is not present, we can skip this file
        if file_name.count(str(number)) == 0:
            continue
        # if the number is present in the extension, but not in the file name, we can skip this file
        name, ext = os.path.splitext(file_name)
        if (
            isinstance(ext, str)
            and ext.count(str(number)) > 0
            and isinstance(name, str)
            and name.count(str(number)) == 0
        ):
            continue
        # if the number is present in the file name, we must determine if it's part of a different number
        num_index = file_name.index(str(number))
        end_index = num_index + len(str(number))
        # if the number is at the beginning of the file name
        if num_index == 0:
            # check if the next character (if there is one) is a digit
            if end_index < len(file_name) and file_name[end_index].isdigit():
                continue
        # if the number is at the end of the file name
        elif num_index == len(file_name) - len(str(number)):
            # check if the previous character is a digit
            if file_name[num_index - 1].isdigit():
                continue
        # if it's somewhere in the middle
        else:
            # check if the previous and next characters are digits
            if (
                file_name[num_index - 1].isdigit()
                or file_name[end_index].isdigit()
            ):
                continue
        print(file_name)
        filtered_files.append(file_name)
    return filtered_files


output = number_filter(files, 4)
for file in output:
    assert file in desired_output
for file in desired_output:
    assert file in output
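One limitation worth noting: the function above inspects only the first occurrence of the number (the one returned by `index`), so a name such as `14 ep 4` is skipped because the first `4` belongs to `14`. A minimal sketch of a variant that checks every occurrence follows. The name `contains_standalone_number` is hypothetical, and the sketch deliberately leaves out the extension handling.

```python
def contains_standalone_number(text, number):
    """Return True if `number` appears in `text` with no digit on either side."""
    target = str(number)
    start = text.find(target)
    while start != -1:
        end = start + len(target)
        # a boundary is fine if we are at the edge of the text or next to a non-digit
        before_ok = start == 0 or not text[start - 1].isdigit()
        after_ok = end == len(text) or not text[end].isdigit()
        if before_ok and after_ok:
            return True
        # otherwise look for the next occurrence further right
        start = text.find(target, start + 1)
    return False


for name in ["14 ep 4", "ep-40", "4", "404ep"]:
    print(name, contains_standalone_number(name, 4))
```

The edge checks (`start == 0`, `end == len(text)`) also cover a name that consists of nothing but the number.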
Judging by your inputs, your desired regular expression needs to meet the following criteria:
- Match the number provided, exactly
- Ignore number matches in the file extension, if present
- Handle file names that include spaces
I think this will meet all these requirements:
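import re


def generate(n):
    # name part: no dots or digits around the number; optional extension: a dot plus anything
    return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')


def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(f)]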
Usage:
>>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
['ep 4', 'img4', '4xxx', 'file 4.mp4']
Note that this solution involves creating a Pattern object and using that object to check each file. This strategy offers a small performance benefit over calling re.fullmatch with the pattern and filename directly: the re module caches recently compiled patterns, so the saving is mostly the per-call cache lookup rather than a full recompilation.
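To tie this back to the `os.walk` loop in the question, here is a minimal sketch that compiles the pattern once and reuses it for every file found under a folder. The function name `find_matching_files` and the temporary folder with sample files are assumptions made for illustration. `re.escape` is added as a precaution and is not part of the original answer.

```python
import os
import re
import tempfile


def find_matching_files(source_folder, n):
    """Walk source_folder and return paths whose file name contains n exactly."""
    # compiled once, reused for every file name
    regex = re.compile(r"^[^.\d]*" + re.escape(str(n)) + r"[^.\d]*(\..*)?$")
    matches = []
    for root, dirs, file_names in os.walk(source_folder):
        for file_name in file_names:
            if regex.fullmatch(file_name):
                matches.append(os.path.join(root, file_name))
    return matches


# Demonstration on a throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as folder:
    for name in ["ep 4.mp4", "ep-40.mp4", "file.mp4", "img4.mp4"]:
        open(os.path.join(folder, name), "w").close()
    found = sorted(os.path.basename(p) for p in find_matching_files(folder, 4))
    print(found)  # ['ep 4.mp4', 'img4.mp4']
```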
This solution does have one drawback: it assumes that filenames are formatted as name.extension and that the value you're searching for is in the name part. The pattern alone cannot tell a . inside the name from the . that starts the extension, so if you allow file names containing . then you won't be able to exclude extensions from the search. Ergo, modifying this to match ep.4 would also cause it to match file.mp4. That being said, there is a workaround for this, which is to strip extensions from the file name before doing the match:
import re


def generate(n):
    return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')


def strip_extension(f):
    # str.removesuffix requires Python 3.9 or later
    return f.removesuffix('.mp4')


def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]
Note that this solution now allows the . character in the match condition and does not exclude an extension. Instead, it relies on preprocessing (the strip_extension function, which as written handles only .mp4) to remove the file extension from the filename before matching.
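If more than one file type is involved, the preprocessing step could be extended along the following lines. This is a sketch under stated assumptions: the `KNOWN_EXTENSIONS` tuple is a hypothetical list you would adapt to your own files. It uses an explicit list rather than `os.path.splitext` because `splitext` would treat `.4 ` in `ep.4 ` as the extension and strip the number itself.

```python
import re

# Hypothetical list of extensions to ignore; adjust to the files you actually have
KNOWN_EXTENSIONS = (".mp4", ".mkv", ".avi")


def strip_extension(file_name):
    """Remove a trailing known extension (case-insensitive), if present."""
    lowered = file_name.lower()
    for ext in KNOWN_EXTENSIONS:
        if lowered.endswith(ext):
            return file_name[: -len(ext)]
    return file_name


def check_files(n, files):
    regex = re.compile(r"^[^\d]*" + re.escape(str(n)) + r"[^\d]*$")
    return [f for f in files if regex.fullmatch(strip_extension(f))]


files = ["ep.4.", "ep.4 ", "file.mp4", "file 4.mp4", "ep 4.mkv", "ep-40.avi"]
print(check_files(4, files))  # ['ep.4.', 'ep.4 ', 'file 4.mp4', 'ep 4.mkv']
```

Keep in mind that this pattern accepts a name only if the number is the sole run of digits in it, so a name like `s01 ep 4` would be rejected.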
As an addendum, occasionally you'll get files that have the number prefixed with zeroes (e.g. 004, 0001, etc.). You can modify the regular expression to handle this case as well:
def generate(n):
    # 0* allows any number of leading zeroes directly before the number
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')
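The same leading-zero idea can be combined with lookarounds, which also tolerates other numbers elsewhere in the name. The following is a minimal sketch rather than a drop-in replacement: `matches_number` and the sample names are hypothetical, and extension stripping is assumed to have happened beforehand.

```python
import re


def matches_number(file_name, number):
    """True if `number`, optionally padded with leading zeroes, appears
    with no other digit directly before or after it."""
    pattern = r"(?<!\d)0*" + re.escape(str(number)) + r"(?!\d)"
    return re.search(pattern, file_name) is not None


names = ["ep 004", "img4", "ep 40", "ep 104", "ep 0040", "s01 ep 004"]
print([n for n in names if matches_number(n, 4)])  # ['ep 004', 'img4', 's01 ep 004']
```

The lookbehind is placed before the zeroes, so `104` is rejected: the run `04` is preceded by the digit `1`.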