Python – Check for exact string in file name

Question:

I have a folder where each file is named after a number (e.g. img 1, img 2, img-3, 4-img, etc.). I want to get files by exact string, so if I enter ‘4’ as an input, it should only return files with ‘4’ and not any files containing ‘14’ or ‘40’, for example. My problem is that the program returns all files as long as it matches the string. Note that the numbers aren’t always in the same spot (for some files it’s at the end, for others it’s in the middle).

For instance, if my folder has the files ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'], and I want only files with the exact number 4 in them, then I would only want to return ['ep 4', 'img4', '4xxx', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep']

Here is what I have (in this case I only want to return files of type mp4):

for (root, dirs, file) in os.walk(source_folder):
    for f in file:
        if '.mp4' and ('4') in f:
            print(f)

I also tried == instead of in.

Asked By: bull11trc


Answers:

We can use re.search along with a list comprehension for a regex option:

import re

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
num = 4
# (?<!\d) and (?!\d) reject a digit immediately before or after the number
regex = r'(?<!\d)' + str(num) + r'(?!\d)'
output = [f for f in files if re.search(regex, f)]
print(output)  # ['ep 4', 'img4', '4xxx', 'file.mp4', 'file 4.mp4']
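
As a side note on the loop in the question: the condition '.mp4' and ('4') in f is parsed as '.mp4' and ('4' in f), and a non-empty string is always truthy, so it reduces to '4' in f. The extension and the number need separate checks. Below is a minimal sketch of one way to do that with the same lookaround pattern; match_number and its ext parameter are names made up for this illustration, and it assumes os.path.splitext splits the name where you expect (it treats the last dot as the start of the extension, so a name like 'ep.4 ' would be split as 'ep' plus '.4 ').

```python
import os
import re


def match_number(names, num, ext=None):
    """Return names whose stem contains num as a standalone number.

    Hypothetical helper: if ext is given (e.g. '.mp4'), only names with
    that extension are considered.
    """
    # Reject a digit directly before or after the number
    pattern = re.compile(r'(?<!\d)' + re.escape(str(num)) + r'(?!\d)')
    result = []
    for name in names:
        # Split off the extension so the 4 in '.mp4' is never searched
        stem, suffix = os.path.splitext(name)
        if ext is not None and suffix.lower() != ext:
            continue
        if pattern.search(stem):
            result.append(name)
    return result


files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
print(match_number(files, 4))              # ['ep 4', 'img4', '4xxx', 'file 4.mp4']
print(match_number(files, 4, ext='.mp4'))  # ['file 4.mp4']
```

The file names yielded by os.walk can be passed in as names in the same way.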
Answered By: Tim Biegeleisen

This can be accomplished with the following function:

import os


files = ["ep 4", "xxx 3 ", "img4", "4xxx", "ep-40", "file.mp4", "file 4.mp4"]
desired_output = ["ep 4", "img4", "4xxx", "file 4.mp4"]


def number_filter(files, number):
    filtered_files = []
    for file_name in files:

        # if the number is not present, we can skip this file
        if file_name.count(str(number)) == 0:
            continue

        # if the number is present in the extension, but not in the file name, we can skip this file
        name, ext = os.path.splitext(file_name)

        if (
            isinstance(ext, str)
            and ext.count(str(number)) > 0
            and isinstance(name, str)
            and name.count(str(number)) == 0
        ):
            continue

        # if the number is present in the file name, we must determine if it's part of a different number
        num_index = file_name.index(str(number))

        # if the number is at the beginning of the file name
        if num_index == 0:
            # check if the next character (if there is one) is a digit
            next_index = num_index + len(str(number))
            if next_index < len(file_name) and file_name[next_index].isdigit():
                continue

        # if the number is at the end of the file name
        elif num_index == len(file_name) - len(str(number)):
            # check if the previous character is a digit
            if file_name[num_index - 1].isdigit():
                continue

        # if it's somewhere in the middle
        else:
            # check if the previous and next characters are digits
            if (
                file_name[num_index - 1].isdigit()
                or file_name[num_index + len(str(number))].isdigit()
            ):
                continue

        print(file_name)
        filtered_files.append(file_name)

    return filtered_files


output = number_filter(files, 4)

for file in output:
    assert file in desired_output

for file in desired_output:
    assert file in output
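
One limitation worth noting: file_name.index returns only the first occurrence of the number, so a name such as "14 ep 4" is rejected because the first 4 it finds belongs to 14. A rough sketch of a variant that inspects every occurrence is below; has_standalone_number is a hypothetical helper name, and unlike the function above it does not treat the extension specially.

```python
def has_standalone_number(file_name, number):
    """Return True if number appears in file_name with no digit on either side."""
    target = str(number)
    start = file_name.find(target)
    while start != -1:
        end = start + len(target)
        # The edges of the string count as non-digit neighbours
        before_ok = start == 0 or not file_name[start - 1].isdigit()
        after_ok = end == len(file_name) or not file_name[end].isdigit()
        if before_ok and after_ok:
            return True
        # Otherwise keep looking for a later occurrence
        start = file_name.find(target, start + 1)
    return False


print(has_standalone_number("14 ep 4", 4))  # True
print(has_standalone_number("ep-40", 4))    # False
print(has_standalone_number("404ep", 4))    # False
```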

Answered By: CpE_Sklarr

Judging by your inputs, your desired regular expression needs to meet the following criteria:

  1. Match the number provided, exactly
  2. Ignore number matches in the file extension, if present
  3. Handle file names that include spaces

I think this will meet all these requirements:

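import re

def generate(n):
    # [^.\d]* allows any characters except dots and digits around the number;
    # (\..*)? allows an optional extension after the name
    return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(f)]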

Usage:

>>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
['ep 4', 'img4', '4xxx', 'file 4.mp4']

Note that this solution involves creating a Pattern object and using that object to check each file. This strategy offers a performance benefit over calling re.fullmatch with the pattern and filename directly, as the pattern does not have to be compiled for each call.

This solution does have one drawback: it assumes that filenames are formatted as name.extension and that the value you’re searching for is in the name part. The pattern treats the first . as the start of the extension, so if you allow file names that contain a . then you won’t be able to exclude extensions from the search. Ergo, modifying this to match ep.4 would also cause it to match file.mp4. That being said, there is a workaround for this, which is to strip extensions from the file name before doing the match:

import re

def generate(n):
    return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')

def strip_extension(f):
    # str.removesuffix requires Python 3.9 or later
    return f.removesuffix('.mp4')

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]

Note that this solution now includes the . in the match condition and does not exclude an extension. Instead, it relies on preprocessing (the strip_extension function) to remove any file extensions from the filename before matching.
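
If the folder holds more than one file type, the stripping step can be generalised. The following is only a sketch under assumptions: KNOWN_EXTENSIONS is a made-up list that you would replace with the extensions you actually expect, and the matching pattern is the same digit-free-surroundings pattern as above.

```python
import re

# Assumed list of extensions for illustration; adjust to your folder
KNOWN_EXTENSIONS = ('.mp4', '.mkv', '.avi')


def strip_extension(name, extensions=KNOWN_EXTENSIONS):
    """Remove one known extension from the end of name, if present."""
    lowered = name.lower()
    for ext in extensions:
        if lowered.endswith(ext):
            return name[:-len(ext)]
    return name


def check_files(n, files):
    # No digits are allowed anywhere except the number itself
    regex = re.compile(r'^[^\d]*' + re.escape(str(n)) + r'[^\d]*$')
    return [f for f in files if regex.fullmatch(strip_extension(f))]


files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4',
         'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep']
print(check_files(4, files))
# ['ep 4', 'img4', '4xxx', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ']
```

Note that '404ep' is not returned, since the 4s there are part of the number 404.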

As an addendum, occasionally you’ll get files that have the number prefixed with zeroes (e.g. 004, 0001). You can modify the regular expression to handle this case as well:

def generate(n):
    # 0* permits any number of leading zeroes directly before the number
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')
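
To illustrate what the zero-padded pattern accepts and rejects, here is a small self-contained check. The file names in it are invented for the example.

```python
import re


def generate(n):
    # 0* permits any number of leading zeroes directly before the number
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')


pattern = generate(4)
names = ['ep 004', 'ep 4', 'ep 040', 'ep 104', '0004.']
matches = [name for name in names if pattern.fullmatch(name)]
print(matches)  # ['ep 004', 'ep 4', '0004.']
```

Padded forms such as 004 match, while 040 and 104 are still rejected because they are different numbers.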
Answered By: Woody1193