Create (sane/safe) filename from any (unsafe) string

Question:

I want to create a sane/safe filename (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).

(It doesn’t matter for me whether the function is Cocoa, ObjC, Python, etc.)


Of course, there might be infinitely many characters that count as strange, so a blacklist that keeps growing over time is not really a solution.

I could use a whitelist, but I don’t really know how to define it. [a-zA-Z0-9 .] is a start, but I also want to accept Unicode characters that can be displayed in a normal way.

Asked By: Albert


Answers:

Python:

for c in r'[]/\;,><&*:%=+@!#^()|?^':
    filename = filename.replace(c,'')

(just an example of characters you will want to remove)
The r in front of the string makes sure the string is interpreted in its raw format, allowing you to remove the backslash as well.

Edit:
regex solution in Python:

import re
filename = re.sub(r'[\[\]/\\;,><&*:%=+@!#^()|?]', '', filename)
Answered By: Remi

Python:

"".join([c for c in filename if c.isalpha() or c.isdigit() or c==' ']).rstrip()

this accepts Unicode characters but removes line breaks, etc.

example:

filename = u"ad\nbla'{-+)(ç?"

gives: adblaç

Edit:
str.isalnum() does the alphanumeric check in one step (per a comment from queueoverflow). danodonovan suggested also keeping the dot:

    keepcharacters = (' ','.','_')
    "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()
Answered By: Remi

My requirements were conservative (the generated filenames needed to be valid on multiple operating systems, including some ancient mobile OSs). I ended up with:

    "".join([c for c in text if re.match(r'\w', c)])

That whitelists the alphanumeric characters (a-z, A-Z, 0-9) and the underscore. Note that in Python 3, \w also matches non-ASCII letters and digits in str patterns unless the re.ASCII flag is passed. The regular expression can be compiled and cached for efficiency if there are a lot of strings to match. In my case, it wouldn’t have made any significant difference.
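
As a rough sketch of the compile-and-cache idea mentioned above, assuming Python 3 (where \w needs re.ASCII to stay limited to a-z, A-Z, 0-9 and the underscore). The names _WORD_CHAR and keep_word_chars are made up for illustration and are not part of the original answer:

```python
import re

# Compiled once at import time and reused on every call.
# re.ASCII limits \w to [a-zA-Z0-9_]; without it, Python 3 also keeps letters such as "ç".
_WORD_CHAR = re.compile(r"\w", re.ASCII)


def keep_word_chars(text):
    """Return text with every character outside [a-zA-Z0-9_] dropped."""
    return "".join(c for c in text if _WORD_CHAR.match(c))


print(keep_word_chars("my file (v2) ç.txt"))  # myfilev2txt
```

Dropping the re.ASCII flag would keep "ç" and other Unicode word characters, which may or may not be what you want on older filesystems.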

Answered By: Ngure Nyaga

There are a few reasonable answers here, but in my case I want to take a string that might contain spaces and punctuation and, rather than just removing those characters, replace them with underscores. Even though spaces are allowed in filenames on most OSes, they are problematic. Also, if the original string contained a period, I didn’t want it to pass through into the filename, where it would generate “extra extensions” that I might not want (I’m appending the extension myself).

def make_safe_filename(s):
    def safe_char(c):
        if c.isalnum():
            return c
        else:
            return "_"
    return "".join(safe_char(c) for c in s).rstrip("_")

print(make_safe_filename( "hello you crazy $#^#& 2579 people!!! : die!!!" ) + ".gif")

prints:

hello_you_crazy_______2579_people______die.gif

Answered By: uglycoyote

No solutions here, only problems that you must consider:

  • what is the smallest maximum filename length among the systems you target? (e.g. DOS supports only 8.3 names, i.e. 8-11 characters; most OSes don’t support more than roughly 255 characters)

  • what filenames are forbidden in some context? (Windows still doesn’t support saving a file as CON.TXT — see https://blogs.msdn.microsoft.com/oldnewthing/20031022-00/?p=42073)

  • remember that . and .. have specific meanings (current/parent directory) and are therefore unsafe.

  • is there a risk that filenames will collide — either due to removal of characters or the same filename being used multiple times?
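
To make the length concern concrete, here is a minimal sketch of truncating a name to a byte budget while keeping its extension. The function name truncate_filename, the 255-byte default and the UTF-8 encoding are assumptions for illustration; check the real limit of your target filesystem:

```python
def truncate_filename(name, max_bytes=255, encoding="utf-8"):
    """Shorten name to at most max_bytes encoded bytes, keeping the extension if possible."""
    if len(name.encode(encoding)) <= max_bytes:
        return name
    stem, dot, ext = name.rpartition(".")
    suffix = dot + ext
    # Keep the extension only if there is a real stem and the extension leaves room for it.
    if not stem or len(suffix.encode(encoding)) >= max_bytes:
        stem, suffix = name, ""
    budget = max_bytes - len(suffix.encode(encoding))
    # Slice the bytes, then drop a partially cut multi-byte character at the end.
    stem = stem.encode(encoding)[:budget].decode(encoding, errors="ignore")
    return stem + suffix


print(len(truncate_filename("a" * 300 + ".txt")))  # 255
```

Counting bytes rather than characters matters because many filesystems limit the encoded length, so a name made of non-ASCII characters hits the limit sooner.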

Consider just hashing the data and using the hex digest of that as the filename.
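
A minimal sketch of that hashing idea, assuming SHA-256 from the standard library is acceptable; hashed_filename and its extension parameter are hypothetical names, not something from the answer:

```python
import hashlib


def hashed_filename(data, extension=""):
    """Name a file after the SHA-256 hex digest of its bytes."""
    # The digest is 64 characters from 0-9a-f, which is safe on every common filesystem.
    return hashlib.sha256(data).hexdigest() + extension


print(hashed_filename(b"some file contents", ".bin"))
```

The trade-off is that the name is no longer readable, so this fits best when the original name is stored elsewhere (for example in a database).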

Answered By: Dragon

More or less what has been mentioned here with regexp, but in reverse (replace anything NOT listed):

>>> import re
>>> filename = u"ad\nbla'{-+\\)(ç1?"
>>> re.sub(r'[^\w\d-]','_',filename)
u'ad_bla__-_____1_'
Answered By: Filipe Pina

Here is what I came up with, inspired by uglycoyote:

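import time

def make_safe_filename(s):
    def safe_char(c):
        # Keep letters, digits and dots; replace everything else.
        if c.isalnum() or c == '.':
            return c
        else:
            return "_"

    safe = ""
    last_replaced = False
    for c in s:
        # Cap the length; a millisecond timestamp is appended to the truncated name.
        if len(safe) > 200:
            return safe + "_" + str(time.time_ns() // 1000000)

        safe_c = safe_char(c)
        curr_replaced = c != safe_c
        # Collapse runs of replaced characters into a single underscore.
        if not last_replaced or not curr_replaced:
            safe += safe_c
        last_replaced = curr_replaced
    return safe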

And to test:

print(make_safe_filename("hello you crazy $#^#& 2579 people!!! : hi!!!" * 20) + ".gif")
Answered By: Martin Kunc

I admit there are two schools of thought regarding DIY vs. dependencies. But I come from the firm school of thought that prefers not to reinvent wheels and to use canonical approaches for simple tasks like this. To that end, I am a fan of the pathvalidate library:

https://pypi.org/project/pathvalidate/

It includes a function, sanitize_filename(), which does what you’re after.

I would prefer this to any of the numerous home-baked solutions. Ideally, I’d like to see a sanitizer in os.path that is sensitive to filesystem differences and does not do unnecessary sanitizing. I imagine pathvalidate takes the conservative approach and produces valid filenames that can span at least NTFS and ext4 comfortably, but it’s hard to imagine it even bothers with old DOS constraints.

Answered By: Bernd Wechner

If you don’t mind importing other packages, werkzeug has a function for sanitizing filenames:

from werkzeug.utils import secure_filename

secure_filename("hello.exe")
'hello.exe'
secure_filename("/../../.ssh")
'ssh'
secure_filename("DROP TABLE")
'DROP_TABLE'

#fork bomb on Linux
secure_filename(": () {: |: &} ;:")
''

#delete all system files on Windows
secure_filename("del*.*")
'del'

https://pypi.org/project/Werkzeug/

Answered By: Anders_K

Another approach is to specify a replacement for each unwanted symbol. This way the filename may stay more readable.

>>> substitute_chars = {'/':'-', ' ':''}
>>> filename = 'Cedric_Kelly_12/10/2020 7:56 am_317168.pdf'
>>> "".join(substitute_chars.get(c, c) for c in filename)
'Cedric_Kelly_12-10-20207:56am_317168.pdf'
Answered By: Dmitry

The problem with many other answers is that they only deal with character substitutions, not other issues.

Here is a comprehensive universal solution. It handles all types of issues for you, including (but not limited to) character substitution. It should cover all the bases.

Works on Windows, *nix, and almost every other file system.

import re

def txt2filename(txt, chr_set='printable'):
    """Converts txt to a valid filename.

    Args:
        txt: The str to convert.
        chr_set:
            'printable':    Any printable character except those disallowed on Windows/*nix.
            'extended':     'printable' + extended ASCII character codes 128-255
            'universal':    For almost *any* file system. '-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    """

    FILLER = '-'
    MAX_LEN = 255  # Maximum length of filename is 255 bytes in Windows and some *nix flavors.

    # Step 1: Remove excluded characters.
    BLACK_LIST = set(chr(127) + r'<>:"/\|?*')                          # 127 is unprintable, the rest are illegal in Windows.
    white_lists = {
        'universal': set('-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'),
        'printable': {chr(x) for x in range(32, 127)} - BLACK_LIST,     # 0-31, 127 are unprintable,
        'extended' : {chr(x) for x in range(32, 256)} - BLACK_LIST,
    }
    white_list = white_lists[chr_set]
    result = ''.join(x
                     if x in white_list else FILLER
                     for x in txt)

    # Step 2: Device names, '.', and '..' are invalid filenames in Windows.
    DEVICE_NAMES = ('CON,PRN,AUX,NUL,COM1,COM2,COM3,COM4,'
                    'COM5,COM6,COM7,COM8,COM9,LPT1,LPT2,'
                    'LPT3,LPT4,LPT5,LPT6,LPT7,LPT8,LPT9,'
                    'CONIN$,CONOUT$,..,.').split(',')  # Searching this list is an O(n) operation.
    if result.upper() in DEVICE_NAMES:  # Windows matches device names case-insensitively.
        result = f'{FILLER}{result}{FILLER}'

    # Step 3: Truncate long files while preserving the file extension.
    if len(result) > MAX_LEN:
        if '.' in txt:
            result, _, ext = result.rpartition('.')
            ext = '.' + ext
        else:
            ext = ''
        result = result[:MAX_LEN - len(ext)] + ext

    # Step 4: Windows does not allow filenames to end with '.' or ' ' or begin with ' '.
    result = re.sub(r'[. ]$', FILLER, result)
    result = re.sub(r'^ ', FILLER, result)

    return result

It replaces non-printable characters even if they are technically valid in filenames, because they are not always simple to deal with.

No external libraries needed.

Answered By: ChaimG

I don’t recommend using any of the other answers. They’re bloated, use bad techniques, and replace tons of legal characters (some even removed all Unicode characters, which is nuts since they’re legal in filenames). A few of them even import huge libraries just for this tiny, easy job… that’s crazy.

Here’s a regex one-liner which efficiently replaces every illegal filesystem character and nothing else. No libraries, no bloat, just a perfectly legal filename in one simple command.

Reference: https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words

Regex:

clean = re.sub(r"[/\\?%*:|\"<>\x7F\x00-\x1F]", "-", dirty)

Usage:

import re

# Here's a dirty, illegal filename full of control-characters and illegal chars.
dirty = "".join(["\\[/\\?%*:|\"<>0x7F0x00-0x1F]", chr(0x1F) * 15])

# Clean it in one fell swoop.
clean = re.sub(r"[/\\?%*:|\"<>\x7F\x00-\x1F]", "-", dirty)

# Result: "-[----------0x7F0x00-0x1F]---------------"
print(clean)

This was an extreme example where almost every character is illegal, because we constructed the dirty string with the same list of characters that the regex removes, and we even padded with a bunch of "0x1F (ascii 31)" at the end just to show that it also removes illegal control-characters.

This is it. This regex is the only answer you need. It handles every illegal character on modern filesystems (Mac, Windows and Linux). Removing anything more beyond this would fall under the category of "beautifying" and has nothing to do with making legal disk filenames.


More work for Windows users:

After you’ve run this command, you could optionally also check the result against the list of "special device names" on Windows (a case-insensitive list of words such as "CON", "AUX", "COM0", etc).

The illegal words can be found at https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations in the "Reserved words" and "Comments" columns for the NTFS and FAT filesystems.

Filtering reserved words is only necessary if you plan to store the file on an NTFS or FAT-style disk, because Windows reserves certain "magic filenames" for internal usage. It reserves them case-insensitively and without caring about the extension, meaning that, for example, aux.c is an illegal filename on Windows (very silly).
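
A hedged sketch of such a check follows. The set below is a deliberately small starter list (CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9) and is an assumption, not the complete list from the linked table; the helper names are made up for this example:

```python
# Starter list only; consult the reference table for the full set of reserved words.
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_reserved_on_windows(filename):
    """True if the part before the first dot is a reserved device name."""
    # The comparison ignores case and the extension, so "aux.c" counts as "AUX".
    stem = filename.split(".", 1)[0]
    return stem.rstrip(" ").upper() in WINDOWS_RESERVED


def avoid_reserved(filename, prefix="_"):
    """Prefix reserved names so they no longer collide with a device name."""
    return prefix + filename if is_reserved_on_windows(filename) else filename


print(avoid_reserved("aux.c"))      # _aux.c
print(avoid_reserved("report.c"))   # report.c
```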

Mac/Linux filesystems don’t have silly limitations like that, so you don’t have to do anything else if you’re on a good filesystem. In fact, most of the "illegal characters" we filtered out in the regex are Windows-specific limitations. Mac/Linux filesystems can store most of them, but we filter them anyway since it makes the filenames portable to Windows machines.

Answered By: Mitch McMabers

Extra note for all other answers

Add a hash of the original string to the end of the filename. It will prevent conflicts
in case your conversion produces the same filename from different strings.
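
One way this could look, as a sketch only: with_hash_suffix and drop_non_alnum are hypothetical helpers, and the 8-character SHA-256 prefix is an arbitrary choice that trades a small collision risk for shorter names:

```python
import hashlib


def with_hash_suffix(original, sanitize, digest_chars=8):
    """Sanitize original, then append a short hash of the unsanitized string."""
    digest = hashlib.sha256(original.encode("utf-8")).hexdigest()[:digest_chars]
    return f"{sanitize(original)}_{digest}"


def drop_non_alnum(s):
    # Stand-in sanitizer for the demo; any sanitizing function could be passed instead.
    return "".join(c for c in s if c.isalnum())


# "a/b" and "a:b" both sanitize to "ab", but the suffixes keep the results apart.
print(with_hash_suffix("a/b", drop_non_alnum))
print(with_hash_suffix("a:b", drop_non_alnum))
```

If you append an extension yourself, add it after the hash so the suffix does not end up looking like part of the extension.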

Answered By: Alexander C