How do you identify a list of filepaths containing non-ascii characters?

Question:

I would like to print a list of file paths from a Windows directory which contain non-ASCII characters. The files are located in deeply nested subdirectories.

I have two pieces of the problem figured out:

  1. A conditional that can evaluate a character and determine its Unicode code point. ASCII covers code points 0 through 127, so code points of 128 or greater are non-ASCII characters:
if ord(i) >= 128
  2. A script that can extract file paths from a directory recursively:
import os

directory = r"C:\Temp"

[print(os.path.join(dp, f)) for dp, dn, filenames in os.walk(directory) for f in filenames]

I have tried to combine these two pieces of information in various ways:

[print(os.path.join(dp, f)) for dp, dn, filenames in os.walk(directory) for f in filenames if ord(f) < 128]

This doesn’t work because the input to ord is the whole file name, not an individual character. So I’ve tried various ways of turning f into a list of characters:

  1. List comprehension
[print(os.path.join(dp, f)) for dp, dn, filenames in os.walk(directory) for f in filenames if ord([x for x in list(f)] < 128)]

Error code: TypeError: '<' not supported between instances of 'list' and 'int'

  2. Unpacking
[print(os.path.join(dp, f)) for dp, dn, filenames in os.walk(directory) for f in filenames if ord([x for x in (*f)] < 128)]

Error code: SyntaxError: can't use starred expression here

  3. Adding a clause to the list comprehension:
[print(os.path.join(dp, f)) for dp, dn, filenames in os.walk(directory) for x in f in filenames if ord(x < 128)]

Error code: NameError: name 'f' is not defined

In case it’s helpful, this piece of code works: it removes non-ASCII characters from file paths in the directory. I just want a list of the files it changes, for my own sense of control over the process.

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)

[os.rename(os.path.join(dp, f), remove_non_ascii_1(os.path.join(dp, f))) for dp, dn, filenames in os.walk(directory) for f in filenames]

Asked By: oymonk


Answers:

def is_ascii_filename(fn):
    return all(ord(ch) < 128 for ch in fn)
Answered By: Mark Ransom
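
The answer gives only the predicate, so here is a minimal sketch of how it might be combined with the os.walk loop from the question to print the matching paths. The name find_non_ascii_paths is hypothetical and not part of the original answer. The sketch tests the full joined path because the rename code in the question also operates on the full path; pass f instead if only the file name matters.

```python
import os


def is_ascii_filename(fn):
    # True when every character is in the 7-bit ASCII range (0-127).
    return all(ord(ch) < 128 for ch in fn)


def find_non_ascii_paths(directory):
    # Hypothetical helper: walk the tree under `directory` and collect
    # every file path that contains at least one non-ASCII character.
    matches = []
    for dp, dn, filenames in os.walk(directory):
        for f in filenames:
            path = os.path.join(dp, f)
            if not is_ascii_filename(path):
                matches.append(path)
    return matches
```

Calling find_non_ascii_paths(directory) and printing each element of the result would give the list before any rename is run. The attempts in the question fail because ord accepts exactly one character; wrapping the per-character test in all() (or any() for the inverse) is what lets it apply to a whole string.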