Using File Extension Wildcards in os.listdir(path)

Question:

I have a directory of files that I am trying to parse using Python. I wouldn’t have a problem if they all had the same extension, but for whatever reason they are created with sequential numeric extensions after their original extension. For example: foo.log, foo.log.1, foo.log.2, bar.log, bar.log.1, bar.log.2, etc. On top of that, the foo.log files are in XML format, while the bar.log files are not. What’s the best route to take in order to read and parse only the foo.log.* and foo.log files? The bar.log files do not need to be read. Below is my code:

import os
from lxml import etree
path = 'C:/foo/bar//'
listing = os.listdir(path)
for files in listing:
    if files.endswith('.log'):
        print files
        data = open(os.path.join(path, files), 'rb').read()
        tree = etree.fromstring(data)
        search = tree.findall('.//QueueEntry')

This doesn’t work: it doesn’t read any of the .log.* files, and the parser chokes on the files that are read but are not in XML format. Thanks!

Asked By: Dryden Long


Answers:

Maybe the glob module can help you:

import glob

listing = glob.glob('C:/foo/bar/foo.log*')
for filename in listing:
    print(filename)  # placeholder: do stuff with each matching file here
Answered By: stranac

This’ll give you bash-like wildcards (shell glob patterns, not true regular expressions):

import glob
print(glob.glob("/tmp/o*"))

Alternatively, you could os.listdir the entire directory, and throw away files that don’t match a regex via the re module.
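
A minimal sketch of that listdir-plus-regex route, in case it helps. The pattern and the helper name matching_logs are my own assumptions, not something from the question; the pattern assumes the numeric suffix is always a dot followed by digits.

```python
import os
import re

# Matches foo.log and foo.log.<digits>; rejects bar.log and names like foo.logx.
# (Assumed pattern: adjust it if the suffixes are not purely numeric.)
LOG_NAME = re.compile(r'foo\.log(\.\d+)?')

def matching_logs(path):
    """Return the full paths of entries in `path` whose names match LOG_NAME."""
    return sorted(
        os.path.join(path, name)
        for name in os.listdir(path)
        if LOG_NAME.fullmatch(name)  # fullmatch: the whole name must match
    )
```

Unlike a plain wildcard, the regex can insist that whatever follows foo.log is a numeric suffix, which is the main reason to pick this route over glob.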

Answered By: dstromberg

What’s the best route to take in order to read and parse only the foo.log.* and foo.log files? The bar.log files do not need to be read.

Your code does this:

if files.endswith('.log'):

You’ve just translated your English description into Python a bit wrong. What you wrote in Python is: “read and parse only the *.log files”, meaning bar.log is included, and foo.log.1 is not.

But if you think for a second, you can translate your English description directly into Python:

if files == 'foo.log' or files.startswith('foo.log.'):

And if you think about it, as long as there are no files such as foo.logs or foo.log2 (anything that continues after foo.log without a dot) that you want to skip, you can collapse the two cases into one:

if files.startswith('foo.log'):

However, if you know anything about POSIX shells, foo.log* matches exactly the same thing. (That’s not true for Windows shells, where wildcards treat extensions specially, which is why you have to type *.* instead of *.) And Python comes with a module that does POSIX-style wildcards, even on Windows, called glob. See stranac’s answer for how to use this.
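
A quick way to check that equivalence on the file names from the question, as a rough sketch. It uses fnmatch.fnmatchcase, the stdlib function for shell-style matching on plain strings, so nothing has to exist on disk.

```python
import fnmatch

names = ['foo.log', 'foo.log.1', 'foo.log.2',
         'bar.log', 'bar.log.1', 'bar.log.2']

# Filter once with the string test and once with the wildcard.
by_prefix = [n for n in names if n.startswith('foo.log')]
# fnmatchcase does shell-style matching without OS-specific case folding.
by_wildcard = [n for n in names if fnmatch.fnmatchcase(n, 'foo.log*')]

print(by_prefix == by_wildcard)  # True
print(by_wildcard)               # ['foo.log', 'foo.log.1', 'foo.log.2']
```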

I think the glob answer is better than manually filtering listdir. It’s simpler, it’s a more direct match for what your question title says you want to do (just do exactly what you hoped would work with os.listdir, but with glob.glob instead), and it’s more flexible. So, unless you’re worried about getting confused by the two slightly different meanings of wildcards, I’d suggest accepting that instead of this one.
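
For completeness, here is roughly what the whole loop might look like with glob.glob. This is only a sketch: it swaps in the standard library’s xml.etree.ElementTree for lxml so that it runs without extra installs, and the function name find_queue_entries is made up for illustration.

```python
import glob
import os
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

def find_queue_entries(directory):
    """Parse every foo.log* file in `directory` and collect QueueEntry elements."""
    entries = []
    pattern = os.path.join(directory, 'foo.log*')
    for filename in sorted(glob.glob(pattern)):
        tree = ET.parse(filename)  # bar.log* files are never opened
        entries.extend(tree.getroot().findall('.//QueueEntry'))
    return entries
```

Because the wildcard never matches the bar.log files, the parser only ever sees the XML ones.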

Answered By: abarnert

As several people have already mentioned, you can use glob.glob to find files using wildcards.
This is a very old question and I can’t write a comment, but someone pointed out that glob.glob can’t expand ~ in the path. For that you can use os.path.expanduser, and os.path.expandvars to expand environment variables.
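
Something along these lines should work. It is a small sketch; the helper name expand_and_glob and the example pattern are hypothetical.

```python
import glob
import os

def expand_and_glob(pattern):
    """Expand ~ and environment variables in `pattern`, then glob it."""
    expanded = os.path.expandvars(os.path.expanduser(pattern))
    return sorted(glob.glob(expanded))

# Hypothetical location; adjust to wherever the logs actually live.
print(expand_and_glob('~/logs/foo.log*'))
```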

Answered By: peyeco