Benefits of os.path.splitext over regular .split?

Question:

In this other question, the votes clearly show that the os.path.splitext function is preferred over the simple .split('.')[-1] string manipulation. Does anyone have a moment to explain exactly why that is? Is it faster, or more accurate, or what? I’m willing to accept that there’s something better about it, but I can’t immediately see what it might be. Might importing a whole module to do this be overkill, at least in simple cases?

EDIT: The OS specificity is a big win that’s not immediately obvious; but even I should’ve seen the “what if there isn’t a dot” case! And thanks to everybody for the general comments on library usage.

Asked By: Jenn D.

Answers:

In the comment to the answer that provided this solution:

“If the file has no extension this incorrectly returns the file name instead of an empty string.”

Not every file has an extension.

Answered By: chryss

A clearly defined and documented method to get the file extension would always be preferred over splitting a string willy nilly because that method would be more fragile for various reasons.

Edit: This is not language specific.

Answered By: Allen Rice

os.path.splitext will correctly handle the situation where the file has no extension and return an empty string. .split will return the name of the file.
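A minimal sketch of that difference, using a hypothetical file name `README` that contains no dot:

```python
import os.path

name = "README"  # hypothetical file name without an extension

# Naive approach: with no dot present, split() returns a one-element list,
# so [-1] is the whole file name.
naive_ext = name.split(".")[-1]

# Library approach: the extension comes back as an empty string.
root, ext = os.path.splitext(name)

print(repr(naive_ext))        # 'README'
print(repr(root), repr(ext))  # 'README' ''
```

Note that when an extension does exist, splitext keeps the leading dot (for example '.txt'), whereas the split approach drops it.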

Answered By: Andrew Grant

The first and most obvious difference is that the split call has no logic in it to default when there is no extension.

This can also be accomplished with a regex, which works as a one-liner without extra imports from the path library yet still returns an empty string if the extension isn't there.
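One possible shape for such a regex is sketched below. The pattern and the helper name `regex_ext` are illustrative assumptions, not something from the answer, and the pattern does not reproduce every edge case that os.path.splitext handles (a name made only of dots, for instance).

```python
import re

# Hypothetical pattern: a dot followed by characters that are neither a dot
# nor a path separator, anchored at the end of the string. The lookbehind
# requires a non-separator character before the dot, so a leading-dot
# name such as ".bashrc" yields no extension.
EXT_RE = re.compile(r"(?<=[^/\\])\.[^./\\]*$")


def regex_ext(path):
    """Return the extension (including the dot), or '' if there is none."""
    match = EXT_RE.search(path)
    return match.group(0) if match else ""


print(repr(regex_ext("archive.tar.gz")))  # '.gz'
print(repr(regex_ext("README")))          # ''
```

Even in this small form, the pattern has to know about both separator styles and about leading dots, which is the kind of detail the library function already takes care of.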

Also, the path library can handle different path contexts, where folders are delimited by different separators.

Answered By: DevelopingChris

There exist operating systems that do not use ‘.’ as an extension separator.

(Notably, RISC OS by convention uses ‘/’, since ‘.’ is used there as a path separator.)
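Python exposes the extension separator as `os.extsep`, so code does not have to hard-code '.'. A small sketch, with a hypothetical file name:

```python
import os

# os.extsep is the character the current platform uses to separate the
# base name from the extension; it is '.' on mainstream platforms today.
filename = "report" + os.extsep + "txt"  # hypothetical file name

root, ext = os.path.splitext(filename)
print(repr(root), repr(ext))
```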

Answered By: bobince

Right tool for the right job
Already thoroughly debugged and tested as part of the Python standard library – no bugs introduced by mistakes in your hand-rolled version (e.g. what if there is no extension, or the file is a hidden file on UNIX like ‘.bashrc’, or there are multiple extensions?)
Being designed for this purpose, the function has useful return values (basename, ext) for the filename passed, which can be more useful in certain cases versus having to split the path manually (again, edge cases could be a concern when figuring out the basename – ext)
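The edge cases listed above can be compared side by side. This sketch uses `posixpath`, the implementation behind os.path on UNIX-like systems, so the results do not depend on where it is run; the file names are made up for illustration.

```python
import posixpath  # the os.path implementation used on UNIX-like systems

names = ["notes.txt", "Makefile", ".bashrc", "archive.tar.gz"]

# Map each name to (naive result, library result).
results = {
    name: (name.split(".")[-1], posixpath.splitext(name))
    for name in names
}

for name, (naive, (root, ext)) in results.items():
    print(f"{name!r}: naive={naive!r} splitext=({root!r}, {ext!r})")
```

The naive version reports 'Makefile' and 'bashrc' as extensions, while splitext returns an empty extension for both and splits 'archive.tar.gz' into 'archive.tar' and '.gz'.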

The only reason to worry about importing the module is concern for overhead – that’s not likely to be a concern in the vast majority of cases, and if it is that tight then it’s likely other overhead in Python will be a bigger problem before that.

Answered By: Jay

Well, there are separate implementations for separate operating systems. This means that if the logic to extract the extension of a file differs on Mac from that on Linux, the distinction will be handled by those implementations. I don’t know of any such distinction, so there might be none.

Edit: @Brian comments that an example like /directory.ext/file would of course not work with a simple .split('.') call, and you would have to know both that directories can use extensions, as well as the fact that on some operating systems, forward slash is a valid directory separator.

This just emphasizes the “use a library routine unless you have a good reason not to” part of my answer.

Thanks @Brian.

Additionally, where a file doesn’t have an extension, you would have to build in logic to handle that case. And what if the thing you try to split is a directory name ending with a backslash? Then there is neither a filename nor an extension.
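Both cases can be sketched with the platform-specific modules that back os.path (`posixpath` and `ntpath`), which makes the behavior reproducible on any machine. The Windows-style path here is a made-up example.

```python
import ntpath     # os.path implementation for Windows-style paths
import posixpath  # os.path implementation for POSIX-style paths

# A directory with a dot in its name, containing a file without an extension.
posix_path = "/directory.ext/file"
posix_naive = posix_path.split(".")[-1]          # 'ext/file'
posix_split = posixpath.splitext(posix_path)     # ('/directory.ext/file', '')

# A directory name ending with a backslash: no file name and no extension.
win_dir = "C:\\data.d\\"                         # hypothetical path
win_naive = win_dir.split(".")[-1]               # 'd\\'
win_split = ntpath.splitext(win_dir)             # ('C:\\data.d\\', '')

print(repr(posix_naive), posix_split)
print(repr(win_naive), win_split)
```

In both cases splitext ignores the dot because it appears before the last path separator.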

The rule should be that unless you have a specific reason not to use a library function that does what you want, use it. This will avoid you having to maintain and bugfix code others have perfectly good solutions to.

Answered By: Lasse V. Karlsen

splitext() does a reverse search for ‘.’ and returns the extension portion as soon as it finds it. split('.') will do a forward search for all ‘.’ characters and is therefore almost always slower. In other words splitext() is specifically written for returning the extension unlike split().

(see posixpath.py in the Python source if you want to examine the implementation).
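The reverse-search idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions (a single separator character, a hypothetical function name `simple_splitext`), not the standard library's actual code, which also handles alternative separators and names that start with several dots.

```python
def simple_splitext(path, sep="/", extsep="."):
    """Simplified illustration of a reverse-search split; not the stdlib code."""
    sep_index = path.rfind(sep)     # position of the last path separator
    dot_index = path.rfind(extsep)  # position of the last dot
    # The dot only counts if it comes after the last separator and is not
    # the first character of the file name (as in ".bashrc").
    if dot_index > sep_index + 1:
        return path[:dot_index], path[dot_index:]
    return path, ""


print(simple_splitext("archive.tar.gz"))  # ('archive.tar', '.gz')
print(simple_splitext("/dir.ext/file"))   # ('/dir.ext/file', '')
```

Two calls to rfind locate everything that is needed, without building a list of every dot-separated piece the way split('.') does.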

Answered By: codelogic

Besides being standard and therefore guaranteed to be available, os.path.splitext:

Handles edge cases – like that of a missing extension.
Offers guarantees – Besides correctly returning the extension if one exists, it guarantees that root + ext will always return the full path.
Is cross-platform – in the Python source there are actually three different versions of os.path, and they are called based on which operating system Python thinks you are on.
Is more readable – consider that your version requires users to know that arrays can be indexed with negative numbers.
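The root + ext guarantee mentioned in the list above is easy to check. A short sketch over a few made-up paths:

```python
import os.path

# Hypothetical sample paths covering the usual edge cases.
paths = ["notes.txt", "README", ".bashrc", "archive.tar.gz", "dir.d/file"]

# Split each path and glue the two halves back together.
rejoined = ["".join(os.path.splitext(p)) for p in paths]

print(rejoined == paths)  # True
```

A hand-rolled split('.') has no equivalent property, because the dot is discarded and the root is never returned.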

btw, it should not be any faster.

Answered By: Kenan Banks

1) A simple split('.')[-1] won’t work correctly for a path such as C:\foo.bar\Makefile, so you need to extract the basename first with os.path.basename(), and even then it will fail to handle a file without an extension correctly. os.path.splitext does this under the hood.
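A sketch of point 1, assuming the example path was meant to be `C:\foo.bar\Makefile` and using `ntpath` (the Windows implementation of os.path) so it runs the same way on any system:

```python
import ntpath  # the os.path implementation used on Windows

path = r"C:\foo.bar\Makefile"

# Splitting the full path picks up the dot in the directory name.
naive = path.split(".")[-1]            # 'bar\\Makefile'

# Taking the basename first helps, but the no-extension case is still wrong.
base = ntpath.basename(path)           # 'Makefile'
naive_on_base = base.split(".")[-1]    # 'Makefile'

# splitext handles both problems in one call.
root, ext = ntpath.splitext(path)      # ext == ''

print(repr(naive), repr(naive_on_base), repr(ext))
```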

2) Although os.path.splitext is a cross-platform solution, it’s not ideal. Let’s look at special files with a leading dot, e.g. .cvsignore, .bzrignore, .hgignore (they are very popular in some VCSs as special files). Before Python 2.6, os.path.splitext returned the entire file name as the extension, which does not seem right to me, because in that case the name without the extension is an empty string. Since Python 2.6, a leading dot is treated as part of the root, so splitext('.cvsignore') returns ('.cvsignore', ''). Either way, this is the intended behavior of the Python standard library, but it may not be what the user actually wants.

Answered By: bialix

I am not sure whether Python has been ported to the VMS platform, but assuming it has (*):

Filenames are typically of the format: $device-dir-subdir$filename.$type;$version (**)

I hope you realize that using a narrow-scope method, shaped only by the systems you have been exposed to, isn’t optimal for long-term code maintainability, and that such practices are particularly detrimental when mixing and matching disparate software components in larger software projects.

Essentially, in the latter case the success probability (dependability) is akin to

R(t)=1-(1-Ri)^n

and you can now see how poor/incomplete software implementations result in buggy programs.
More broadly speaking, porting software is difficult precisely due to such bugs.

(*) Hm, a quick search turned up: https://www.vmspython.org
(**) Check out the regex wars over here! https://stackoverflow.com/a/4465456/1574494

Answered By: fgeorgatos