lower() vs. casefold() in string matching and converting to lowercase

Question:

How do I do a case-insensitive string comparison?

From what I understood from Google and the link above, both functions, lower() and casefold(), will convert the string to lowercase, but casefold() will also convert caseless letters, such as the German ß, to ss.

All of that is about Greek letters, but my questions in general are:

  • are there any other differences?
  • which one is better to convert to lowercase?
  • which one is better to check the matching strings?

Part 2:

firstString = "der Fluß"
secondString = "der Fluss"

# ß is equivalent to ss
if firstString.casefold() == secondString.casefold():
    print('The strings are equal.')
else:
    print('The strings are not equal.')

In the example above should I use:

lower()     # the result is "not equal", which makes sense to me

Or:

casefold()  # ß becomes ss and the result is "equal". (Since I am a
            # beginner, that still does not make sense to me. I see
            # different strings.)

Asked By: user8454691

Answers:

TL;DR

  • Converting to Lowercase -> lower()
  • Caseless String matching/comparison -> casefold()

casefold() is a text normalization function like lower() that is specifically designed to remove upper- or lower-case distinctions for the purposes of comparison. It is another form of normalizing text that may initially appear to be very similar to lower() because generally, the results are the same. As of Unicode 13.0.0, only ~300 of ~150,000 characters produced differing results when passed through lower() and casefold(). @dlukes’ answer has the code to identify the characters that generate those differing results.

To answer your other two questions:

  • use lower() when you specifically want to ensure a character is lowercase, like for presenting to users or persisting data
  • use casefold() when you want to compare that result to another casefold-ed value.
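
As a minimal sketch of the two bullets above, using the strings from the question (`caseless_equal` is a hypothetical helper name, not a built-in):

```python
def caseless_equal(left: str, right: str) -> bool:
    """Compare two strings while ignoring case distinctions."""
    # Fold both sides; never compare a folded value to an unfolded one.
    return left.casefold() == right.casefold()


first = "der Fluß"
second = "der FLUSS"

# lower() keeps ß as it is, so the lowercased strings still differ.
print(first.lower() == second.lower())   # False

# casefold() maps ß to ss, so the folded strings match.
print(caseless_equal(first, second))     # True

# For display or storage, lower() gives the expected lowercase form.
print("Der Fluß".lower())                # der fluß
```

The folded value is only meant as a comparison key; it is usually not what you want to show to a user.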

Other Material

I suggest you take a closer look into what case folding actually is, so here’s a good start: W3 Case Folding Wiki

Another source:
Elastic.co Case Folding

Edit: I just recently found another very good related answer to a slightly different question here on SO (doing a case-insensitive string comparison)


Performance

Using this snippet, you can get a sense for the performance between the two:

import sys
from timeit import timeit

unicode_codepoints = tuple(map(chr, range(sys.maxunicode)))

def compute_lower():
    return tuple(codepoint.lower() for codepoint in unicode_codepoints)

def compute_casefold():
    return tuple(codepoint.casefold() for codepoint in unicode_codepoints)

timer_repeat = 1000

print(f"time to compute lower on unicode namespace: {timeit(compute_lower, number = timer_repeat) / timer_repeat} seconds")
print(f"time to compute casefold on unicode namespace: {timeit(compute_casefold, number = timer_repeat) / timer_repeat} seconds")

print(f"number of distinct characters from lower: {len(set(compute_lower()))}")
print(f"number of distinct characters from casefold: {len(set(compute_casefold()))}")

Running this shows that the two are nearly identical in both performance and the number of distinct characters returned:

time to compute lower on unicode namespace: 0.137255663 seconds
time to compute casefold on unicode namespace: 0.136321374 seconds
number of distinct characters from lower: 1112719
number of distinct characters from casefold: 1112694

If you run the numbers (about 0.137 seconds spread over 1,114,111 codepoints), that means it takes roughly 1.2e-07 seconds to run the computation on a single character for either function, so there isn’t a performance benefit either way.
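
If you would rather time the two methods on an ordinary string than on the whole codepoint range, something like this rough sketch should do. The sample text and the repeat count are arbitrary assumptions, and the numbers will vary by machine, so no output is shown:

```python
from timeit import timeit

# Arbitrary sample text, chosen only for illustration.
sample = "Der Fluß fließt durch die STRASSE. " * 100
repeats = 1_000

# timeit accepts a zero-argument callable, so bound methods work directly.
lower_seconds = timeit(sample.lower, number=repeats) / repeats
casefold_seconds = timeit(sample.casefold, number=repeats) / repeats

print(f"lower():    {lower_seconds:.3e} seconds per call")
print(f"casefold(): {casefold_seconds:.3e} seconds per call")
```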

Answered By: David Culbreth

lower() vs. casefold() and when to use each; details are given below.

str.lower | str.casefold
It identifies the ASCII characters in the given string and converts them to lower case. | It identifies the Unicode characters in the given string and converts them to lower case.
The ASCII standard contains 256 characters. | The Unicode standard contains 143,859 characters.
It’s less effective while comparing two strings, as it can only lowercase 256 characters. | It’s more effective while comparing two strings, as it has a wide range of characters which can be converted to lowercase.

Answered By: Arpan Saini

Both .lower() and .casefold() act on the full range of Unicode codepoints

There’s some confusion in the existing answers, even the accepted one (EDIT: I was referring to this currently outdated version; the current one is fine). The distinction between .lower() and .casefold() has nothing to do with ASCII vs. Unicode: both act on the whole Unicode range of codepoints, just in slightly different ways. Both perform relatively complex mappings which they need to look up in the Unicode database, for instance:

>>> "Ť".lower()
'ť'

Both can involve single-to-multiple codepoint mappings, as we saw with "ß".casefold(). Look what happens to ß when you apply .lower()'s counterpart, .upper():

>>> "ß".upper()
'SS'

And the one example I found where .lower() also does this:

>>> list("İ".lower())
['i', '̇']

So the performance claims, like "lower() will require less memory or less time because there are no lookups, and it’s only dealing with 26 characters it has to transform", are simply not true.
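
To check this yourself, here is a small sketch using only the standard unicodedata module (`describe` is a hypothetical helper name introduced for illustration):

```python
import unicodedata as ud


def describe(text):
    """Return the Unicode name of every codepoint in text."""
    return [ud.name(ch) for ch in text]


# lower() is not limited to ASCII: Latin with caron, Greek, Cyrillic.
lowered = [ch.lower() for ch in ("Ť", "Ω", "Ж")]
print(lowered)

# Single-to-multiple codepoint mappings, spelled out by name.
print(describe("İ".lower()))     # two codepoints: i plus a combining dot
print(describe("ß".casefold()))  # two codepoints: s, s
print(describe("ß".upper()))     # two codepoints: S, S
```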

The vast majority of the time, both operations yield the same thing, but there are a few cases (297 as of Unicode 13.0.0) where they don’t. You can identify them like this:

import sys
import unicodedata as ud

print("Unicode version:", ud.unidata_version, "\n")
total = 0
for codepoint in map(chr, range(sys.maxunicode)):
    lower, casefold = codepoint.lower(), codepoint.casefold()
    if lower != casefold:
        total += 1
        for conversion, converted in zip(
            ("orig", "lower", "casefold"),
            (codepoint, lower, casefold)
        ):
            print(conversion, [ud.name(cp) for cp in converted], converted)
        print()
print("Total differences:", total)

When to use which

The Unicode standard covers lowercasing as part of Default Case Conversion in Section 3.13, and Default Case Folding is described right below that. The first paragraph says:

Case folding is related to case conversion. However, the main purpose of case folding is to contribute to caseless matching of strings, whereas the main purpose of case conversion is to put strings into a particular cased form.

My rule of thumb based on this:

  • Want to display a lowercased version of a string to users? Use .lower().
  • Want to do case-insensitive string comparison? Use .casefold().
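
One place this rule of thumb shows up in practice is de-duplicating or sorting user-supplied names. A minimal sketch, where the word list and the variable names are made up for illustration:

```python
names = ["straße", "Apple", "STRASSE", "apple", "Zebra"]

# Use the folded form as the comparison key and keep the first
# spelling seen for each key (dicts preserve insertion order).
seen = {}
for name in names:
    seen.setdefault(name.casefold(), name)
unique = list(seen.values())
print(unique)                            # ['straße', 'Apple', 'Zebra']

# Case-insensitive sort: fold for ordering, keep the original spelling.
print(sorted(unique, key=str.casefold))  # ['Apple', 'straße', 'Zebra']
```

The original spellings are kept for output; the casefolded strings only ever serve as keys.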

(As a sidenote, I routinely break this rule of thumb and use .lower() across the board, just because it’s shorter to type, the output is overwhelmingly the same, and what differences there are don’t affect the languages I typically come across and work with. Don’t be like me though 😉 )

Just to hammer home that in terms of complexity, both operations are basically the same, they just use slightly different mappings — this is Unicode’s abstract definition of lowercasing:

R2 toLowercase(X): Map each character C in X to Lowercase_Mapping(C).

And this is its abstract definition of case folding:

R4 toCasefold(X): Map each character C in X to Case_Folding(C).
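
As a rough illustration of R4 (a sketch only, with Python's own per-character str.casefold() standing in for Case_Folding(C); `to_casefold` is a hypothetical name), mapping each character separately and joining the results gives the same string as folding the whole thing at once for these samples:

```python
def to_casefold(text: str) -> str:
    """Sketch of R4: map each character C to its case folding, then join."""
    return "".join(ch.casefold() for ch in text)


samples = ["der Fluß", "STRASSE", "Ťažký", "İstanbul"]
for sample in samples:
    folded = to_casefold(sample)
    # Character-by-character mapping agrees with the built-in method here.
    assert folded == sample.casefold()
    print(sample, "->", folded)
```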

In Python’s official documentation

The Python docs are quite clear that this is what the respective methods do; they even point the user to the aforementioned Section 3.13.

They describe .lower() as converting cased characters to lowercase, where cased characters are "those with general category property being one of “Lu” (Letter, uppercase), “Ll” (Letter, lowercase), or “Lt” (Letter, titlecase)". Same with .upper() and uppercase.
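
The general categories mentioned there can be inspected with unicodedata.category(). A short sketch, where the choice of sample characters is mine:

```python
import unicodedata as ud

# "\u01c5" is LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON,
# one of the few titlecase letters.
samples = ["A", "a", "\u01c5", "ß", "1"]
categories = {ch: ud.category(ch) for ch in samples}

for ch, category in categories.items():
    print(repr(ch), category)
# 'A' is Lu, 'a' and 'ß' are Ll, '\u01c5' is Lt, '1' is Nd (not cased)
```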

With .casefold(), the docs explicitly state that it’s meant for "caseless matching", and that it’s "similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string".

Answered By: dlukes

print("∑∂˜∂ˆ´ˆˆçµµ∂˚ß˚ø≤∑∑π".casefold())     #∑∂˜∂ˆ´ˆˆçμμ∂˚ss˚ø≤∑∑π
print("∑∂˜∂ˆ´ˆˆçµµ∂˚ß˚ø≤∑∑π".lower())        #∑∂˜∂ˆ´ˆˆçµµ∂˚ß˚ø≤∑∑π

I was playing around, and only casefold() changed the character ‘ß’; lower() left it as it was. I might just stick with casefold() if it’s more accurate, even slightly.
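
To see exactly which characters the two methods treat differently, a per-character comparison helps. In this sketch (sample string and variable names are my own, with escapes used so the codepoints are unambiguous), the micro sign µ also changes under casefold(), which matches the μμ in the output above:

```python
sample = "\u00e7\u00b5\u00df\u00f8"  # ç, µ (micro sign), ß, ø

# Keep only the characters where lower() and casefold() disagree.
changed = {
    ch: (ch.lower(), ch.casefold())
    for ch in sample
    if ch.lower() != ch.casefold()
}

for ch, (lowered, folded) in changed.items():
    print(f"{ch!r}: lower() -> {lowered!r}, casefold() -> {folded!r}")
```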

Answered By: yousefabuz17