A case insensitive string class in python

Question:

I need to perform case insensitive string comparisons in Python, in sets and in dictionary keys. Creating set and dict subclasses that are case insensitive proves surprisingly tricky (see: Case insensitive dictionary for ideas; note they all use lower, and there’s even a rejected PEP, albeit its scope is a bit broader). So I went with creating a case insensitive string class (leveraging this answer by @AlexMartelli):

class CIstr(unicode):
    """Case insensitive with respect to hashes and comparisons string class"""

    #--Hash/Compare
    def __hash__(self):
        return hash(self.lower())
    def __eq__(self, other):
        if isinstance(other, basestring):
            return self.lower() == other.lower()
        return NotImplemented
    def __ne__(self, other): return not (self == other)
    def __lt__(self, other):
        if isinstance(other, basestring):
            return self.lower() < other.lower()
        return NotImplemented
    def __ge__(self, other): return not (self < other)
    def __gt__(self, other):
        if isinstance(other, basestring):
            return self.lower() > other.lower()
        return NotImplemented
    def __le__(self, other): return not (self > other)

I am fully aware that lower is not really enough to cover all cases of string comparison in unicode, but I am refactoring existing code that used a much clunkier class for string comparisons (memory and speed wise), which used lower() anyway, so I can amend this at a later stage. Also, I am on Python 2 (as seen by unicode). My questions are:

  • did I get the operators right ?

  • is this class enough for my purposes, given that I take care to construct keys in dicts and set elements as CIstr instances – my purposes being checking equality, containment, set differences and similar operations in a case insensitive way. Or am I missing something ?

  • is it worth caching the lower case version of the string (as seen for instance in this ancient python recipe: Case Insensitive Strings)? This comment suggests not, and I want construction to be as fast as possible and size as small as possible, but people seem to include it anyway.

Python 3 compatibility tips are appreciated !

Tiny demo:

d = {CIstr('A'): 1, CIstr('B'): 2}
print 'a' in d # True
s = set(d)
print {'a'} - s # set([])
Asked By: Mr_and_Mrs_D


Answers:

The code mostly looks fine. I would eliminate the shortcuts in __ge__, __le__, and __ne__ and expand them to call lower() directly.

The shortcut looks like what is done in functools.total_ordering(), but it just slows down the code and makes it harder to test cross-type comparisons, which are tricky to get right when the methods are interdependent.
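As a rough illustration of that suggestion (a sketch, not code from the original post), here is a Python 3 port of the class with every operator expanded. It assumes str and isinstance(other, str) in place of unicode and basestring. One reason the expansion matters: in Python 2, `not NotImplemented` evaluates to False, so the shortcut version of __ne__ would report `CIstr('a') != 5` as False.

```python
class CIstr(str):
    """Case insensitive string (hashing and comparisons), Python 3 port."""

    def __hash__(self):
        # Equal objects must hash equally, so hash the lowered form.
        return hash(self.lower())

    # Each operator calls lower() itself and returns NotImplemented for
    # non-strings, so Python can fall back to its default handling.
    def __eq__(self, other):
        if isinstance(other, str):
            return self.lower() == other.lower()
        return NotImplemented

    def __ne__(self, other):
        if isinstance(other, str):
            return self.lower() != other.lower()
        return NotImplemented

    def __lt__(self, other):
        if isinstance(other, str):
            return self.lower() < other.lower()
        return NotImplemented

    def __le__(self, other):
        if isinstance(other, str):
            return self.lower() <= other.lower()
        return NotImplemented

    def __gt__(self, other):
        if isinstance(other, str):
            return self.lower() > other.lower()
        return NotImplemented

    def __ge__(self, other):
        if isinstance(other, str):
            return self.lower() >= other.lower()
        return NotImplemented


print(CIstr('Abc') == 'aBC')  # True
print(CIstr('a') != 5)        # True: both sides decline, default != applies
```

On Python 3, str.casefold() could replace lower() for more thorough Unicode folding; lower() is kept here only to match the question.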

Answered By: Raymond Hettinger

In your demo you are using 'a' to look stuff up in your set. It wouldn’t work if you tried to use 'A', because 'A' has a different hash. Also 'A' in d.keys() would be true, but 'A' in d would be false. You’ve essentially created a type that violates the normal contract of all hashes, by claiming to be equal to objects that have different hashes.
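The mismatch described above can be reproduced with a small Python 3 sketch. The CIstr below is a minimal stand-in for the question's class (only __hash__ and __eq__, with str substituted for unicode and basestring), so treat it as an assumption rather than the original code. Note that in Python 3 d.keys() is a view that also uses hashing, so the list-style scan is written as list(d).

```python
class CIstr(str):
    """Minimal stand-in for the question's class (hash and == only)."""

    def __hash__(self):
        return hash(self.lower())

    def __eq__(self, other):
        if isinstance(other, str):
            return self.lower() == other.lower()
        return NotImplemented


d = {CIstr('A'): 1}

# hash('a') equals hash(CIstr('A')), so the dict finds the slot and then
# confirms the match with ==.
lower_found = 'a' in d

# hash('A') differs from hash('a'), so the dict never gets as far as ==.
upper_found = 'A' in d

# A list scan ignores hashes and uses == only, so the same key "appears".
upper_in_list = 'A' in list(d)

print(lower_found, upper_found, upper_in_list)  # True False True
```

The two objects compare equal (CIstr('A') == 'A') while hashing differently, which is exactly the broken contract.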

You could combine this answer with the answers about creating specialised dicts, and have a dict that converted any possible key into CIstr before trying to look it up. Then all your CIstr conversions could be hidden away inside the dictionary class.

E.g.

class CaseInsensitiveDict(dict):
    def __setitem__(self, key, value):
        super(CaseInsensitiveDict, self).__setitem__(convert_to_cistr(key), value)
    def __getitem__(self, key):
        return super(CaseInsensitiveDict, self).__getitem__(convert_to_cistr(key))
    # __init__, __contains__ etc.

(Based on https://stackoverflow.com/a/2082169/3890632)
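A fuller version of that idea might look like the Python 3 sketch below. The helper convert_to_cistr is named in the snippet above but never defined there, so the implementation here is a guess at its intent (wrap plain strings, pass everything else through). The CIstr class is a minimal stand-in, and the set of overridden methods is illustrative, not exhaustive.

```python
class CIstr(str):
    """Minimal stand-in for the question's class (hash and == only)."""

    def __hash__(self):
        return hash(self.lower())

    def __eq__(self, other):
        if isinstance(other, str):
            return self.lower() == other.lower()
        return NotImplemented


def convert_to_cistr(key):
    """Hypothetical helper: wrap plain strings, leave other keys alone."""
    if isinstance(key, str) and not isinstance(key, CIstr):
        return CIstr(key)
    return key


class CaseInsensitiveDict(dict):
    def __init__(self, *args, **kwargs):
        # dict.__init__ would bypass __setitem__, so route through update().
        super().__init__()
        self.update(*args, **kwargs)

    def __setitem__(self, key, value):
        super().__setitem__(convert_to_cistr(key), value)

    def __getitem__(self, key):
        return super().__getitem__(convert_to_cistr(key))

    def __delitem__(self, key):
        super().__delitem__(convert_to_cistr(key))

    def __contains__(self, key):
        return super().__contains__(convert_to_cistr(key))

    def get(self, key, default=None):
        return super().get(convert_to_cistr(key), default)

    def update(self, *args, **kwargs):
        # dict.update would also bypass __setitem__, so set items one by one.
        for key, value in dict(*args, **kwargs).items():
            self[key] = value


d = CaseInsensitiveDict({'A': 1})
d['b'] = 2
print('a' in d, 'A' in d, d['B'])  # True True 2
```

Because every key is converted on the way in and on the way out, callers can use plain strings in any case. Other methods such as pop and setdefault would need the same treatment.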

Answered By: khelwood

If you are looking for a Python 3 solution, one of the cleanest and easiest ways to solve this is to define a lowercase string class:

>>> class lcstr(str):
...     """Lowercase string"""
...     def __new__(cls, v) -> 'lcstr':
...         return super().__new__(cls, v.lower())
... 
>>> lcstr('Any STRING')
'any string'
>>> type(_)
<class '__main__.lcstr'>

And then just put it in a dict as is:

>>> {lcstr('ONE'): 1}
{'one': 1}
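One thing worth checking with this approach is how lookups behave. The sketch below repeats the class outside the REPL (without the return annotation) and assumes lookups are wrapped in lcstr as well. Since lcstr keeps the ordinary str hash and equality, a plain lowercase string also matches, while a plain mixed-case string does not. The original casing is discarded at construction.

```python
class lcstr(str):
    """Lowercase string: the value is normalized once, at construction."""

    def __new__(cls, v):
        return super().__new__(cls, v.lower())


d = {lcstr('ONE'): 1}

print(d[lcstr('One')])  # 1: the lookup key is lowercased too
print('one' in d)       # True: plain str with the same lowercase value
print('ONE' in d)       # False: plain str is not normalized
```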
Answered By: grihabor