Key-ordered dict in Python
Question:
I am looking for a solid implementation of an ordered associative array, that is, an ordered dictionary. I want the ordering in terms of keys, not of insertion order.
More precisely, I am looking for a space-efficient implementation of an int-to-float (or string-to-float for another use case) mapping structure for which:
- Ordered iteration is O(n)
- Random access is O(1)
The best I came up with was gluing a dict and a list of keys, keeping the last one ordered with bisect and insert.
Any better ideas?
Answers:
An ordered tree is usually better for these cases, but random access is going to be O(log n). You should also take insertion and removal costs into account…
You could build a dict that allows traversal by storing a pair (value, next_key) in each position.
Random access:
    my_dict[k][0]   # for a key k
Traversal:
    def traverse(my_dict, start_key):   # start_key is stored somewhere
        k = start_key
        while k is not None:   # next_key is None at the end of the list
            v, k = my_dict[k]
            yield v
Keep a pointer to start and end and you’ll have efficient update for those cases where you just need to add onto the end of the list.
Inserting in the middle is obviously O(n). Possibly you could build a skip list on top of it if you need more speed.
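To make the start/end pointer idea concrete, here is a minimal sketch. The class name LinkedDict and its append() method are my own, not from any library, and it only handles the cheap case of adding a key that sorts after everything already stored.

```python
class LinkedDict:
    """Dict whose entries are (value, next_key) pairs forming a singly linked list.

    Hypothetical helper: it only supports appending keys in increasing order.
    """

    def __init__(self):
        self._d = {}
        self._start = None  # key of the first entry
        self._end = None    # key of the last entry

    def append(self, key, value):
        """Add a key that sorts after every existing key, in O(1)."""
        if self._end is not None and key <= self._end:
            raise ValueError("append() needs a key larger than the current last key")
        self._d[key] = (value, None)
        if self._end is None:
            self._start = key
        else:
            # Re-point the old tail at the new key.
            old_value, _ = self._d[self._end]
            self._d[self._end] = (old_value, key)
        self._end = key

    def __getitem__(self, key):
        return self._d[key][0]  # O(1) random access

    def items(self):
        """Walk the chain from the start pointer, yielding (key, value) pairs."""
        k = self._start
        while k is not None:
            v, nxt = self._d[k]
            yield k, v
            k = nxt


ld = LinkedDict()
ld.append(1, 1.0)
ld.append(3, 3.3)
ld.append(7, 7.7)
```

Inserting in the middle would still need a scan to find the predecessor, which is where the O(n) comes from.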
I’m not sure which Python version you are working in, but in case you like to experiment, Python 3.1 includes an official implementation of ordered dictionaries:
http://www.python.org/dev/peps/pep-0372/
http://docs.python.org/3.1/whatsnew/3.1.html#pep-372-ordered-dictionaries
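One thing to keep in mind when trying it: the OrderedDict from PEP 372 remembers insertion order, so for key order you still have to sort. A small sketch with made-up data:

```python
from collections import OrderedDict

od = OrderedDict()
od[3] = 3.3
od[1] = 1.0
od[2] = 2.2

insertion_order = list(od)  # keys come back in the order they were added
key_order = sorted(od)      # key order still needs an explicit O(n log n) sort
```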
I had a need for something similar; here is the code. Note, however, that this specific implementation is immutable: there are no inserts once the instance is created. The performance doesn’t quite match what you’re asking for, either: lookup is O(log n) and a full scan is O(n). This works using the bisect module upon a tuple of key/value (tuple) pairs. Even if you can’t use this precisely, you might have some success adapting it to your needs.
import bisect

# Python 2 code: it relies on cmp(), __cmp__ and the iter*/has_key dict API.
class dictuple(object):
    """
    >>> h0 = dictuple()
    >>> h1 = dictuple({"apples": 1, "bananas":2})
    >>> h2 = dictuple({"bananas": 3, "mangoes": 5})
    >>> h1+h2
    ('apples':1, 'bananas':3, 'mangoes':5)
    >>> h1 > h2
    False
    >>> h1 > 6
    False
    >>> 'apples' in h1
    True
    >>> 'apples' in h2
    False
    >>> d1 = {}
    >>> d1[h1] = "salad"
    >>> d1[h1]
    'salad'
    >>> d1[h2]
    Traceback (most recent call last):
    ...
    KeyError: ('bananas':3, 'mangoes':5)
    """
    def __new__(cls, *args, **kwargs):
        initial = {}
        for arg in args:
            initial.update(arg)
        initial.update(kwargs)
        instance = object.__new__(cls)
        instance.__items = tuple(sorted(initial.items(), key=lambda i: i[0]))
        return instance
    def __init__(self, *args, **kwargs):
        pass
    def __find(self, key):
        # (key,) sorts just before any (key, value) pair, so bisect lands on
        # the key's slot when it is present. Returns None for a missing key,
        # including one that sorts past the end of the tuple.
        ind = bisect.bisect(self.__items, (key,))
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return ind
        return None
    def __getitem__(self, key):
        ind = self.__find(key)
        if ind is None:
            raise KeyError(key)
        return self.__items[ind][1]
    def __repr__(self):
        return "({0})".format(", ".join(
            "{0}:{1}".format(repr(item[0]), repr(item[1]))
            for item in self.__items))
    def __contains__(self, key):
        return self.__find(key) is not None
    def __cmp__(self, other):
        return cmp(self.__class__.__name__, other.__class__.__name__
                   ) or cmp(self.__items, other.__items)
    def __eq__(self, other):
        # Guard against comparing with unrelated types.
        return isinstance(other, dictuple) and self.__items == other.__items
    #def __format__(self,key):
    #    pass
    #def __ge__(self,key):
    #    pass
    #def __getattribute__(self,key):
    #    pass
    #def __gt__(self,key):
    #    pass
    __seed = hash("dictuple")
    def __hash__(self):
        return dictuple.__seed ^ hash(self.__items)
    def __iter__(self):
        return self.iterkeys()
    def __len__(self):
        return len(self.__items)
    #def __reduce__(self,key):
    #    pass
    #def __reduce_ex__(self,key):
    #    pass
    #def __sizeof__(self,key):
    #    pass
    @classmethod
    def fromkeys(cls, key, v=None):
        return cls(dict.fromkeys(key, v))
    def get(self, key, default=None):
        ind = self.__find(key)
        return default if ind is None else self.__items[ind][1]
    def has_key(self, key):
        return key in self
    def items(self):
        return list(self.iteritems())
    def iteritems(self):
        return iter(self.__items)
    def iterkeys(self):
        return (i[0] for i in self.__items)
    def itervalues(self):
        return (i[1] for i in self.__items)
    def keys(self):
        return list(self.iterkeys())
    def values(self):
        return list(self.itervalues())
    def __add__(self, other):
        _sum = dict(self.__items)
        _sum.update(other.__items)
        return self.__class__(_sum)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
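If you are on Python 3, the same idea can be sketched much more compactly. This is not a port of the class above, just an illustration of the technique (sorted tuple plus bisect); the name FrozenSortedMap is made up, and it keeps keys and values in two parallel tuples so that bisect only ever compares keys.

```python
import bisect


class FrozenSortedMap:
    """Immutable key-sorted mapping: O(log n) lookup, O(n) ordered scan."""

    def __init__(self, *args, **kwargs):
        initial = dict(*args, **kwargs)
        self._keys = tuple(sorted(initial))
        self._values = tuple(initial[k] for k in self._keys)

    def _find(self, key):
        """Return the index of key, or -1 if it is absent."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return i
        return -1

    def __getitem__(self, key):
        i = self._find(key)
        if i < 0:
            raise KeyError(key)
        return self._values[i]

    def __contains__(self, key):
        return self._find(key) >= 0

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)

    def items(self):
        return zip(self._keys, self._values)


m = FrozenSortedMap({"bananas": 2, "apples": 1})
```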
“Random access O(1)” is an extremely exacting requirement which basically imposes an underlying hash table, and I hope you do mean random READS only, because I think it can be mathematically proven that it’s impossible in the general case to have O(1) writes as well as O(N) ordered iteration.
I don’t think you will find a pre-packaged container suited to your needs because they are so extreme — O(log N) access would of course make all the difference in the world. To get the big-O behavior you want for reads and iterations you’ll need to glue two data structures, essentially a dict and a heap (or sorted list or tree), and keep them in sync. Although you don’t specify, I think you’ll only get amortized behavior of the kind you want – unless you’re truly willing to pay any performance hits for inserts and deletes, which is the literal implication of the specs you express but does seem a pretty unlikely real-life requirement.
For O(1) read and amortized O(N) ordered iteration, just keep a list of all keys on the side of a dict. E.g.:
class Crazy(object):
    def __init__(self):
        self.d = {}
        self.L = []
        self.sorted = True
    def __getitem__(self, k):
        return self.d[k]
    def __setitem__(self, k, v):
        if k not in self.d:
            self.L.append(k)
            self.sorted = False
        self.d[k] = v
    def __delitem__(self, k):
        del self.d[k]
        self.L.remove(k)
    def __iter__(self):
        if not self.sorted:
            self.L.sort()
            self.sorted = True
        return iter(self.L)
If you don’t like the “amortized O(N) order” you can remove self.sorted and just repeat self.L.sort() in __setitem__ itself. That makes writes O(N log N), of course (while I still had writes at O(1)). Either approach is viable and it’s hard to think of one as intrinsically superior to the other. If you tend to do a bunch of writes then a bunch of iterations then the approach in the code above is best; if it’s typically one write, one iteration, another write, another iteration, then it’s just about a wash.
BTW, this takes shameless advantage of the unusual (and wonderful;-) performance characteristics of Python’s sort (aka “timsort”): among them, sorting a list that’s mostly sorted but with a few extra items tacked on at the end is basically O(N) (if the tacked on items are few enough compared to the sorted prefix part). I hear Java’s gaining this sort soon, as Josh Bloch was so impressed by a tech talk on Python’s sort that he started coding it for the JVM on his laptop then and there. Most systems (including I believe Jython as of today and IronPython too) basically have sorting as an O(N log N) operation, not taking advantage of “mostly ordered” inputs; “natural mergesort”, which Tim Peters fashioned into Python’s timsort of today, is a wonder in this respect.
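If you want to see the “mostly sorted” effect for yourself, one rough way is to count comparisons instead of timing. The sketch below assumes CPython (whose list sort only uses `<`); the Counted wrapper and count_comparisons helper are made up for the experiment, and the sizes are arbitrary.

```python
import random


class Counted:
    """Wraps a value and counts every `<` comparison made while sorting."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(values):
    """Sort a wrapped copy of values and return how many comparisons it took."""
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons


rng = random.Random(0)
n = 10000
prefix = list(range(0, 2 * n, 2))                 # already sorted
tail = [rng.randrange(2 * n) for _ in range(10)]  # a few items tacked on the end
mostly_sorted = prefix + tail

shuffled = mostly_sorted[:]
rng.shuffle(shuffled)

almost = count_comparisons(mostly_sorted)  # grows roughly linearly with n
rand = count_comparisons(shuffled)         # grows roughly like n log n
```

On the mostly sorted input the comparison count stays close to the length of the list, while the shuffled copy needs several times more.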
Here is my own implementation:
import bisect

class KeyOrderedDict(object):
    __slots__ = ['d', 'l']
    def __init__(self, *args, **kwargs):
        # Accept the same arguments as dict() instead of silently dropping *args.
        self.d = dict(*args, **kwargs)
        self.l = sorted(self.d)
    def __setitem__(self, k, v):
        if k not in self.d:
            idx = bisect.bisect(self.l, k)
            self.l.insert(idx, k)
        self.d[k] = v
    def __getitem__(self, k):
        return self.d[k]
    def __delitem__(self, k):
        del self.d[k]   # raises KeyError first, so self.l stays intact for a missing key
        idx = bisect.bisect_left(self.l, k)
        del self.l[idx]
    def __iter__(self):
        return iter(self.l)
    def __contains__(self, k):
        return k in self.d
The use of bisect keeps self.l ordered, and insertion is O(n) because of the insert. That is not a killer in my case: I append far more often than I truly insert, so the usual case is an O(log n) bisect followed by an amortized O(1) append. Access is O(1), and iteration is O(n). But maybe someone has invented (in C) something with a cleverer structure?
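To illustrate why the append-heavy case is cheap, here is a tiny sketch with made-up keys: when the new key is larger than everything stored, bisect returns len(list), so the insert degenerates into an append and nothing has to shift.

```python
import bisect

keys = [1, 2, 5]

# New largest key: the insertion point is the end of the list.
idx_end = bisect.bisect(keys, 9)
at_end = idx_end == len(keys)
keys.insert(idx_end, 9)   # same effect as keys.append(9)

# A key in the middle: every element after idx_mid has to move over.
idx_mid = bisect.bisect(keys, 3)
keys.insert(idx_mid, 3)
```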
The ordereddict package ( http://anthon.home.xs4all.nl/Python/ordereddict/ ) that I implemented back in 2007 includes sorteddict. sorteddict is a KSO (Key Sorted Order) dictionary. It is implemented in C, is very space efficient, and is several times faster than a pure Python implementation. The downside is that it only works with CPython.
>>> from _ordereddict import sorteddict
>>> x = sorteddict()
>>> x[1] = 1.0
>>> x[3] = 3.3
>>> x[2] = 2.2
>>> print x
sorteddict([(1, 1.0), (2, 2.2), (3, 3.3)])
>>> for i in x:
... print i, x[i]
...
1 1.0
2 2.2
3 3.3
>>>
Sorry for the late reply, maybe this answer can help others find that library.
For the “string to float” problem you can use a Trie: it provides O(1) access time with respect to the number of keys (lookup cost depends only on the key length) and O(n) sorted iteration. By “sorted” I mean “sorted alphabetically by key”, which seems to be what the question implies.
Some implementations (each with its own strong and weak points):
- https://github.com/biopython/biopython has Bio.trie module with a full-featured Trie; other Trie packages are more memory-efficient;
- https://github.com/kmike/datrie – random insertions could be slow, keys alphabet must be known in advance;
- https://github.com/kmike/hat-trie – all operations are fast, but many dict methods are not implemented; underlying C library supports sorted iteration, but it is not implemented in a wrapper;
- https://github.com/kmike/marisa-trie – very memory efficient, but doesn’t support insertions; iteration is not sorted by default but can be made sorted (there is an example in docs);
- https://github.com/kmike/DAWG – can be seen as a minimized Trie; very fast and memory efficient, but doesn’t support insertions; has size limits (several GB of data)
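To show what a trie gives you here, below is a toy pure-Python sketch. It is not how any of the packages above are implemented (they are backed by compact C/C++ structures); the Trie and _Node names are made up. Lookup walks one node per character, and a pre-order walk that visits children in sorted order yields keys alphabetically.

```python
_MISSING = object()  # sentinel: node exists only as a prefix, holds no value


class _Node:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}   # char -> _Node
        self.value = _MISSING


class Trie:
    """Toy string-to-value trie with key-sorted iteration (illustration only)."""

    def __init__(self):
        self._root = _Node()

    def __setitem__(self, key, value):
        node = self._root
        for ch in key:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = node.children[ch] = _Node()
            node = nxt
        node.value = value

    def __getitem__(self, key):
        node = self._root
        for ch in key:   # cost depends on len(key), not on the number of keys
            node = node.children.get(ch)
            if node is None:
                raise KeyError(key)
        if node.value is _MISSING:
            raise KeyError(key)
        return node.value

    def items(self):
        """Yield (key, value) pairs in lexicographic key order."""
        stack = [("", self._root)]
        while stack:
            prefix, node = stack.pop()
            if node.value is not _MISSING:
                yield prefix, node.value
            # Push children in reverse so the smallest character is popped first.
            for ch in sorted(node.children, reverse=True):
                stack.append((prefix + ch, node.children[ch]))


t = Trie()
for word, weight in [("banana", 1.5), ("apple", 0.5), ("app", 2.0), ("band", 3.0), ("b", 4.0)]:
    t[word] = weight
```

The per-node sorted() call is the lazy part of this sketch; real implementations keep children in an already ordered layout.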
The sortedcontainers module provides a SortedDict type that meets your requirements. It basically glues a SortedList and dict type together. The dict provides O(1) lookup and the SortedList provides O(N) iteration (it’s extremely fast). The whole module is pure-Python and has benchmark graphs to back up the performance claims (fast-as-C implementations). SortedDict is also fully tested with 100% coverage and hours of stress testing. It’s compatible with Python 2.6 through 3.4.
Here’s one option that has not been mentioned in other answers, I think:
- Use a binary search tree (Treap/AVL/RB) to keep the mapping.
- Also use a hashmap (aka dictionary) to keep the same mapping (again).
This will provide O(n) ordered traversal (via the tree), O(1) random access (via the hashmap) and O(log n) insertion/deletion (because you need to update both the tree and the hash).
The drawback is the need to keep all the data twice; however, the alternatives which suggest keeping a list of keys alongside a hashmap are not much better in this sense.
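A rough sketch of the tree-plus-hashmap combination, using a treap because it is the shortest of the three to write down. All names here (TreeBackedDict, _split, _merge) are made up, and to save a little space this version keeps only the keys in the tree and the values in the dict, which is a small deviation from storing the full mapping twice.

```python
import random


class _Node:
    __slots__ = ("key", "prio", "left", "right")

    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random heap priority keeps the tree balanced on average
        self.left = None
        self.right = None


def _split(node, key):
    """Split a treap into (keys < key, keys >= key)."""
    if node is None:
        return None, None
    if node.key < key:
        left, right = _split(node.right, key)
        node.right = left
        return node, right
    left, right = _split(node.left, key)
    node.left = right
    return left, node


def _merge(a, b):
    """Merge two treaps where every key in a is smaller than every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:
        a.right = _merge(a.right, b)
        return a
    b.left = _merge(a, b.left)
    return b


def _erase(node, key):
    """Remove key from the treap rooted at node (key must be present)."""
    if node.key == key:
        return _merge(node.left, node.right)
    if key < node.key:
        node.left = _erase(node.left, key)
    else:
        node.right = _erase(node.right, key)
    return node


class TreeBackedDict:
    def __init__(self):
        self._d = {}       # key -> value, O(1) random access
        self._root = None  # treap of keys, expected O(log n) insert/delete

    def __getitem__(self, key):
        return self._d[key]

    def __setitem__(self, key, value):
        if key not in self._d:
            left, right = _split(self._root, key)
            self._root = _merge(_merge(left, _Node(key)), right)
        self._d[key] = value

    def __delitem__(self, key):
        del self._d[key]   # raises KeyError before the tree is touched
        self._root = _erase(self._root, key)

    def __iter__(self):
        """In-order traversal: yields keys in sorted order in O(n)."""
        stack, node = [], self._root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node.key
            node = node.right


tbd = TreeBackedDict()
for k in [5, 1, 4, 2, 3]:
    tbd[k] = k * 1.5
del tbd[4]
```

The O(log n) bounds are expected rather than worst-case with a treap; an AVL or red-black tree would make them strict at the cost of more code.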