In Python, find item in list of dicts, using bisect

Question:

I have a list of dicts, something like this:

test_data = [
    { 'offset':0, 'data':1500 },
    { 'offset':1270, 'data':120 },
    { 'offset':2117, 'data':30 },
    { 'offset':4055, 'data':30000 },
]

The dict items are sorted in the list according to the 'offset' data. The real data could be much longer.

What I want to do is lookup an item in the list given a particular offset value, which is not exactly one of those values, but in that range. So, a binary search is what I want to do.

I am now aware of the Python bisect module, which is a ready-made binary search—great, but not directly usable for this case. I’m just wondering what is the easiest way to adapt bisect to my needs. Here is what I came up with:

import bisect

class dict_list_index_get_member(object):
    def __init__(self, dict_list, member):
        self.dict_list = dict_list
        self.member = member
    def __getitem__(self, index):
        return self.dict_list[index][self.member]
    def __len__(self):
        return self.dict_list.__len__()

test_data_index_get_offset = dict_list_index_get_member(test_data, 'offset')
print bisect.bisect(test_data_index_get_offset, 1900)

It prints:

2

My question is, is this the best way to do what I want, or is there some other simpler, better way?

Asked By: Craig McQueen

||

Answers:

The usual pattern here is similar to sorting by an attribute, decorate, operate, and undecorate. So in this case you’d just need to decorate and then call. However you’d want to avoid doing this since decorate would be O(n) whereas you want this to be O(logn). Therefore I’d consider your method best.

Answered By: Alex Gaynor

When you say the real data could be much longer, does that prevent you from keeping a list of offset values on hand?

offset_values = [i['offset'] for i in test_data]
bisect.bisect(offset_values, 1900)

Your method seems fine to me though.

Answered By: sykora

What you can do is this

class OffsetWithAttributes( object ):
    def __init__( self, offset, **kw ):
        self.offset= offset
        self.attributes= kw
    def __eq__( self, other ):
        return self.offset == other.offset
    def __lt__( self, other ):
        return self.offset < other.offset
    def __le__( self, other ):
        return self.offset <= other.offset
    def __gt__( self, other ):
        return self.offset > other.offset
    def __ge__( self, other ):
        return self.offset >= other.offset
    def __ne__( self, other ):
        return self.offset != other.offset

This should allow you to create a simple list of OffsetWithAttributes instances. The bisect algorithm should be perfectly happy to use the defined operators.

You can use your someOWA.attributes['data'].

Or

    def __getattr__( self, key ):
        return self.attributes[key]

That should make the OffsetWithAttributes more like a dict.

Answered By: S.Lott

You could also use one of Python’s many SortedDict implementations to manage your test_data. A sorted dict sorts the elements by key and maintains a mapping to a value. Some implementations also support a bisect operation on the keys. For example, the Python sortedcontainers module has a SortedDict that meets your requirements.

In your case it would look something like:

from sortedcontainers import SortedDict
offset_map = SortedDict((item['offset'], item['data']) for item in test_data)
index = offset_map.bisect(1275)
key = offset_map.iloc[index]
print offset_map[key]
# 120

The SortedDict type has a bisect function which returns the bisected index of the desired key. With that index, you can lookup the actual key. And with that key you can get the value.

All of these operations are very fast in sortedcontainers which is also conveniently implemented in pure-Python. There’s a performance comparison too which discusses other choices and has benchmark data.

Answered By: GrantJ

tuples work with bisect if you are ok using them instead…

import bisect

offset = 0
data = 1
test_data = [
    (0, 1500),
    (1270, 120),
    (2117, 30),
    (4055, 30000),
]

i = bisect.bisect(test_data, (1900,0))
test_data.insert(i, (1900,0))
print(test_data[i][data])

although because tuples are compared “lexicographically” (left to right) until an element is not equal to the other – you would have to consider if this is desired behaviour

>>> bisect.insort(test_data, (2117,29))
>>> print(test_data)
[(0, 1500), (1270, 120), (2117, 29), (2117, 30), (4055, 30000)]
Answered By: uosjead

Starting in Python 3.10 you can pass a key function as a keyword argument to the bisect functions

>>> bisect.bisect(test_data, 1900, key=lambda x: x["offset"])
2
Answered By: Will Ayd

For range queries over a list of dicts, ducks will perform well. It’s as fast as a binary search because it builds a tree-based index.

pip install ducks

from ducks import Dex

test_data = [
    { 'offset':0, 'data':1500 },
    { 'offset':1270, 'data':120 },
    { 'offset':2117, 'data':30 },
    { 'offset':4055, 'data':30000 },
]

# build index on 'offset'
dex = Dex(test_data, ['offset'])

dex[{'offset': {'>': 1900}}] 
# result: [{'offset': 2117, 'data': 30}, {'offset': 4055, 'data': 30000}]

Ducks can also search by multiple attributes, e.g.:

# build a Dex on 'offset' and 'data'
dex = Dex(test_data, ['offset', 'data'])
dex[{'offset': {'>': 1900}, 'data': {'<': 50}}]
# result: [{'offset': 2117, 'data': 30}]
Answered By: manimino
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.