How to efficiently filter a large Python list?

Question:

I have a relatively large list of rows called allListings and want to keep only the rows whose element at index 14 equals listingID (that is, row[14] == listingID).

This is the code I am using:
tempRows = list(filter(lambda x: x[14] == listingID, allListings))

The filtering is repeated in a for loop, once for each distinct listingID.

Profiling shows that this line consumes 95% of the runtime in the loop.
Is there any other way to filter large arrays more efficiently?

Asked By: Francis111333


Answers:

As suggested in the comments, if you perform multiple lookups based on the value of this column, you may want to sort and group the rows by that column once up front.

>>> from itertools import groupby
>>> a = [[1, 2, 3, 5],
...      [4, 6, 2, 8],
...      [1, 5, 7, 9],
...      [3, 5, 8, 2]]
>>> b = sorted(a, key=lambda x: x[0])
>>> b
[[1, 2, 3, 5], [1, 5, 7, 9], [3, 5, 8, 2], [4, 6, 2, 8]]
>>> c = groupby(b, key=lambda x: x[0])
>>> c
<itertools.groupby object at 0x106b763e0>
>>> d = {k: list(v) for k, v in c}
>>> d
{1: [[1, 2, 3, 5], [1, 5, 7, 9]], 3: [[3, 5, 8, 2]], 4: [[4, 6, 2, 8]]}

Now, if you need all lists where the first element is 1, you simply need:

>>> d[1]
[[1, 2, 3, 5], [1, 5, 7, 9]]

Or, if you wanted every row except those with 1 in that first position:

>>> [x for k, v in d.items() 
...    if k != 1 
...    for x in v] 
[[3, 5, 8, 2], [4, 6, 2, 8]]

This is a simplified example, but it should carry over to your situation by using x[14] as the key.
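
Applied to the question's data, the same idea might look like the sketch below. Note that it swaps the sort-plus-groupby step for a single pass with collections.defaultdict, which builds the same kind of dictionary without sorting. The helper name index_by_column and the sample rows are made up for illustration; only column 14 and the names allListings, listingID and tempRows come from the question.

```python
from collections import defaultdict


def index_by_column(rows, column):
    """Group rows by the value found at `column`, in one pass (hypothetical helper)."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row)
    return index


# Made-up sample data: 15 columns per row, with the listing ID at index 14.
allListings = [
    [0] * 14 + [101],
    [1] * 14 + [102],
    [2] * 14 + [101],
]

# Build the index once, before the loop over listing IDs.
rows_by_listing = index_by_column(allListings, 14)

for listingID in rows_by_listing:
    # Dictionary lookup instead of scanning allListings again.
    tempRows = rows_by_listing[listingID]
```

Building the index costs one pass over the list, after which each lookup no longer scans allListings. For IDs that may be absent, rows_by_listing.get(listingID, []) avoids the defaultdict behaviour of inserting an empty list on a missing key.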

Answered By: Chris

I got about a 33% improvement by moving the filter into a Cython file and compiling it. I think the primary speedup comes from eliminating the reload of listingID for each comparison, but that is just a guess.

test.pyx

def all_listings_filter(list data, int listingID):
    return [row for row in data if row[14] == listingID]

command line

cython3 test.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python3.10 -o test.so test.c
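
Speedups like this depend heavily on the data and the machine, so it may be worth measuring on your own input. The sketch below uses the standard timeit module to compare the original filter-plus-lambda line with an equivalent pure-Python list comprehension. The sample data and the function names are invented for illustration, and it makes no claim about which variant will be faster for you.

```python
import timeit

# Made-up data: 100,000 rows of 15 columns, listing ID at index 14.
allListings = [[0] * 14 + [i % 100] for i in range(100_000)]
listingID = 42


def with_filter():
    # The line from the question.
    return list(filter(lambda x: x[14] == listingID, allListings))


def with_comprehension():
    # Same selection written as a list comprehension.
    return [row for row in allListings if row[14] == listingID]


# Both variants must select the same rows before timing is meaningful.
assert with_filter() == with_comprehension()

for func in (with_filter, with_comprehension):
    seconds = timeit.timeit(func, number=20)
    print(f"{func.__name__}: {seconds / 20 * 1000:.2f} ms per call")
```

The compiled all_listings_filter could be timed the same way by wrapping the call in a small function and adding it to the tuple.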

Answered By: tdelaney