How to efficiently filter a large Python list?
Question:
I have a relatively large list called allListings and want to keep only the rows where row[14] == listingID.
This is the code I am using:
tempRows = list(filter(lambda x: x[14] == listingID, allListings))
The filtering is repeated in a for loop for each distinct listingID value.
Profiling shows that this line consumes 95% of the loop's runtime.
Is there any other way to filter large arrays more efficiently?
Answers:
As suggested in the comments, if you are performing multiple operations keyed on this column, you may want to sort the list and group it by that column's value once up front.
>>> from itertools import groupby
>>> a = [[1, 2, 3, 5],
... [4, 6, 2, 8],
... [1, 5, 7, 9],
... [3, 5, 8, 2]]
>>> b = sorted(a, key=lambda x: x[0])
>>> b
[[1, 2, 3, 5], [1, 5, 7, 9], [3, 5, 8, 2], [4, 6, 2, 8]]
>>> c = groupby(b, key=lambda x: x[0])
>>> c
<itertools.groupby object at 0x106b763e0>
>>> d = {k: list(v) for k, v in c}
>>> d
{1: [[1, 2, 3, 5], [1, 5, 7, 9]], 3: [[3, 5, 8, 2]], 4: [[4, 6, 2, 8]]}
Now, if you need all lists where the first element is 1, you simply need:
>>> d[1]
[[1, 2, 3, 5], [1, 5, 7, 9]]
Or, if you want everything but 1 in that first position:
>>> [x for k, v in d.items()
... if k != 1
... for x in v]
[[3, 5, 8, 2], [4, 6, 2, 8]]
This is obviously a simplified example, but it should be easy to adapt to your situation.
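Applied to the question's data, the same idea looks like this. A minimal sketch, assuming allListings is a list of rows with the ID at index 14 (rowsByID is a hypothetical name); building the index with collections.defaultdict in a single pass even avoids the sort:

from collections import defaultdict

# Build the index once: one linear pass over allListings,
# grouping rows by the value in column 14.
rowsByID = defaultdict(list)
for row in allListings:
    rowsByID[row[14]].append(row)

# Inside the loop, each lookup is now a dict access
# instead of a full scan of allListings.
tempRows = rowsByID[listingID]

This turns the repeated O(n) scans into one O(n) build plus an O(1) lookup per listingID.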
I got about a 33% improvement by moving the filter to a Cython file and compiling it. I think the primary speedup comes from eliminating the reload of listingID on each comparison, but that is just a guess.
test.pyx
def all_listings_filter(list data, int listingID):
    return [row for row in data if row[14] == listingID]
command line
cython3 test.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python3.10 -o test.so test.c
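Once compiled, the extension imports like any other module. A usage sketch, assuming the resulting test.so sits on the import path and reusing the question's variable names:

import test

# Same call shape as the pure-Python filter, but the loop
# now runs in compiled code.
tempRows = test.all_listings_filter(allListings, listingID)

Alternatively, the cythonize script that ships with the Cython package can perform both build steps in one go (cythonize -i test.pyx).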