Creating a long masking list (Python)

Question:

Here is what I have:

long_list = a very long list of integer values (6M+ entries)

wanted_list = a list of integer values that are of interest (70K entries)

What I need:

mask_list = a list of booleans of the same length as long_list, describing whether each element of long_list in present in the wanted_list (i.e. [is long_list[0] in wanted_list?, is long_list[1] in wanted_list?,….]). The number of’True’ entries in this list should be the same as len(wanted_list)

I got a working code using a for loop, but as expected it’s way too slow for the length of the lists that I am working with (takes several minutes to run):

masklist = []

for element in long_list:
    if element in wanted_list:
        masklist.append(True)
    else: 
        masklist.append(False)

I was wondering if there is a more elegant and fast way to achieve this goal? I was looking into numpy.ma module, but could not think of an elegant way to apply it to this problem

Asked By: jingajinga

||

Answers:

This is slow because the algorithm has a poor complexity (O(n²)). This can be done in O(n):

wanted_set = set(wanted_list)
masklist = [element in wanted_set for element in long_list]

If you operate on Numpy array, then you can use np.unique and np.searchsorted but it will be computed in O(n log n) time. It may be faster for small arrays (due to the Numpy implementation), but not for huge ones (due to the complexity and inefficient cache accesses). np.isin may also be useful and maybe faster.

Answered By: Jérôme Richard
import pandas as pd

long_list = [1,2,3,4]
wanted_list = [1,3,3]
wanted_dict = dict.fromkeys(wanted_list,True)

long_series = pd.Series(long_list)
long_series = long_series.map(wanted_dict)
long_series = long_series.fillna(False)

print(long_series)
Answered By: Mazhar

You can use numpy.isin for this:

masklist = np.isin(long_list, wanted_list)
Answered By: Jab