Pandas isin() not working properly with numerical values

Question:

I have a pandas DataFrame where one column is all floats and another column contains either a list of floats, None, or a single float value. I have ensured all values are floats.

Ultimately, I want to use isin() to check how many records of value_1 are in value_2, but it is not working for me. When I ran the code below:

df[~df['value_1'].isin(df['value_2'])]

It returned the following, which is not expected, since some values in value_1 are clearly in the value_2 lists:

0     88870.0    [88870.0]
1    150700.0    None
2    225000.0    [225000.0, 225000.0]
3    305000.0    [305606.0, 305000.0, 1067.5]
4    392000.0    [392000.0]
5    198400.0    396

What am I missing? Please help.

Asked By: amnesic

||

Answers:

Use zip with a list comprehension to test, for each row, whether the list in value_2 does not contain the float from value_1. Rows where value_2 is not a list are dropped by passing False. Then filter with boolean indexing:

import pandas as pd

df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0],
                   'value_2':[[88870.0],None, [88870.0,45.4]]})

print (df)
    value_1          value_2
0   88870.0        [88870.0]
1  150700.0             None
2  392000.0  [88870.0, 45.4]

mask = [a not in b if isinstance(b, list) else False 
        for a, b in zip(df['value_1'], df['value_2'])]
df1 = df[mask]
print (df1)
    value_1          value_2
2  392000.0  [88870.0, 45.4]

If you also need to test scalars:

mask = [a not in b if isinstance(b, list) else a != b 
        for a, b in zip(df['value_1'], df['value_2'])]
df2 = df[mask]
print (df2)
    value_1          value_2
1  150700.0             None
2  392000.0  [88870.0, 45.4]
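
Since the question asks how many records of value_1 are in value_2, the same row-wise test can be flipped and summed. This is only a sketch on the sample frame from this answer; is_contained is a hypothetical helper name, and it assumes value_2 holds only lists, None, or scalars:

```python
import pandas as pd

# Same sample frame as in the answer above.
df = pd.DataFrame({'value_1': [88870.0, 150700.0, 392000.0],
                   'value_2': [[88870.0], None, [88870.0, 45.4]]})

def is_contained(a, b):
    # Hypothetical helper: True when a is an element of the list b,
    # or equal to the scalar b. None never counts as a match.
    if isinstance(b, list):
        return a in b
    return b is not None and a == b

# One boolean per row, then count the True values.
contained = [is_contained(a, b) for a, b in zip(df['value_1'], df['value_2'])]
n_contained = sum(contained)
print(n_contained)
```

On this sample only row 0 matches, so the count is 1. Passing the same list to df[...] selects the matching rows instead of counting them.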

Performance: pure Python should be faster here, but it is best to test on real data:

#30k rows
N = 10000
df = pd.DataFrame({'value_1':[88870.0,150700.0,392000.0] * N,
                   'value_2':[[88870.0],None, [88870.0,45.4]] * N})

print (df)


In [51]: %timeit df[[a not in b if isinstance(b, list) else a != b  for a, b in zip(df['value_1'], df['value_2'])]]
18.8 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [52]: %timeit df[[not bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]
419 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
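
The %timeit magic only works in IPython. To repeat the comparison in a plain script, something like the following should work with the standard timeit module. It is a rough sketch: N is reduced so it finishes quickly, the function names pure_python and with_np_isin are made up for illustration, and the timings will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

N = 1000  # smaller than the run above so the script finishes quickly
df = pd.DataFrame({'value_1': [88870.0, 150700.0, 392000.0] * N,
                   'value_2': [[88870.0], None, [88870.0, 45.4]] * N})

def pure_python():
    # Plain "in" test for lists, "!=" for scalars and None.
    return df[[a not in b if isinstance(b, list) else a != b
               for a, b in zip(df['value_1'], df['value_2'])]]

def with_np_isin():
    # Same filter, but calling np.isin once per row.
    return df[[not bool(np.isin(v1, v2))
               for v1, v2 in zip(df['value_1'], df['value_2'])]]

# Best of three single runs for each variant.
t_python = min(timeit.repeat(pure_python, number=1, repeat=3))
t_numpy = min(timeit.repeat(with_np_isin, number=1, repeat=3))
print(f"pure Python: {t_python * 1e3:.1f} ms")
print(f"np.isin:     {t_numpy * 1e3:.1f} ms")
```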
Answered By: jezrael

You can use boolean indexing with numpy.isin in a list comprehension. This keeps the rows whose value_1 is found in value_2:

import numpy as np

out = df[[bool(np.isin(v1, v2)) for v1, v2 in zip(df['value_1'], df['value_2'])]]

Output:

    value_1                       value_2
0   88870.0                     [88870.0]
2  225000.0          [225000.0, 225000.0]
3  305000.0  [305606.0, 305000.0, 1067.5]
4  392000.0                    [392000.0]
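
The question asked for the rows that are not found, so the mask can be inverted. A minimal sketch, assuming the frame looks like the table in the question (the data below is reconstructed from it, and in_mask and not_in are just illustrative names):

```python
import numpy as np
import pandas as pd

# Data reconstructed from the table in the question.
df = pd.DataFrame({
    'value_1': [88870.0, 150700.0, 225000.0, 305000.0, 392000.0, 198400.0],
    'value_2': [[88870.0], None, [225000.0, 225000.0],
                [305606.0, 305000.0, 1067.5], [392000.0], 396],
})

# Row-wise membership test, stored as a NumPy array so it can be negated with ~.
in_mask = np.array([bool(np.isin(v1, v2))
                    for v1, v2 in zip(df['value_1'], df['value_2'])])

not_in = df[~in_mask]           # rows whose value_1 is NOT in value_2
print(not_in.index.tolist())
print(int(in_mask.sum()))       # number of rows that do match
```

This should leave rows 1 and 5, the complement of the output above, and report 4 matching rows.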
Answered By: mozway