Filter Dataframe based on a list of codes, but each value of the column in question contains a list of many keys

Question:

I have data (df1) that looks like this:

INC_KEY        AISPREDOT
180008916795   "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
180008916796   "[140655.0, 140694.0]"
180008916797   "[853151.0]"
180008916798   "[110402.0, 140652.0, 150202.0]"
180008916799   "[857300.0]"
180008916800   "[650634.0]"
180008916801   "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"
180008916802   "[816018.0, 854472.0]"
180008916803   "[442200.0, 442202.0, 450203.0]"
180008916804   "[853151.0]"

Where INC_KEY is set as the index. I also have a list of codes:

codes = [110402.0, 854362.0]

As you can see, each index holds a list of codes (AISPREDOT); however, the list is stored in the dataframe as a string. I need to somehow read these strings as lists and then filter df1 to create a new dataframe, df2, that contains only the indices whose list contains at least one of the codes in the list codes.

So the resulting dataframe (df2) would look like this:

INC_KEY        AISPREDOT
180008916795   "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
180008916798   "[110402.0, 140652.0, 150202.0]"
180008916801   "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"

How do I go about achieving this?
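
For reference, the sample data can be rebuilt with something like the sketch below (judging by the display and the answers, the stored strings appear to include the literal double quotes, but that is an assumption):

import pandas as pd

# Small reproduction of df1: AISPREDOT holds list-like strings
df1 = pd.DataFrame(
    {
        "INC_KEY": [180008916795, 180008916798, 180008916799,
                    180008916801, 180008916803],
        "AISPREDOT": [
            '"[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"',
            '"[110402.0, 140652.0, 150202.0]"',
            '"[857300.0]"',
            '"[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"',
            '"[442200.0, 442202.0, 450203.0]"',
        ],
    }
).set_index("INC_KEY")

codes = [110402.0, 854362.0]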

Asked By: Sean Roudnitsky


Answers:

Use ast.literal_eval to convert the strings to lists, then explode the lists and select the matching rows:

import ast
import numpy as np

# Parse each string into a list, explode to one code per row,
# then keep the index labels whose code appears in `codes`
idx = (df['AISPREDOT'].str.strip('"').map(ast.literal_eval).explode()
                      .isin(codes).loc[lambda x: x].index)
out = df.loc[np.unique(idx)]
print(out)

# Output
                                                      AISPREDOT
INC_KEY                                                        
180008916795  "[110402.0, 110602.0, 140651.0, 140694.0, 1504...
180008916798                   "[110402.0, 140652.0, 150202.0]"
180008916801  "[710402.0, 772430.0, 854362.0, 854456.0, 8771...

You can also make the transformation persistent by converting the column to real lists first:

df['AISPREDOT'] = df['AISPREDOT'].str.strip('"').map(ast.literal_eval)
idx = df['AISPREDOT'].explode().isin(codes).loc[lambda x: x].index
out = df.loc[np.unique(idx)]
print(out)

# Output
                                                      AISPREDOT
INC_KEY                                                        
180008916795  [110402.0, 110602.0, 140651.0, 140694.0, 15040...
180008916798                     [110402.0, 140652.0, 150202.0]
180008916801  [710402.0, 772430.0, 854362.0, 854456.0, 87713...
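
Once the column holds real lists, an explode-free variant is a row-wise set intersection (a minimal sketch, assuming df['AISPREDOT'] already contains Python lists as above):

# Keep rows whose list shares at least one value with `codes`
code_set = set(codes)
out = df[df['AISPREDOT'].apply(lambda lst: bool(code_set & set(lst)))]
print(out)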
Answered By: Corralien

Looks like a good use case for a regex and str.contains:

codes = [110402.0, 854362.0]

pattern = fr"\b(?:{'|'.join(map(str, codes))})\b"
# '\b(?:110402.0|854362.0)\b'

out = df.loc[df['AISPREDOT'].str.contains(pattern)]

Output:

        INC_KEY                                                       AISPREDOT
0  180008916795  "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
3  180008916798                                "[110402.0, 140652.0, 150202.0]"
6  180008916801            "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"

regex demo
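
Note that str(110402.0) produces "110402.0", where the dot is a regex wildcard; that is harmless for this data, but escaping the codes is slightly safer (a sketch assuming the same df and codes):

import re

# Escape each code so '.' is matched literally rather than as any character
pattern = fr"\b(?:{'|'.join(map(re.escape, map(str, codes)))})\b"
# '\b(?:110402\.0|854362\.0)\b'

out = df.loc[df['AISPREDOT'].str.contains(pattern)]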

Answered By: mozway