Filter Dataframe based on a list of codes, but each value of the column in question contains a list of many keys
Question:
I have data (df1) that looks like this:
INC_KEY AISPREDOT
180008916795 "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
180008916796 "[140655.0, 140694.0]"
180008916797 "[853151.0]"
180008916798 "[110402.0, 140652.0, 150202.0]"
180008916799 "[857300.0]"
180008916800 "[650634.0]"
180008916801 "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"
180008916802 "[816018.0, 854472.0]"
180008916803 "[442200.0, 442202.0, 450203.0]"
180008916804 "[853151.0]"
Where INC_KEY is set as the index. I also have a list of codes:
codes = [110402.0, 854362.0]
As you can see, each index contains a list of different codes (AISPREDOT), however the list is in the dataframe as a string. I need somehow read these strings as a list, and then filter df1 and create a new dataframe, df2, where df2 contains only the indices that contain at least one of the codes in the list codes.
So the resulting dataframe (df2) would look like this:
INC_KEY AISPREDOT
180008916795 "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
180008916798 "[110402.0, 140652.0, 150202.0]"
180008916801 "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"
How do I go about achieving this?
Answers:
Use ast.literal_eval to convert the strings to lists, then explode the lists and select the matching rows:
import ast
import numpy as np

# Parse each string into a real list, explode to one code per row,
# then keep the index labels where the exploded code is in `codes`
idx = (df['AISPREDOT'].str.strip('"').map(ast.literal_eval).explode()
           .isin(codes).loc[lambda x: x].index)
out = df.loc[np.unique(idx)]
print(out)
# Output
AISPREDOT
INC_KEY
180008916795  "[110402.0, 110602.0, 140651.0, 140694.0, 1504...
180008916798  "[110402.0, 140652.0, 150202.0]"
180008916801  "[710402.0, 772430.0, 854362.0, 854456.0, 8771...
You can also make the transformation persistent:
df['AISPREDOT'] = df['AISPREDOT'].str.strip('"').map(ast.literal_eval)
idx = df['AISPREDOT'].explode().isin(codes).loc[lambda x: x].index
out = df.loc[np.unique(idx)]
print(out)
# Output
AISPREDOT
INC_KEY
180008916795  [110402.0, 110602.0, 140651.0, 140694.0, 15040...
180008916798  [110402.0, 140652.0, 150202.0]
180008916801  [710402.0, 772430.0, 854362.0, 854456.0, 87713...
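Once the column holds real lists, an alternative sketch (same data as the question, DataFrame construction here is assumed for illustration) is to build a boolean mask with a set intersection, which avoids explode and any duplicate-index handling:

```python
import ast
import pandas as pd

df = pd.DataFrame(
    {"AISPREDOT": ['"[110402.0, 110602.0, 140651.0]"',
                   '"[140655.0, 140694.0]"',
                   '"[710402.0, 854362.0, 877131.0]"']},
    index=pd.Index([180008916795, 180008916796, 180008916801], name="INC_KEY"),
)
codes = [110402.0, 854362.0]

# Parse the quoted strings into lists, then keep rows whose
# list shares at least one value with `codes`
parsed = df["AISPREDOT"].str.strip('"').map(ast.literal_eval)
df2 = df[parsed.map(lambda lst: bool(set(lst) & set(codes)))]
print(df2.index.tolist())  # [180008916795, 180008916801]
```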
Looks like a good use case for a regex and str.contains:
import re

codes = [110402.0, 854362.0]
# Escape the codes so "." matches a literal dot, and anchor with word boundaries
pattern = fr"\b(?:{'|'.join(re.escape(str(c)) for c in codes)})\b"
# '\b(?:110402\.0|854362\.0)\b'
out = df.loc[df['AISPREDOT'].str.contains(pattern)]
Output:
INC_KEY AISPREDOT
0 180008916795 "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
3 180008916798 "[110402.0, 140652.0, 150202.0]"
6 180008916801 "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"
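For reference, a minimal runnable sketch of the regex approach on a subset of the sample data (the DataFrame construction here is assumed; column names are as in the question):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "INC_KEY": [180008916795, 180008916798, 180008916799],
    "AISPREDOT": ['"[110402.0, 110602.0, 140651.0]"',
                  '"[110402.0, 140652.0, 150202.0]"',
                  '"[857300.0]"'],
})
codes = [110402.0, 854362.0]

# re.escape keeps the decimal point literal; \b prevents partial-number matches
pattern = rf"\b(?:{'|'.join(re.escape(str(c)) for c in codes)})\b"
out = df.loc[df["AISPREDOT"].str.contains(pattern)]
print(out["INC_KEY"].tolist())  # [180008916795, 180008916798]
```

Note this works directly on the strings without parsing them, at the cost of relying on the codes' exact textual representation (e.g. "110402.0" would not match "110402.00").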