Filter rows from a grouped data frame based on string columns

Question:

I have a data frame grouped by multiple columns but in this example it would be grouped only by Year.

   Year Animal1  Animal2
0  2002    Dog   Mouse,Lion
1  2002  Mouse            
2  2002   Lion            
3  2002   Duck            
4  2010    Dog   Cat
5  2010    Cat            
6  2010   Lion            
7  2010  Mouse      

I would like for each group, from the rows where Animal2 is empty to filter out the rows where Animal2 does not appear in the column Animal1.

The expected output would be:

  Year Animal1   Animal2
0  2002    Dog   Mouse,Lion
1  2002  Mouse            
2  2002   Lion                   
3  2010    Dog   Cat
4  2010    Cat                        

Rows 0 & 3 stayed since Animal2 is not empty.

Rows 1 & 2 stayed since Mouse & Lion are in Animal2 for the first group.

Row 4 stayed since cat appear in Animal2 for the second group

EDIT: I get an error for a similar input data frame

  Year Animal1   Animal2
0  2002    Dog   Mouse
1  2002  Mouse            
2  2002   Lion                   
3  2010    Dog   
4  2010    Cat  
   

The expected output would be:

  Year Animal1   Animal2
0  2002    Dog   Mouse
1  2002  Mouse    
     

The error is triggered in the .apply(lambda g: g.isin(sets[g.name])) part of the code.

  if not any(isinstance(k, slice) for k in key):
    
                if len(key) == self.nlevels and self.is_unique:
                    # Complete key in unique index -> standard get_loc
                    try:
                        return (self._engine.get_loc(key), None)
                    except KeyError as err:
                       raise KeyError(key) from err
                         KeyError: (2010, 'Dog')

           
Asked By: the phoenix

||

Answers:

You can use masks and regexes:

# non empty Animal2
m1 = df['Animal2'].notna()

# make patterns with those Animals2 per Year
patterns = df[m1].groupby('Year')['Animal2'].agg('|'.join).str.replace(',', '|')

# for each Year select with the matching regex
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.str.fullmatch(patterns[g.name]))
     )

out = df.loc[m1|m2]

Or sets:

m1 = df['Animal2'].notna()

sets = (df.loc[m1, 'Animal2'].str.split(',')
          .groupby(df['Year'])
          .agg(lambda x: set().union(*x))
       )

m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.isin(sets[g.name]))
     )

out = df.loc[m1|m2]

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None
Answered By: mozway

Here is a solution using list comprehension

(df.loc[
    [a1 in a2 for a1,a2 in zip(df['Animal1'],df['Year'].map(df['Animal2'].str.split(',').groupby(df['Year']).sum()))] | 
    df['Animal2'].notna()]
    )

or

d = df['Animal2'].str.split(',').groupby(df['Year']).sum()

(df.loc[df.groupby('Year')['Animal1'].transform(lambda x: x.isin(d.loc[x.name])) | 
df['Animal2'].notna()]
)

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None
Answered By: rhug123