Filter out rows based on list of strings in Pandas

Question:

I have a large time series data frame (called df), and the first 5 records look like this:

df

         stn     years_of_data  total_minutes avg_daily TOA_daily   K_daily
date                        
1900-01-14  AlberniElementary      4    5745    34.100  114.600 0.298
1900-01-14  AlberniWeather         6    7129    29.500  114.600 0.257
1900-01-14  Arbutus                8    11174   30.500  114.600 0.266
1900-01-14  Arrowview              7    10080   27.600  114.600 0.241
1900-01-14  Bayside                7    9745    33.800  114.600 0.295

Goal:

I am trying to remove rows where any of the strings in a list
are present in the ‘stn’ column. So,I am basically trying to filter this dataset to not include rows containing any of the strings in following list.

Attempt:

remove_list = ['Arbutus','Bayside']

cleaned = df[df['stn'].str.contains('remove_list')]

Returns:

Out[78]:

stn years_of_data   total_minutes   avg_daily   TOA_daily   K_daily
date    

Nothing!

I have tried a few combinations of quotes, brackets, and even a lambda function; though I am fairly new, so probably not using syntax properly..

Asked By: geokrowding

||

Answers:

Use isin:

cleaned = df[~df['stn'].isin(remove_list)]

In [7]:

remove_list = ['Arbutus','Bayside']
df[~df['stn'].isin(remove_list)]
Out[7]:
                          stn  years_of_data  total_minutes  avg_daily  
date                                                                     
1900-01-14  AlberniElementary              4           5745       34.1   
1900-01-14     AlberniWeather              6           7129       29.5   
1900-01-14          Arrowview              7          10080       27.6   

            TOA_daily  K_daily  
date                            
1900-01-14      114.6    0.298  
1900-01-14      114.6    0.257  
1900-01-14      114.6    0.241  
Answered By: EdChum

Had a similar question, found this old thread, I think there are other ways to get the same result. My issue with @EdChum’s solution for my particular application is that I don’t have a list that will be matched exactly. If you have the same issue, .isin isn’t meant for that application.

Instead, you can also try a few options, including a numpy.where:

  removelist = ['ayside','rrowview']
  df['flagCol'] = numpy.where(df.stn.str.contains('|'.join(remove_list)),1,0)

Note that this solution doesn’t actually remove the matching rows, just flags them. You can copy/slice/drop as you like.

This solution would be useful in the case that you don’t know, for example, if the station names are capitalized or not and don’t want to go through standardizing text beforehand. numpy.where is usually pretty fast as well, probably not much different from .isin.

Answered By: rajan

I just want to add my 2 cents to this very important use case (of filtering out a list of items, indexed by string values). The argument of .isin() method, doesn’t need to be a list! It can be a pd.Series! Then you can do things like this:

df[~df['stn'].isin(another_df['stn_to_remove_column_there'])]

See what I mean? You can use this construct without a .to_list() method.

Answered By: Alex Fedotov
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.