Creating filtered datasets from multiple data frames

Question:

I would like to create filtered datasets from multiple data frames (the data frames differ from each other, as they cover different topics). For each data frame I need to keep only the rows that contain certain key words. For example, from the first data frame I need only the rows that contain certain names (e.g. Michael and Andrew); from the second data frame, only the rows that include the word Laura, and so on.

Original dataframe(s) example

df["0"]

Names    Surnames
Michael  Connelly
John     Smith
Andrew   Star
Laura    Parker

df["1"]

Names    Surnames
Laura    Bistro
Lisa     Roberts
Luke     Gary
Norman   Loren

To do this, I wrote the following:

for i in range(0,1): # I have more than 50 data frames, but I am considering only two for this example
    key_words = [] 

    while True:
        key_word = input("Key word : ")

        if key_word!='0':
            list_key_words.append(key_word)
            dataframe[str(i)].Filter= dataframe[str(i)]..str.contains('|'.join(key_word), case=False, regex=True) # Creates a new column where with boolean values
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]

            filtered=dataframe[str(i)][dataframe[str(i)]. Filter != False] # Create a dataframe/dataset with only filtered rows
            filtered_surnames=filtered['Names'].tolist() # this should select only the column called Names, existing in each dataframe, just for analysing them

Expected output:

df["0"]

Names    Surnames  Filter
Michael  Connelly  1
John     Smith     0
Andrew   Star      1
Laura    Parker    0

df["1"]

Names    Surnames  Filter
Laura    Bistro    1
Lisa     Roberts   0
Luke     Gary      0
Norman   Loren     0

Then, the filtered datasets should have respectively 2 rows and 1 row.

filtered["0"]

Names    Surnames  Filter
Michael  Connelly  1
Andrew   Star      1


filtered["1"]

Names    Surnames  Filter
Laura    Bistro    1

However, it seems that the filtering lines in my code are wrong.
Could you please have a look at them and let me know where the errors are?

Asked By: user12809368


Answers:

list_key_words = []
# BUG 1: range(start, stop) includes start but excludes stop, so to reach index 1 you need range(0, 2)
for i in range(0,2): # I have more than 50 data frames, but I am considering only two for this example
    key_words = [] 

    while True:
        key_word = input("Key word : ")

        if key_word!='0':
            list_key_words.append(key_word)

            # BUG 2.1: you can't apply ".str.contains" to an entire row, you need to indicate the column by name, e.g. "Names". 
            # If you want to test all the columns, you need multiple filter columns which you OR at the end
            # BUG 2.2: You can't create a column using ".Filter", it needs to be "["Filter"]"
            dataframe[str(i)]["Filter"] = dataframe[str(i)]["Names"].str.contains(key_word, case=False, regex=True) # Creates a new column with boolean values

            #BUG 3: this line does nothing
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]


            # BUG 4: you need a way to save these or they will be overwritten each time
            filtered = dataframe[str(i)][dataframe[str(i)]["Filter"] != False] # Create a dataframe/dataset with only the filtered rows
            filtered_surnames = filtered['Names'].tolist() # this selects only the column called Names, existing in each dataframe, for analysis

        # BUG 5: you need to actually leave the "while True" loop at some point
        else:
            break

Comments about the fixes are in the code. The big issue is bug 2.1: you cannot apply the regex to all the fields of a row at once. If you want to check all the fields, you can create a fresh filter column for each field and combine them at the end with boolean logic, e.g. df["Filter 1"] | df["Filter 2"] | ...

Answered By: sigma1510

As much as possible, avoid writing for loops over dataframe rows, as pandas and numpy offer vectorized (faster) methods for many common problems. The solution below pairs each list of search words with its respective dataframe, conducts the search, and collates the results into a coll list.

import numpy as np

# create a list of words per dataframe
list1 = ['Michael', 'Andrew']
list2 = ['Laura']

coll = []
# pair word lists with dataframes
for df, words in zip([df1, df2], (list1, list2)):
    df['Extract'] = np.where(df.Names.str.contains('|'.join(words)), 1, 0)
    coll.append(df)

coll[0]

   Names    Surnames   Extract
0  Michael  Connelly   1
1  John     Smith      0
2  Andrew   Star       1
3  Laura    Parker     0

coll[1]

   Names    Surnames   Extract
0  Laura    Bistro     1
1  Lisa     Roberts    0
2  Luke     Gary       0
3  Norman   Loren      0
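If only the matching rows are wanted afterwards (rather than the full frames with the 1/0 flag), each entry in coll can be sliced on its Extract column. A minimal sketch, reconstructing df1 from the example data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Names": ["Michael", "John", "Andrew", "Laura"],
                    "Surnames": ["Connelly", "Smith", "Star", "Parker"]})
words = ["Michael", "Andrew"]
df1["Extract"] = np.where(df1["Names"].str.contains("|".join(words)), 1, 0)

# Keep only the flagged rows, matching the filtered["0"] output in the question
filtered = df1[df1["Extract"] == 1]
```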
Answered By: sammywemmy