Creating filtered datasets from multiple data frames
Question:
I would like to create filtered datasets based on multiple dataframes (the data frames differ from each other, as the topics are different). For each dataframe I need to filter rows based on some key words. For example, for the first dataframe I need only the rows that contain certain words (e.g. Michael and Andrew); for the second dataframe I need only the rows that include the word Laura, and so on.
Original dataframe(s) example
df["0"]
Names Surnames
Michael Connelly
John Smith
Andrew Star
Laura Parker
df["1"]
Names Surnames
Laura Bistro
Lisa Roberts
Luke Gary
Norman Loren
To do this, I wrote the following:
for i in range(0,1): # I have more than 50 data frames, but I am considering only two for this example
    key_words = []
    while True:
        key_word = input("Key word : ")
        if key_word!='0':
            list_key_words.append(key_word)
            dataframe[str(i)].Filter= dataframe[str(i)]..str.contains('|'.join(key_word), case=False, regex=True) # Creates a new column with boolean values
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]
            filtered=dataframe[str(i)][dataframe[str(i)]. Filter != False] # Create a dataframe/dataset with only the filtered rows
            filtered_surnames=filtered['Names'].tolist() # this should select only the column called Names, which exists in each dataframe, just for analysing it
Expected output:
df["0"]
Names Surnames Filter
Michael Connelly 1
John Smith 0
Andrew Star 1
Laura Parker 0
df["1"]
Names Surnames Filter
Laura Bistro 1
Lisa Roberts 0
Luke Gary 0
Norman Loren 0
Then, the filtered datasets should have respectively 2 rows and 1 row.
filtered["0"]
Names Surnames Filter
Michael Connelly 1
Andrew Star 1
filtered["1"]
Names Surnames Filter
Laura Bistro 1
However, it seems that the lines of code for filtering are wrong in my code.
Could you please have a look at them and let me know where the error is?
Answers:
list_key_words = []
# BUG 1: range(first index included, last index excluded); to get index 1 you need range(0, 2)
for i in range(0, 2): # I have more than 50 data frames, but I am considering only two for this example
    key_words = []
    while True:
        key_word = input("Key word : ")
        if key_word != '0':
            list_key_words.append(key_word)
            # BUG 2.1: you can't apply ".str.contains" to an entire row; you need to indicate the column by name, e.g. "Names".
            # If you want to test all the columns, you need multiple filter columns which you OR at the end.
            # BUG 2.2: you can't create a new column using ".Filter"; it needs to be ["Filter"]
            dataframe[str(i)]["Filter"] = dataframe[str(i)]["Names"].str.contains(key_word, case=False, regex=True) # Creates a new column with boolean values
            # BUG 3: this line does nothing on its own (the result is not assigned)
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]
            # BUG 4: you need a way to save these or they will be overwritten on each iteration
            filtered = dataframe[str(i)][dataframe[str(i)].Filter != False] # Create a dataframe/dataset with only the filtered rows
            filtered_surnames = filtered['Names'].tolist() # this selects only the column called Names, which exists in each dataframe
            # BUG 5: you need to actually leave the "while True" loop at some point
        else:
            break
Comments about the fixes are in the code. The big issue is bug 2.1: you cannot apply the regex to all the fields in a row at once. If you want to check all the fields, make a fresh filter column for each field and then combine them with boolean logic at the end, e.g. df["Filter 1"] | df["Filter 2"] | ...
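A minimal, self-contained sketch of that column-wise OR (the dataframe dict and key words here are invented for illustration, not the asker's exact variables): build one boolean mask per column with str.contains, combine them row-wise with any(axis=1), and keep the rows where the mask is true.

```python
import pandas as pd

# Hypothetical stand-in for the asker's dict of dataframes
dataframes = {
    "0": pd.DataFrame({"Names": ["Michael", "John", "Andrew", "Laura"],
                       "Surnames": ["Connelly", "Smith", "Star", "Parker"]}),
}
key_words = ["Michael", "Andrew"]
pattern = "|".join(key_words)

df = dataframes["0"]
# One boolean mask per column, OR-ed row-wise with any(axis=1)
mask = df.apply(lambda col: col.str.contains(pattern, case=False, regex=True)).any(axis=1)
df["Filter"] = mask.astype(int)
filtered = df[df["Filter"] == 1]  # keeps the Michael and Andrew rows
```

This checks every column at once instead of just "Names"; if only one column matters, replace the apply with a single str.contains on that column.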
As much as possible, avoid writing for loops over a dataframe's rows, as pandas and numpy offer vectorized (faster) methods for many common problems. The solution below pairs the search words with their respective dataframes, runs the search, and collates the results into a coll list.
import numpy as np
import pandas as pd

# create the lists of words you need per df
list1 = ['Michael', 'Andrew']
list2 = ['Laura']

coll = []
# pair the word lists with their dfs
for df, words in zip([df1, df2], (list1, list2)):
    df['Extract'] = np.where(df.Names.str.contains('|'.join(words)), 1, 0)
    coll.append(df)
coll[0]
Names Surnames Extract
0 Michael Connelly 1
1 John Smith 0
2 Andrew Star 1
3 Laura Parker 0
coll[1]
Names Surnames Extract
0 Laura Bistro 1
1 Lisa Roberts 0
2 Luke Gary 0
3 Norman Loren 0
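To then get the filtered datasets the question asks for (2 rows and 1 row respectively), you can subset each frame in coll on its Extract flag. A self-contained sketch that rebuilds the example data first:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Names": ["Michael", "John", "Andrew", "Laura"],
                    "Surnames": ["Connelly", "Smith", "Star", "Parker"]})
df2 = pd.DataFrame({"Names": ["Laura", "Lisa", "Luke", "Norman"],
                    "Surnames": ["Bistro", "Roberts", "Gary", "Loren"]})

coll = []
for df, words in zip([df1, df2], (["Michael", "Andrew"], ["Laura"])):
    df["Extract"] = np.where(df.Names.str.contains("|".join(words)), 1, 0)
    coll.append(df)

# Keep only the flagged rows: one filtered frame per original dataframe
filtered = [df[df["Extract"] == 1] for df in coll]
```

Here filtered[0] holds the Michael and Andrew rows and filtered[1] holds the Laura row, matching the expected output in the question.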