Applying conditions to dataframe based on columns specified in a list
Question:
I have the sample df as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'weight': [10, 20, 30, 40, 50],
                   'speed': [100, 120, 140, 160, 180],
                   'distance': [1000, 1100, 1200, 1300, 1400],
                   'cat': ['Y', 'N', 'N', 'N', 'Y']})

And I am applying conditions to my df based on the values given below, in the following way:

speed_margin = 160
weight_margin = 20
distance_margin = 1300
category = 'N'

conditions = np.where((df['speed'] < speed_margin) & (df['cat'] == category)
                      & (df['weight'] > weight_margin) & (df['distance'] < distance_margin))
df1 = df.loc[conditions]
However, the columns for the conditions are not always the same; they are supplied by the user in the form of a list. For example, if:

conditions_list = ['speed', 'distance', 'cat']

I need to automate the conditions code above to include only the columns that are supplied by the user in conditions_list. In this case, since there are only 3 column names in conditions_list (the weight column is missing), the conditions must look like:

conditions = np.where((df['speed'] < speed_margin) & (df['cat'] == category)
                      & (df['distance'] < distance_margin))

If conditions_list were:

conditions_list = ['speed']

then conditions must be:

conditions = np.where(df['speed'] < speed_margin)

How can I make sure the conditions are applied only to the columns that are supplied in the list by the user?
Answers:
One way would be to define each condition as a lambda to be applied to some column, then use pd.DataFrame.transform to check each condition, finally aggregating with pd.DataFrame.all:
import pandas as pd

df = pd.DataFrame({'weight': [10, 20, 30, 40, 50],
                   'speed': [100, 120, 140, 160, 180],
                   'distance': [1000, 1100, 1200, 1300, 1400],
                   'cat': ['Y', 'N', 'N', 'N', 'Y']})

speed_margin = 160
weight_margin = 20
distance_margin = 1300
category = 'N'

conditions_list = ['speed', 'distance', 'cat']

funcs = {
    "speed": lambda x: x < speed_margin,
    "cat": lambda x: x == category,
    "weight": lambda x: x > weight_margin,
    "distance": lambda x: x < distance_margin,
}

conditions = df.transform({col: funcs[col] for col in conditions_list}).all(axis=1)
out = df.loc[conditions]
out:

   weight  speed  distance cat
1      20    120      1100   N
2      30    140      1200   N
PS: if you use np.where, which returns integer indices, it would be safer to use iloc instead of loc; but better still, omit it entirely and use the boolean mask directly as above.
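To make the "use the boolean mask directly" point concrete, here is a minimal sketch of the same idea using functools.reduce to AND the per-column masks together. The unknown-column guard is my own addition (an assumption about how you might want to handle bad user input), not part of the answer above:

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'weight': [10, 20, 30, 40, 50],
                   'speed': [100, 120, 140, 160, 180],
                   'distance': [1000, 1100, 1200, 1300, 1400],
                   'cat': ['Y', 'N', 'N', 'N', 'Y']})

speed_margin = 160
weight_margin = 20
distance_margin = 1300
category = 'N'

funcs = {
    'speed': lambda x: x < speed_margin,
    'cat': lambda x: x == category,
    'weight': lambda x: x > weight_margin,
    'distance': lambda x: x < distance_margin,
}

conditions_list = ['speed', 'distance', 'cat']

# Fail early if the user asks for a column with no defined condition
unknown = set(conditions_list) - funcs.keys()
if unknown:
    raise KeyError(f'no condition defined for: {unknown}')

# One boolean Series per requested column, ANDed together into a single mask
mask = reduce(np.logical_and, (funcs[c](df[c]) for c in conditions_list))
out = df.loc[mask]
```

Because the mask is a boolean Series aligned to df's index, df.loc works directly and no np.where/iloc detour is needed.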
Make a condition mapping which maps columns to the needed subqueries/conditions. That allows a quick DataFrame.query on the requested column list:
cond_map = {'speed': 'speed < 160',
'weight': 'weight > 20',
'distance': 'distance < 1300',
'cat': 'cat == "N"'}
df_ = df.query(' and '.join(cond_map[c] for c in cond_list))
Case #1:
cond_list = ['speed', 'distance', 'cat']
df_ = df.query(' and '.join(cond_map[c] for c in cond_list))
print(df_)
   weight  speed  distance cat
1      20    120      1100   N
2      30    140      1200   N
Case #2:
cond_list = ['speed']
df_ = df.query(' and '.join(cond_map[c] for c in cond_list))
print(df_)
   weight  speed  distance cat
0      10    100      1000   Y
1      20    120      1100   N
2      30    140      1200   N
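A variant of the mapping above, if you would rather not hardcode the threshold values inside the query strings: pd.DataFrame.query resolves Python variables from the calling scope when they are prefixed with '@', so the mapping can reference the same margin variables the question defines. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'weight': [10, 20, 30, 40, 50],
                   'speed': [100, 120, 140, 160, 180],
                   'distance': [1000, 1100, 1200, 1300, 1400],
                   'cat': ['Y', 'N', 'N', 'N', 'Y']})

speed_margin = 160
weight_margin = 20
distance_margin = 1300
category = 'N'

# '@name' inside a query string refers to the local variable 'name'
cond_map = {'speed': 'speed < @speed_margin',
            'weight': 'weight > @weight_margin',
            'distance': 'distance < @distance_margin',
            'cat': 'cat == @category'}

cond_list = ['speed', 'distance', 'cat']
df_ = df.query(' and '.join(cond_map[c] for c in cond_list))
```

This keeps the thresholds in one place, so changing a margin does not require editing the query strings.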