Run and rank all combinations of features in a machine learning model
Question:
I have a train and a test data set, each containing 30 independent features and 1 target feature.
All the features are numerical variables. The train data set looks like the example below; the test data set has the same columns.
Target | col1 | col2 | … | col29 | col30 |
---|---|---|---|---|---|
20 | 12 | 14 | … | 15 | 12 |
25 | 13 | 25 | … | 19 | 19 |
I want to write efficient code that runs every combination of features through a LightGBM regressor and evaluates it on the test data set, to find the combination of features that gives the best (lowest) MAE.
An example of the result output that I am looking for:

Rank | Features_used | MAE |
---|---|---|
1 | col1,col2,col14,col17,col18 | 2.40 |
2 | col4,col5,col15,col19,col24 | 2.50 |
3 | col4,col5,col15,col19,col24,col29,col18,col13 | 2.50 |
… | … | … |
n | worst combination of features | worst MAE |
I have tried passing each combination of features individually and computing the MAE, but this seems inefficient when trying out all the combinations.
```python
import lightgbm

Predict = 'Target'
train = train[['Target', 'col1', 'col2', 'col3', 'col4', 'col5']]
test = test[['Target', 'col1', 'col2', 'col3', 'col4', 'col5']]

X_train = train[train.columns.difference([Predict])]
X_test = test[test.columns.difference([Predict])]
y_train = train[Predict]
y_test = test[Predict]

regressor = lightgbm.LGBMRegressor()
regressor = regressor.fit(X_train, y_train, eval_metric=["MAE"])
y_pred = regressor.predict(X_test)
```
Is there an efficient way to run all the combinations of features and rank the output by MAE?
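For scale, note that the number of non-empty feature subsets grows exponentially with the number of features, so a truly exhaustive search over all 30 columns would mean fitting over a billion models. A quick sanity check of the count:

```python
# Number of non-empty feature subsets for n features is 2**n - 1.
n_features = 30
n_subsets = 2 ** n_features - 1
print(n_subsets)  # 1073741823, i.e. over a billion model fits
```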
Answers:
- The first step is to generate every combination of the features, keeping the "Target" column within every combination.
- The second step is to iterate over every combination: train, predict, compute the MAE, and store it in a dataframe alongside the features used.
- The final step is to sort the dataframe by MAE.
```python
from itertools import compress, product

import lightgbm
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error as mae

# This function yields every combination (subset) of the given items,
# one list per element of the power set.
def combinations(items):
    return (list(set(compress(items, mask))) for mask in product(*[[0, 1]] * len(items)))

def lgbm(train, test, all_columns):
    Predict = 'Target'
    train = train[all_columns]
    test = test[all_columns]
    X_train = train[train.columns.difference([Predict])]
    X_test = test[test.columns.difference([Predict])]
    y_train = train[Predict]
    y_test = test[Predict]
    regressor = lightgbm.LGBMRegressor()
    regressor = regressor.fit(X_train, y_train, eval_metric=["MAE"])
    y_pred = regressor.predict(X_test)
    # Calculate the MAE on the test set
    return mae(y_test, y_pred)

all_columns = ['Target', 'col1', 'col2', 'col3', 'col4', 'col5']

# Every combination of the feature indices 1..len-1; slice from index 1
# to drop the empty subset.
combi_col = list(combinations(np.arange(start=1, stop=len(all_columns))))[1:]

# Iterate over every combination of features, train the model,
# and collect the MAE together with the features used.
rows = []
for columns in combi_col:
    # Index 0 refers to the target column; it must always be included.
    columns = [all_columns[i] for i in columns + [0]]
    error = lgbm(train, test, columns)
    features = ",".join(c for c in columns if c != 'Target')
    rows.append({"Features_used": features, "MAE": error})

d = pd.DataFrame(rows)
# Lower MAE is better, so rank and sort in ascending order.
d['Rank'] = d['MAE'].rank(ascending=True, method='first').astype(int)
d = d.sort_values("MAE", ascending=True)
d = d[["Rank", "Features_used", "MAE"]]
d
```
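As a quick check of the `combinations` helper above (a toy example with made-up column names, not from the original post): for n items it yields all 2**n subsets, including the empty one first, which is why the code slices off the first element.

```python
from itertools import compress, product

def combinations(items):
    return (list(set(compress(items, mask))) for mask in product(*[[0, 1]] * len(items)))

subsets = list(combinations(['col1', 'col2', 'col3']))
print(len(subsets))      # 8 subsets in total (2**3)
print(subsets[0])        # [] -- the empty subset comes first
non_empty = subsets[1:]  # drop the empty subset, as the answer's code does
print(len(non_empty))    # 7
```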