Using GridSearchCV with IsolationForest for finding outliers
Question:
I want to use IsolationForest for finding outliers. I want to find the best parameters for the model with GridSearchCV. The problem is that I always get the same error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
max_features=1.0, max_samples='auto', n_estimators=100,
n_jobs=None, random_state=None, verbose=0, warm_start=False) does not.
It seems the problem is that IsolationForest does not have a score method.
Is there a way to fix this?
Also, is there a way to compute a score for an isolation forest?
This is my code:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
df = pd.DataFrame({'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})
x = df.iloc[:,:-1]
tuned = {'n_estimators':[70,80,100,120,150,200], 'max_samples':['auto', 1,3,5,7,10],
'contamination':['legacy', 'auto'], 'max_features':[1,2,3,4,5,6,7,8,9,10,13,15],
'bootstrap':[True,False], 'n_jobs':[None,1,2,3,4,5,6,7,8,10,15,20,25,30], 'behaviour':['old', 'new'],
'random_state':[None,1,5,10,42], 'verbose':[0,1,2,3,4,5,6,7,8,9,10], 'warm_start':[True,False]}
isolation_forest = GridSearchCV(IsolationForest(), tuned)
model = isolation_forest.fit(x)
list_of_val = [[1,35,3], [3,4,5], [1,4,66], [4,6,1], [135,5,0]]
df['outliers'] = model.predict(x)
df['outliers'] = df['outliers'].map({-1: 'outlier', 1: 'good'})
print(model.best_params_)
print(df)
Answers:
I believe the scoring refers to the GridSearchCV object, not to the IsolationForest.
If it is “None” (the default), GridSearchCV will try to use the estimator's own score method, which, as you state, does not exist. Try passing one of the available scoring metrics suitable for your problem to the GridSearchCV object through its scoring parameter.
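One way to read this suggestion is sketched below. It assumes you have ground-truth labels for at least some points, which the data in the question does not have, so the arrays X and y here are synthetic and purely illustrative. The scorer name outlier_f1 and the small grid are also assumptions, not part of the original code. Labels use the same convention as IsolationForest.predict (-1 for outliers, 1 for inliers).

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.RandomState(0)

# Hypothetical labeled data: 60 inliers around 0 and 6 outliers around 8.
X = np.vstack([rng.normal(0, 1, size=(60, 3)),
               rng.normal(8, 1, size=(6, 3))])
y = np.array([1] * 60 + [-1] * 6)  # -1 = outlier, 1 = inlier

# F1 computed on the outlier class, so -1 is treated as the positive label.
outlier_f1 = make_scorer(f1_score, pos_label=-1)

search = GridSearchCV(
    IsolationForest(random_state=0),
    {"n_estimators": [50, 100], "contamination": [0.05, 0.1]},
    scoring=outlier_f1,
    # Stratify explicitly so that every fold contains some outliers.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)  # y is ignored by IsolationForest.fit but used by the scorer

print(search.best_params_)
print(search.best_score_)
```

Without labels, none of the built-in supervised metrics apply, and a custom scorer is needed instead.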
You need to create your own scoring function, since IsolationForest does not have a built-in score method. Instead, you can use the score_samples function that is available in IsolationForest (it can be considered a proxy for score), create your own scorer as described here, and pass it to GridSearchCV. I have modified your code to do this:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
df = pd.DataFrame({'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})
x = df.iloc[:,:-1]
tuned = {'n_estimators':[70,80], 'max_samples':['auto'],
'contamination':['legacy'], 'max_features':[1],
'bootstrap':[True], 'n_jobs':[None,1,2], 'behaviour':['old'],
'random_state':[None,1,], 'verbose':[0,1,2], 'warm_start':[True]}
def scorer_f(estimator, X):  # your own scorer
    return np.mean(estimator.score_samples(X))

# or you could use a lambda expression as shown below
# scorer = lambda est, data: np.mean(est.score_samples(data))
isolation_forest = GridSearchCV(IsolationForest(), tuned, scoring=scorer_f)
model = isolation_forest.fit(x)
SAMPLE OUTPUT
print(model.best_params_)
{'behaviour': 'old',
'bootstrap': True,
'contamination': 'legacy',
'max_features': 1,
'max_samples': 'auto',
'n_estimators': 70,
'n_jobs': None,
'random_state': None,
'verbose': 1,
'warm_start': True}
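For readers on a newer scikit-learn, here is a hedged variant of the same idea. As far as I know, the behaviour parameter and contamination='legacy' have been removed in later releases, so this sketch leaves them out. The function name mean_score_samples and the reduced grid are my own choices for illustration. The grid skips n_jobs and verbose because they do not change the fitted model, and it skips contamination because score_samples does not depend on it. max_features stays within the three feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame({
    'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
    'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
    'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
    'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})
x = df.iloc[:, :-1]

def mean_score_samples(estimator, X, y=None):
    # Higher score_samples values mean "more normal", so a higher mean
    # is treated as better. This is only a proxy, not a true accuracy.
    return float(np.mean(estimator.score_samples(X)))

param_grid = {
    "n_estimators": [50, 100],
    "max_samples": ["auto", 10],
    "max_features": [1, 2, 3],   # the data has exactly 3 feature columns
}

search = GridSearchCV(
    IsolationForest(random_state=42),  # fixed seed for reproducible folds
    param_grid,
    scoring=mean_score_samples,
    cv=3,
)
search.fit(x)

# GridSearchCV refits the best estimator on all rows, so predict works directly.
labels = search.predict(x)
df["outliers"] = pd.Series(labels, index=df.index).map({-1: "outlier", 1: "good"})

print(search.best_params_)
print(df)
```

Keep in mind that maximizing the mean of score_samples only favors models that rate held-out points as normal. It does not measure how well true outliers are detected.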