'KMeansModel' object has no attribute 'computeCost' in apache pyspark
Question:
I’m experimenting with a clustering model in pyspark. I’m trying to get the mean squared cost of the cluster fit for different values of K:
def meanScore(k, df):
    inputCols = df.columns[:38]
    assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
    kmeans = KMeans().setK(k)
    pipeModel2 = Pipeline(stages=[assembler, kmeans])
    kmeansModel = pipeModel2.fit(df).stages[-1]
    return kmeansModel.computeCost(assembler.transform(df)) / df.count()
When I try to call this function to compute costs for different values of K on the dataframe:
for k in range(20, 100, 20):
    sc = meanScore(k, numericOnly)
    print((k, sc))
I receive an attribute error as
AttributeError: 'KMeansModel' object has no attribute 'computeCost'
I’m fairly new to pyspark and still learning; I sincerely appreciate any help with this. Thanks
Answers:
computeCost is deprecated in Spark 3.0.0; the docs suggest using the evaluator:
Note: Deprecated in 3.0.0. It will be removed in future versions.
Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
As Erkan Sirin mentioned, computeCost is deprecated in recent versions. This may help you solve your problem:
from pyspark.ml.evaluation import ClusteringEvaluator

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing the Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
I hope this helps; you can check the official docs for more information.
Evaluate clustering by computing the Silhouette score (Spark 3.0.1 and above):
from pyspark.ml.evaluation import ClusteringEvaluator

pdt = model.transform(final_data)
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(pdt)
print('Silhouette with squared euclidean distance:')
print(silhouette)
Evaluate clustering with the within-set sum of squared errors (WSSSE), Spark 2.2 to 3.0.0:
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))
As of the current version, 3.1.2, using KMeans as an example, after importing
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
load the data and train the model, then just call ClusteringEvaluator():
# Make predictions
predictions = model.transform(dataset)
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
For KMeans, instead of computeCost you can also use:
wssse = kmeansModel.summary.trainingCost