'KMeansModel' object has no attribute 'computeCost' in apache pyspark

Question:

I’m experimenting with a clustering model in pyspark. I’m trying to get the mean squared cost of the cluster fit for different values of K:

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

def meanScore(k, df):
  inputCols = df.columns[:38]
  assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
  kmeans = KMeans().setK(k)
  pipeModel2 = Pipeline(stages=[assembler, kmeans])
  kmeansModel = pipeModel2.fit(df).stages[-1]
  return kmeansModel.computeCost(assembler.transform(df)) / df.count()

When I try to call this function to compute costs for different values of K on the dataframe:

for k in range(20,100,20):
  sc = meanScore(k,numericOnly)
  print((k,sc))

I receive an attribute error:
AttributeError: 'KMeansModel' object has no attribute 'computeCost'

I’m fairly new to pyspark and am just learning; I sincerely appreciate any help with this. Thanks!

Asked By: kausik sivakumar


Answers:

computeCost is deprecated as of Spark 3.0.0; the docs suggest using the evaluator instead:

Note: Deprecated in 3.0.0. It will be removed in future versions.
Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
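
Applied to the question's setup, the evaluator slots into the same pipeline. Below is a minimal sketch, assuming the same 38 input columns and DataFrame as in the question; silhouetteScore is a hypothetical helper, and note that silhouette measures cluster separation (higher is better, in [-1, 1]) rather than a cost:

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler

def silhouetteScore(k, df):
    # Same assembler + KMeans pipeline as in the question
    assembler = VectorAssembler(inputCols=df.columns[:38], outputCol="features")
    kmeans = KMeans().setK(k)
    model = Pipeline(stages=[assembler, kmeans]).fit(df)
    # Evaluate on the default "features"/"prediction" columns
    predictions = model.transform(df)
    return ClusteringEvaluator().evaluate(predictions)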
Answered By: Erkan Şirin

As Erkan Şirin mentioned, computeCost is deprecated in recent versions. This may help you solve your problem:

from pyspark.ml.evaluation import ClusteringEvaluator

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing the Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

I hope this helps. You can check the official docs for more information.

Answered By: Dhouibi iheb

Evaluate clustering by computing the Silhouette score:

in Spark 3.0.1 and above

from pyspark.ml.evaluation import ClusteringEvaluator

pdt = model.transform(final_data)
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(pdt)
print('Silhouette with squared euclidean distance:')
print(silhouette)

Evaluate clustering with the Within Set Sum of Squared Errors (WSSSE):

in Spark 2.2 to 3.0.0

cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))
Answered By: Biswajyoti Dash

As of the current version, 3.1.2:

Using KMeans as example, after importing

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

After loading the data and training the model, just call ClusteringEvaluator():

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
Answered By: Lorenço

For KMeans, instead of "computeCost" you can also use:

wssse = kmeansModel.summary.trainingCost
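
Applied to the question's function, this avoids scoring the data a second time, since the cost is recorded during training. A minimal sketch, assuming the same pipeline as in the question (trainingCost is exposed on the model's training summary in recent Spark versions):

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

def meanScore(k, df):
    assembler = VectorAssembler(inputCols=df.columns[:38], outputCol="features")
    kmeans = KMeans().setK(k)
    kmeansModel = Pipeline(stages=[assembler, kmeans]).fit(df).stages[-1]
    # trainingCost is the WSSSE on the training data, so no extra pass is needed
    return kmeansModel.summary.trainingCost / df.count()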
Answered By: Dimtar Petkov