Performance decrease with a huge number of columns. PySpark

Question:

I ran into a problem processing a wide Spark DataFrame (about 9000 columns, sometimes more).

Task:

  1. Create a wide DF via groupBy and pivot.
  2. Transform the columns into a vector and feed it to KMeans from pyspark.ml.

So I built the wide frame, created a vector with VectorAssembler, cached it, and trained KMeans on it.

On my PC in standalone mode, assembling takes about 11 minutes and KMeans about 2 minutes for 7 different cluster counts on a ~500×9000 frame. By contrast, the same processing in pandas (pivot the df, iterate over 7 cluster counts) takes less than one minute.
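
Roughly, the pandas version I am comparing against looks like this (a sketch only: df_states_pd as the pandas copy of the data and scikit-learn's KMeans are assumptions here, not code shown above):

# rough pandas equivalent (sketch; df_states_pd is assumed to be a pandas copy of the data)
from sklearn.cluster import KMeans as SkKMeans

wide = (df_states_pd
        .pivot_table(index='User', columns='ObjectPath',
                     values='PropertyFlagValue', aggfunc='max')
        .fillna(0))

costs = []
for k in range(3, 10):  # 7 cluster counts
    costs.append(SkKMeans(n_clusters=k, max_iter=50).fit(wide).inertia_)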

I understand there is overhead and a performance penalty in standalone mode, with caching and so on, but it is still really discouraging.

Could somebody explain how I can avoid this overhead?

How do people work with wide DFs without using VectorAssembler and taking this performance hit?

The more formal question (per SO rules) is: how can I speed up this code?

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

%%time
# build the wide frame: one column per ObjectPath, max PropertyFlagValue per User
tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue':'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s

print(tmp.count(), len(tmp.columns))
552, 9378

%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s

%%time
# elbow search: train KMeans for k = 3..13 and record the cost for each k
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
# differences between consecutive costs; stop when the improvement levels off
rs = [i - j for i, j in zip(lst_levels, lst_levels[1:])]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s

Config:

.config("spark.sql.pivotMaxValues", "100000") 
.config("spark.sql.autoBroadcastJoinThreshold", "-1") 
.config("spark.sql.shuffle.partitions", "4") 
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") 
Asked By: Anton Alekseev


Answers:

VectorAssembler’s transform function processes all the columns and stores metadata on each column in addition to the original data. This takes time, and also takes up RAM.
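
You can see that metadata on the assembled column; a small sketch, using the transformed DataFrame from the question:

# the output column's metadata describes every input column (one attribute each)
meta = transformed.schema['features'].metadata
print(len(meta['ml_attr']['attrs'].get('numeric', [])))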

To put an exact figure on how much things have grown, dump your data frame as parquet before and after the transformation and compare the sizes. In my experience, a feature vector built by VectorAssembler can be around 10x larger than one built by hand or via other feature extraction methods, and that was for a logistic regression with only 10 parameters. Things will get a lot worse with a data set that has as many columns as yours.
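
For example, something along these lines (the paths are placeholders) lets you compare the on-disk footprint before and after assembly:

# sketch: write both frames as parquet and compare directory sizes on disk
tmp.write.mode("overwrite").parquet("/tmp/before_assembly")
assembler.transform(tmp).write.mode("overwrite").parquet("/tmp/after_assembly")
# then compare, e.g.: du -sh /tmp/before_assembly /tmp/after_assembly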

A few suggestions:

  • See if you can build your feature vector another way (see the sketch after this list). I’m not sure how performant this would be in Python, but I’ve gotten a lot of mileage out of this approach in Scala. Comparing logistic regressions (10 params), I’ve seen something like a 5x-6x performance difference between manually built vectors (or vectors built with other extraction methods such as TF-IDF) and VectorAssembler-built ones.
  • See if you can reshape your data to reduce the number of columns that VectorAssembler needs to process.
  • See if increasing the RAM available to Spark helps.
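
To illustrate the first suggestion, here is a rough PySpark sketch of building the feature vector by hand instead of with VectorAssembler (my own experience with this is in Scala, so treat this as an untested outline):

from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

feature_cols = [c for c in tmp.columns if c != 'User']

def to_vector(row):
    # build the dense vector directly from the row values
    d = row.asDict()
    return Row(User=d['User'], features=Vectors.dense([float(d[c]) for c in feature_cols]))

transformed = tmp.rdd.map(to_vector).toDF().cache()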
Answered By: fny

The solution was actually found in a map over the RDD.

  1. First of all, create a map of values for each row.
  2. Also extract all distinct names.
  3. The penultimate step: for each row, look up every distinct name in the row’s map and take its value, or 0 if nothing was found.
  4. Run VectorAssembler on the results.

Advantages:

  1. You don’t have to create a wide dataframe with a huge column count, and so you avoid that overhead. (The run time went from 11 minutes down to 1.)
  2. You still work on the cluster and execute your code within the Spark paradigm.

Example of the code: Scala implementation.
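
Since the question is in PySpark and the linked example is Scala, here is a rough Python sketch of the steps above (an interpretation, not the exact code; the final vector is built directly with Vectors.dense instead of a separate VectorAssembler pass):

from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
from pyspark.sql import functions as F

# 2. all distinct names define the vector layout
names = sorted(r[0] for r in df_states.select('ObjectPath').distinct().collect())

# 1. one list of (ObjectPath, value) pairs per user instead of a 9000-column pivot
pairs = (df_states.groupBy('User', 'ObjectPath')
         .agg(F.max('PropertyFlagValue').alias('val'))
         .rdd
         .map(lambda r: (r['User'], [(r['ObjectPath'], r['val'])]))
         .reduceByKey(lambda a, b: a + b))

# 3-4. look every name up in the row's map (0 if missing) and build the vector
def to_row(user_pairs):
    user, kv = user_pairs
    d = dict(kv)
    return Row(User=user, features=Vectors.dense([float(d.get(n, 0)) for n in names]))

transformed = pairs.map(to_row).toDF().cache()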

Answered By: Anton Alekseev