Fitting sklearn model inside pandas UDF is taking too long (PySpark)

Question:

I have a spark.DataFrame with multiple time series. I want to apply an sklearn model to each time series via a groupby apply. For each time series, the model fit takes approximately 0.05 s, but when I run it inside a pandas UDF it takes much longer than applying it sequentially. Here is an example:

import pandas as pd
import xgboost as xg

def forecaster_spark(data_group: pd.DataFrame) -> pd.DataFrame:
    # Group key, kept for the output
    item_id = data_group["item_id"].iloc[0]
    # Index by the time column and sort chronologically
    data_group = data_group.set_index(pd.DatetimeIndex(data_group["ds"])).sort_index()
    # Extract the time series
    y = data_group["y"].astype(float)
    # transform builds the regressor features (for example, lags)
    X = transform(y)
    model = xg.XGBRegressor(max_depth=50)
    # For each item, this fit takes approximately 0.05 s
    model.fit(X.iloc[:-1], y.iloc[:-1])
    y_pred = model.predict(X.iloc[[-1]])
    return pd.DataFrame({"item_id": [item_id], "y_pred": y_pred})

Here transform is a transformation applied to y to build a regressor matrix; for instance, it can be thought of as a function that builds lagged features.
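
For concreteness, here is one possible shape of such a transform. This is only a sketch, not the question's actual implementation, and the lag count n_lags is an arbitrary choice:

def transform(y: pd.Series, n_lags: int = 7) -> pd.DataFrame:
    # One column per lag; row t only uses values of y strictly before t
    lags = {f"lag_{k}": y.shift(k) for k in range(1, n_lags + 1)}
    return pd.concat(lags, axis=1).fillna(0.0)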

Then I apply this to a spark.DataFrame containing all the data:

predictions = data.groupBy("item_id").applyInPandas(forecaster_spark, schema="item_id string, y_pred double")

When I trigger an action on predictions (for example toPandas() or show(1000)), it takes a long time. When I comment out model.fit in forecaster_spark (returning an arbitrary value as the prediction), it finishes very fast, so the problem is fitting the model inside the UDF. I thought it could be a problem with the partitioning of the DataFrame when using groupby, but I tried several simple UDFs (for example, taking the mean) and the behavior was normal. Here is a function that works perfectly:

def test_function_for_udf(data_group: pd.DataFrame) -> pd.DataFrame:
    """Same function as above, but with the fit method commented out."""
    # Group key, kept for the output
    item_id = data_group["item_id"].iloc[0]
    # Index by the time column and sort chronologically
    data_group = data_group.set_index(pd.DatetimeIndex(data_group["ds"])).sort_index()
    # Extract the time series
    y = data_group["y"].astype(float)
    # transform builds the regressor features (for example, lags)
    X = transform(y)
    model = xg.XGBRegressor(max_depth=50)
    # Note that the fit is commented out
    # model.fit(X.iloc[:-1], y.iloc[:-1])
    return pd.DataFrame({"item_id": [item_id], "mean": [y.mean()]})

Asked By: Andrex


Answers:

  1. If you want it to run faster, try translating this to the pandas API on Spark (pyspark.pandas) instead of plain pandas. (Admittedly, it's not clear that you could use pandas-on-Spark here.)
  2. UDFs generally don't perform well. (They used to not get vectorized, though there has been recent work on this.) If you want better performance, consider rewriting this with .mapPartitions. That lets you keep heavy objects in memory (e.g. model = xg.XGBRegressor(max_depth=50)) instead of creating them only to throw them away, which is what happens in a UDF. See the sketch after this list.
  3. You will also likely get better performance if you use yield instead of return inside .mapPartitions. Spark uses lazy iterators, and this can help with JVM memory pressure. (It also allows data to be spilled from memory.) It isn't a magic bullet, but it could be beneficial depending on the size of your groups.
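
A minimal sketch of the .mapPartitions + yield approach from points 2 and 3. It assumes the data has been repartitioned so that each item's rows land in the same partition, and that transform is the same feature builder as in the question:

import pandas as pd
import xgboost as xg

def forecast_partition(rows):
    # Materialize the partition once and group locally by item
    pdf = pd.DataFrame(list(rows), columns=["item_id", "ds", "y"])
    # Heavy object created once per partition, re-fit across groups
    model = xg.XGBRegressor(max_depth=50)
    for item_id, group in pdf.groupby("item_id"):
        group = group.set_index(pd.DatetimeIndex(group["ds"])).sort_index()
        y = group["y"].astype(float)
        X = transform(y)  # same assumed feature builder as in the question
        model.fit(X.iloc[:-1], y.iloc[:-1])  # each fit retrains from scratch
        # yield one row at a time so Spark can consume the iterator lazily
        yield (item_id, float(model.predict(X.iloc[[-1]])[0]))

predictions = (
    data.select("item_id", "ds", "y")
        .repartition("item_id")
        .rdd
        .mapPartitions(forecast_partition)
        .toDF(["item_id", "y_pred"])
)
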
Answered By: Matt Andruff

You can perform distributed training of XGBoost models using either:

  • xgboost.spark – probably the easiest option, but still experimental and requires xgboost>=1.7.0
  • sparkdl.xgboost

Databricks has some helpful tutorials on this.

If you need other algorithms, MLlib provides distributed versions of many of the most common ones.
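
For example, a minimal sketch with xgboost.spark, assuming xgboost>=1.7.0 and a Spark DataFrame train_df with a vector column "features" and a double column "label":

from xgboost.spark import SparkXGBRegressor

regressor = SparkXGBRegressor(
    features_col="features",
    label_col="label",
    num_workers=4,  # number of Spark tasks used for distributed training
    max_depth=50,
)
model = regressor.fit(train_df)
predictions = model.transform(train_df)

Note that this trains a single model over all the data in a distributed way, which addresses a different need than fitting one small model per time series as in the question.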

Answered By: tomlincr