How do you import and use Spark Koalas in Palantir Foundry?

Question:

How can I import and use the "Koalas: pandas API for Apache Spark" open-source Python package in Palantir Foundry?

I know that you can import packages that aren't available by default through Code Repositories, and I have done this before. Can I follow the same process for the Koalas package, or do I need to take another route?

Asked By: Jeremy David Gamet


Answers:

I was able to use Code Repositories to upload a local clone of the package and then add it to the platform as a library, using the steps detailed here: How to create python libraries and how to import it in palantir foundry
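
Once the library is published and added to a repository, the standalone (pre-Spark 3.2) Koalas package is imported from databricks.koalas. A minimal sketch, with purely illustrative data:

# Sketch of using the standalone Koalas package (pre-Spark 3.2),
# assuming the koalas library has been added to the repository.
import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
kdf["a_squared"] = kdf["a"] ** 2   # pandas-style column assignment
sdf = kdf.to_spark()               # convert back to a regular Spark DataFrame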

However, shortly afterwards Palantir introduced an update that made the Koalas package available natively in the platform. I have not yet had time to try it for any major tasks.

Answered By: Jeremy David Gamet

Koalas is officially included in PySpark as **pandas API on Spark** as of Apache Spark 3.2. In Spark 3.2+, you no longer need to import the standalone koalas package, as it ships with pyspark. The only required action is to add pandas and pyarrow, since these are required dependencies that Code Repositories don't include by default; you can do so via the Libraries tab.
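
In a Python code repository, libraries added via the Libraries tab are recorded as run requirements in the conda recipe. The excerpt below is a sketch only; exact structure and pinned versions in your repository may differ:

# conda_recipe/meta.yaml (illustrative excerpt)
requirements:
  run:
    - python
    - pandas
    - pyarrow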

You can confirm that it works using this test transform:

import pyspark.pandas as ps
from transforms.api import transform_df, Output


@transform_df(
    Output("OUTPUT_DATASET_PATH"),
)
def compute():
    # Build a pandas-on-Spark DataFrame and return it to Foundry as a Spark DataFrame
    psdf = ps.DataFrame(
        {'a': [1, 2, 3, 4, 5, 6],
         'b': [100, 200, 300, 400, 500, 600],
         'c': ["one", "two", "three", "four", "five", "six"]},
        index=[10, 20, 30, 40, 50, 60])
    return psdf.to_spark()
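
For a transform that reads an existing dataset, the same round trip applies. The following is a sketch only: the dataset paths are placeholders and the "value" column name is an assumption about the input schema.

import pyspark.pandas as ps
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("OUTPUT_DATASET_PATH"),
    source_df=Input("INPUT_DATASET_PATH"),
)
def compute(source_df):
    # Wrap the incoming Spark DataFrame in a pandas-on-Spark DataFrame
    psdf = ps.DataFrame(source_df)
    # Use pandas-style operations; "value" is an assumed column name
    psdf["value_doubled"] = psdf["value"] * 2
    # Convert back to a Spark DataFrame for the output dataset
    return psdf.to_spark()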

To confirm that you are using Spark 3.2+ in your Code Repository, please merge any pending upgrade PRs. Prior to Spark 3.2, it was possible to import koalas through the Libraries tab.

Answered By: proggeo