How to transform a Polars DataFrame to a PySpark DataFrame?

Question:

How do I correctly transform a Polars DataFrame to a PySpark DataFrame?

More specifically, every conversion method I’ve tried seems to have problems with columns containing arrays / lists.

Create a Spark DataFrame:

data = [{"id": 1, "strings": ['A', 'C'], "floats": [0.12, 0.43]},
        {"id": 2, "strings": ['B', 'B'], "floats": [0.01]},
        {"id": 3, "strings": ['C'], "floats": [0.09, 0.01]}
        ]

sparkdf = spark.createDataFrame(data)
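
For reference, printSchema() makes the inferred list columns explicit (when inferring from dicts, Spark orders the fields alphabetically; exact output may vary by Spark version):

sparkdf.printSchema()
# root
#  |-- floats: array (nullable = true)
#  |    |-- element: double (containsNull = true)
#  |-- id: long (nullable = true)
#  |-- strings: array (nullable = true)
#  |    |-- element: string (containsNull = true)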

Convert it to Polars:

import pyarrow as pa
import polars as pl
pldf = pl.from_arrow(pa.Table.from_batches(sparkdf._collect_as_arrow()))
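
Note that _collect_as_arrow() is a private PySpark method, so it may change between releases. A quick sanity check of the resulting Polars schema (the exact repr varies by Polars version):

print(pldf.schema)
# expect something like: {'floats': List(Float64), 'id': Int64, 'strings': List(Utf8)}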

Try to convert back to a Spark DataFrame (attempt 1):

spark.createDataFrame(pldf.to_pandas())


TypeError: Can not infer schema for type: <class 'numpy.ndarray'>
TypeError: Unable to infer the type of the field floats.
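
The root cause is easy to see in isolation: to_pandas() materialises each list as a NumPy array inside an object column, and Spark’s schema inference has no rule for numpy.ndarray. A minimal check:

pdf = pldf.to_pandas()
print(type(pdf["floats"].iloc[0]))  # <class 'numpy.ndarray'>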

Try to convert back to a Spark DataFrame (attempt 2):

schema = sparkdf.schema
spark.createDataFrame(pldf.to_pandas(), schema)

TypeError: field floats: ArrayType(DoubleType(), True) can not accept object array([0.12, 0.43]) in type <class 'numpy.ndarray'>
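
One workaround along these lines (not in the original question) is to turn the NumPy arrays back into plain Python lists before handing the frame to Spark; a minimal sketch:

pdf = pldf.to_pandas()
for col in ["strings", "floats"]:  # the list columns
    pdf[col] = pdf[col].map(list)  # ndarray -> plain Python list
spark.createDataFrame(pdf, schema)  # plain lists satisfy the ArrayType fields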

Related: How to transform Spark dataframe to Polars dataframe?

Asked By: ihopethiswillfi


Answers:

DataFrame.transform(func: Callable[..., DataFrame], *args: Any, **kwargs: Any) → pyspark.sql.dataframe.DataFrame: returns a new DataFrame. Concise syntax for chaining custom transformations.

Answered By: Anurag negi

What about

spark.createDataFrame(pldf.to_dicts())
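
If you want to skip Spark’s type inference on the dicts, you can also pass an explicit schema; a minimal sketch, assuming the original sparkdf schema is still in scope:

spark.createDataFrame(pldf.to_dicts(), schema=sparkdf.schema)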

Alternatively you could build it column-wise. Note that createDataFrame does not accept a plain dict of columns, so route it through pandas:

import pandas as pd

spark.createDataFrame(pd.DataFrame({x: y.to_list() for x, y in pldf.to_dict().items()}))

Since the to_dict method returns Polars Series rather than plain lists (to_dict(as_series=False) returns lists directly), the comprehension converts each Series into a regular list that Spark understands.

Answered By: Dean MacGregor

I discovered the right way to do this while reading ‘In-Memory Analytics with Apache Arrow’ by Matthew Topol.

You can do:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_spark = spark.createDataFrame(df_polars.to_pandas())

It’s quite fast.

I also tried converting to an Arrow-backed pandas DataFrame first (i.e. df_polars.to_pandas(use_pyarrow_extension_array=True)), but it does not work: Spark complains that it cannot handle column types such as large strings (Polars Utf8 is stored as Arrow large_string) or unsigned integers.

Leaving spark.sql.execution.arrow.pyspark.enabled unset made the conversion more than ten times slower in my test.
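
To check the speed-up yourself, a rough timing sketch (illustrative only; numbers depend on data size, and for frames with list columns the non-Arrow path may fail outright, as in attempt 1 above):

import time

for enabled in ("true", "false"):
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", enabled)
    start = time.perf_counter()
    spark.createDataFrame(df_polars.to_pandas())
    print(f"arrow enabled={enabled}: {time.perf_counter() - start:.3f}s")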

Answered By: Luca