How to transform a Polars DataFrame to a PySpark DataFrame?
Question:
How do I correctly transform a Polars DataFrame to a PySpark DataFrame?
More specifically, every conversion method I’ve tried seems to have problems parsing columns containing arrays/lists.
# create spark dataframe
data = [
    {"id": 1, "strings": ["A", "C"], "floats": [0.12, 0.43]},
    {"id": 2, "strings": ["B", "B"], "floats": [0.01]},
    {"id": 3, "strings": ["C"], "floats": [0.09, 0.01]},
]
sparkdf = spark.createDataFrame(data)
# convert it to polars
import pyarrow as pa
import polars as pl

# note: _collect_as_arrow() is a private PySpark API and may change between versions
pldf = pl.from_arrow(pa.Table.from_batches(sparkdf._collect_as_arrow()))
# try to convert back to spark dataframe (attempt 1)
spark.createDataFrame(pldf.to_pandas())
TypeError: Can not infer schema for type: <class 'numpy.ndarray'>
TypeError: Unable to infer the type of the field floats.
# try to convert back to spark dataframe (attempt 2)
schema = sparkdf.schema
spark.createDataFrame(pldf.to_pandas(), schema)
TypeError: field floats: ArrayType(DoubleType(), True) can not accept object array([0.12, 0.43]) in type <class 'numpy.ndarray'>
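Both attempts fail for the same underlying reason: pandas frames produced from Arrow data store list-valued cells as numpy arrays, and Spark's schema inference accepts plain Python lists but not ndarrays. A minimal sketch of the problem and a workaround, using a hand-built pandas frame shaped like what pldf.to_pandas() would return (the frame itself is illustrative, not the actual conversion output):

```python
import numpy as np
import pandas as pd

# A frame shaped like pldf.to_pandas(): list-valued cells arrive as
# numpy arrays, which Spark's schema inference does not accept.
pdf = pd.DataFrame({
    "id": [1, 2],
    "floats": [np.array([0.12, 0.43]), np.array([0.01])],
})

print(type(pdf["floats"].iloc[0]))  # <class 'numpy.ndarray'>

# Converting each cell to a plain Python list sidesteps the
# "can not accept object array(...)" error:
pdf["floats"] = pdf["floats"].map(list)
print(type(pdf["floats"].iloc[0]))  # <class 'list'>
```

After the .map(list) step, spark.createDataFrame(pdf, schema) would no longer trip over the ndarray cells.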
relevant: How to transform Spark dataframe to Polars dataframe?
Answers:
What about
spark.createDataFrame(pldf.to_dicts())
Alternatively you could do:
spark.createDataFrame({x: y.to_list() for x, y in pldf.to_dict().items()})
Since the to_dict method returns Polars Series instead of lists, I’m using a comprehension to convert the Series into regular lists, which Spark understands.
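The difference between the two calls is orientation: to_dicts() is row-oriented (one dict per row), while to_dict() is column-oriented (one Series per column; recent Polars versions also accept to_dict(as_series=False) to get plain lists directly). A sketch with hand-written plain-Python stand-ins for the two outputs (the rows/cols values are mine, copied from the question's sample data, not produced by Polars here):

```python
# Stand-in for pldf.to_dicts(): row-oriented, one plain dict per row
rows = [
    {"id": 1, "strings": ["A", "C"], "floats": [0.12, 0.43]},
    {"id": 2, "strings": ["B", "B"], "floats": [0.01]},
]

# Stand-in for pldf.to_dict(as_series=False): column-oriented,
# one plain list per column
cols = {
    "id": [1, 2],
    "strings": [["A", "C"], ["B", "B"]],
    "floats": [[0.12, 0.43], [0.01]],
}

# The two orientations carry the same data: rebuilding rows from the
# columns reproduces the row-oriented form exactly.
rebuilt = [dict(zip(cols, vals)) for vals in zip(*cols.values())]
print(rebuilt == rows)  # True
```

Because every cell in the row-oriented form is a plain Python value, spark.createDataFrame(rows) can infer the schema without tripping over numpy arrays.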
I discovered the right way to do this while reading ‘In-Memory Analytics with Apache Arrow’ by Matthew Topol.
You can do:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_spark = spark.createDataFrame(df_polars.to_pandas())
It’s quite fast.
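For robustness, Spark also exposes a fallback switch so that types the Arrow path cannot handle degrade to the slower non-Arrow conversion instead of raising. A sketch of the combined configuration, assuming a live SparkSession named spark and a Polars frame df_polars (not runnable standalone):

```python
# Arrow-accelerated conversion path
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# If Arrow conversion hits an unsupported type, fall back to the
# slower non-Arrow path instead of raising an error
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

df_spark = spark.createDataFrame(df_polars.to_pandas())
```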
I also tried converting to an Arrow-backed pandas DataFrame first (i.e. df_polars.to_pandas(use_pyarrow_extension_array=True)), but it does not work: Spark complains that it does not know how to handle column types such as large strings (Polars’ UTF8) or unsigned integers.
Not setting spark.sql.execution.arrow.pyspark.enabled to true increases the time more than ten-fold in my test.