How to transform Spark dataframe to Polars dataframe?
Question:
I wonder how I can transform a Spark DataFrame to a Polars DataFrame.
Let's say I have this code in PySpark:
df = spark.sql('''select * from tmp''')
I can easily transform it to a pandas DataFrame using .toPandas().
Is there something similar in Polars, as I need to get a Polars DataFrame for further processing?
Answers:
You can't directly convert from Spark to Polars, but you can go from Spark to pandas, create a dictionary out of the pandas data, and pass it to Polars like this:
import polars as pl

pandas_df = df.toPandas()
data = pandas_df.to_dict('list')
pl_df = pl.DataFrame(data)
As @ritchie46 pointed out, you can use pl.from_pandas() instead of creating a dictionary:
pandas_df = df.toPandas()
pl_df = pl.from_pandas(pandas_df)
Also, as mentioned in @DataPsycho's answer, this may cause an out-of-memory exception for large datasets, because toPandas() collects the data to the driver first. In that case it is better to write to a CSV or Parquet file and then read it back. But avoid repartition(1), because it shuffles all the data into a single partition.
The code I have provided is suitable for datasets that fit in your driver memory. If you have the option to increase the driver memory, you can do so by increasing the value of spark.driver.memory.
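For example (a sketch; the 8g figure is arbitrary, and the setting only takes effect if applied before the driver JVM starts, e.g. via spark-submit --driver-memory or when building a fresh session):

```python
from pyspark.sql import SparkSession

# Hypothetical memory value; size it to your dataset.
# Has no effect on an already-running session.
spark = (SparkSession.builder
         .config("spark.driver.memory", "8g")
         .getOrCreate())
```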
It would be good to know your use case. Heavy transformations should be done either with Spark or with Polars; you should not mix both DataFrames. Whatever Polars can do, Spark can do as well, so do all of your transformations with Spark, then write the result as a CSV or Parquet file. Read the transformed file with Polars and everything will run blazing fast. If you are interested in plotting, read it directly into pandas and use matplotlib instead. So if you have a Spark DataFrame, you can write it as CSV:
(transformed_df
    .repartition(1)
    .write
    .option("header", True)
    .option("delimiter", ",")  # "," is the default anyway
    .csv("<your_path>")
)
Now read it with Polars or pandas using read_csv. If the driver node of your Spark cluster has only a small amount of memory, then transformed_df.toPandas() might fail because of insufficient memory.
Context
PySpark uses Arrow to convert to pandas. Polars is an abstraction over Arrow memory, so we can hijack the API that Spark uses internally to create the Arrow data and use that to create the Polars DataFrame.
TLDR
Given a Spark session we can write:
import pyarrow as pa
import polars as pl
from pyspark.sql import SQLContext

sql_context = SQLContext(spark)

data = [('James', [1, 2])]
spark_df = sql_context.createDataFrame(data=data, schema=["name", "properties"])

# _collect_as_arrow is a private PySpark API and may change between versions
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
print(df)
shape: (1, 2)
┌───────┬────────────┐
│ name ┆ properties │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═══════╪════════════╡
│ James ┆ [1, 2] │
└───────┴────────────┘
Serialization steps
This will actually be faster than the toPandas() provided by Spark itself, because it saves an extra copy.
toPandas() leads to this serialization/copy step:
spark-memory -> arrow-memory -> pandas-memory
With the query provided we have:
spark-memory -> arrow/polars-memory
Polars is not distributed, while Spark is
Note that Polars is a single-machine multi-threaded DataFrame library. Spark in contrast is a multi-machine multi-threaded DataFrame library. So Spark distributes the DataFrame across multiple machines.
Transform a Spark DataFrame with Polars code, scalably
If your dataset requires this feature, because the DataFrame does not fit onto a single machine, then _collect_as_arrow, to_dict and from_pandas do not work for you.
If you want to transform your Spark DataFrame using some Polars code (Spark -> Polars -> Spark), you can do this in a distributed and scalable way using mapInArrow:
import pyarrow as pa
import polars as pl
from typing import Iterator

# The example data as a Spark DataFrame
data = [(1, 1.0), (2, 2.0)]
spark_df = spark.createDataFrame(data=data, schema=['id', 'value'])
spark_df.show()

# Define your transformation on a Polars DataFrame
# Here we multiply the 'value' column by 2
def polars_transform(df: pl.DataFrame) -> pl.DataFrame:
    return df.select([
        pl.col('id'),
        pl.col('value') * 2
    ])

# Convert a part of the Spark DataFrame into a Polars DataFrame and call `polars_transform` on it
def arrow_transform(batches: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    # Transform one RecordBatch at a time so the data fits into memory
    # Increase spark.sql.execution.arrow.maxRecordsPerBatch if batches are too small
    for batch in batches:
        polars_df = pl.from_arrow(pa.Table.from_batches([batch]))
        polars_df_2 = polars_transform(polars_df)
        for b in polars_df_2.to_arrow().to_batches():
            yield b

# Map the Spark DataFrame to Arrow, then to Polars, run the `polars_transform` on it,
# and transform everything back to a Spark DataFrame, all distributed and scalable
spark_df_2 = spark_df.mapInArrow(arrow_transform, schema='id long, value double')
spark_df_2.show()