Is it possible to allow users to download the result of a PySpark DataFrame in FastAPI or Flask?
Question:
I’m working on an API using FastAPI that users can make a request to in order for the following to happen:
- First, a GET request grabs a file from Google Cloud Storage and loads it into a PySpark DataFrame
- Then the application performs some transformations on the DataFrame
- Finally, I want to send the DataFrame back to the user as a single Parquet file for download.
I can’t quite figure out how to deliver the file to the user in Parquet format, for a few reasons:
- df.write.parquet('out/path.parquet') writes the data into a directory at out/path.parquet, which presents a challenge when I try to pass it to starlette.responses.FileResponse
- Passing a single .parquet file that I know exists to starlette.responses.FileResponse just seems to print the binary to my console (as demonstrated in my code below)
- Writing the DataFrame to a BytesIO stream like in pandas seemed promising, but I can’t quite figure out how to do that using any of DataFrame’s methods or DataFrame.rdd’s methods.
Is this even possible in FastAPI? Is it possible in Flask using send_file()?
Here’s the code I have so far. Note that I’ve tried a few things like the commented code to no avail.
import tempfile

from fastapi import APIRouter
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from starlette.responses import FileResponse

router = APIRouter()
sc = SparkContext('local')
spark = SparkSession(sc)
df = spark.read.parquet('gs://my-bucket/sample-data/my.parquet')

@router.get("/applications")
def applications():
    df.write.parquet("temp.parquet", compression="snappy")
    return FileResponse("part-some-compressed-file.snappy.parquet")
    # with tempfile.TemporaryFile() as f:
    #     f.write(df.rdd.saveAsPickleFile("temp.parquet"))
    #     return FileResponse("test.parquet")
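One reason the route above fails is that "part-some-compressed-file.snappy.parquet" is not a real, stable name: Spark writes a directory of part files with generated names plus markers like _SUCCESS. Instead of hard-coding the name, the part file can be located with a glob. The sketch below fakes the directory layout with stand-in names (real part-file names include a task UUID):

```python
import glob
import os
import tempfile

# Simulate the directory layout df.write.parquet() produces
# (file names here are illustrative, not what Spark actually generates)
outdir = tempfile.mkdtemp()
for name in ("_SUCCESS", "part-00000-1234abcd.snappy.parquet"):
    open(os.path.join(outdir, name), "w").close()

# Pick out the data file, ignoring the _SUCCESS marker
parts = glob.glob(os.path.join(outdir, "part-*.parquet"))
print(parts[0])
```

Writing with df.coalesce(1).write.parquet(...) keeps the output to a single part file, so the glob returns exactly one match to hand to FileResponse.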
Thanks!
Edit:
I tried using the answers and info provided here, but I can’t quite get it working.
Answers:
I was able to solve the issue, but it’s far from elegant. If anyone can provide a solution that doesn’t write to disk, I will greatly appreciate it, and will select your answer as the correct one.
I was able to serialize the DataFrame using df.rdd.saveAsPickleFile(), zip the resulting directory, pass it to a Python client, write the resulting zipfile to disk, unzip it, then use SparkContext().pickleFile before finally loading the DataFrame. Far from ideal, I think.
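The zip leg of that workaround can be exercised without Spark at all: shutil.make_archive zips whatever directory saveAsPickleFile produced, and the client can unzip straight from an in-memory buffer rather than writing the zipfile to disk first. A stdlib-only sketch, with a dummy directory standing in for the pickled RDD:

```python
import io
import os
import shutil
import tempfile
import zipfile

# Dummy directory standing in for the output of df.rdd.saveAsPickleFile()
src = tempfile.mkdtemp()
with open(os.path.join(src, "part-00000"), "wb") as f:
    f.write(b"pickled rdd bytes")

# Server side: archive the directory (make_archive appends .zip itself)
archive_base = os.path.join(tempfile.mkdtemp(), "payload")
zip_path = shutil.make_archive(archive_base, "zip", src)

# Client side: unzip directly from an in-memory buffer
with open(zip_path, "rb") as f:
    buf = io.BytesIO(f.read())
dest = tempfile.mkdtemp()
with zipfile.ZipFile(buf) as z:
    z.extractall(dest)

print(open(os.path.join(dest, "part-00000"), "rb").read())  # b'pickled rdd bytes'
```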
API:
import shutil
import tempfile

from fastapi import APIRouter
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from starlette.responses import FileResponse

router = APIRouter()
sc = SparkContext('local')
spark = SparkSession(sc)
df = spark.read.parquet('gs://my-bucket/my-file.parquet')

@router.get("/applications")
def applications():
    # Close the NamedTemporaryFile immediately; only its name is reused
    temp_parquet = tempfile.NamedTemporaryFile()
    temp_parquet.close()
    df.rdd.saveAsPickleFile(temp_parquet.name)
    shutil.make_archive('test', 'zip', temp_parquet.name)
    return FileResponse('test.zip')
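One wrinkle with this route: nothing ever deletes the pickle directory or test.zip, and the fixed 'test' name means concurrent requests would clobber each other. A per-request scratch directory plus a cleanup callable avoids both; in the route itself the cleanup could run after the response is sent via starlette.background.BackgroundTask (FileResponse accepts a background= argument). The deletion pattern alone, in plain stdlib:

```python
import os
import shutil
import tempfile

# Per-request scratch area instead of fixed names like 'test.zip',
# so concurrent requests cannot clobber each other
workdir = tempfile.mkdtemp(prefix="applications-")
zip_path = os.path.join(workdir, "payload.zip")
open(zip_path, "wb").close()  # stand-in for shutil.make_archive output

def cleanup(path=workdir):
    # In the route, pass this as FileResponse(..., background=BackgroundTask(cleanup))
    # so it runs only after the response has been sent
    shutil.rmtree(path, ignore_errors=True)

cleanup()
print(os.path.exists(workdir))  # False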
Client:
import io
import zipfile

import requests
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

response = requests.get("http://0.0.0.0:5000/applications")
file_like_object = io.BytesIO(response.content)
with zipfile.ZipFile(file_like_object) as z:
    z.extractall('temp.data')

rdd = sc.pickleFile("temp.data")
df = spark.createDataFrame(rdd)
print(df.head())