Is it possible to allow users to download the result of a pyspark dataframe in FastAPI or Flask

Question:

I’m working on an API using FastAPI. When a user makes a request, the following should happen:

  1. First, a GET request grabs a file from Google Cloud Storage and loads it into a PySpark DataFrame
  2. Then the application performs some transformations on the DataFrame
  3. Finally, I want to return the transformed DataFrame to the user as a Parquet file they can save to disk.

I can’t quite figure out how to deliver the file to the user in parquet format, for a few reasons:

  • df.write.parquet('out/path.parquet') writes the data into a directory at out/path.parquet, which makes it hard to pass the result to starlette.responses.FileResponse, since that expects a single file path
  • Passing a single .parquet file that I know exists to starlette.responses.FileResponse just seems to print the binary to my console (as demonstrated in my code below)
  • Writing the DataFrame to a BytesIO stream like in pandas seemed promising, but I can’t quite figure out how to do that using any of DataFrame’s methods or DataFrame.rdd’s methods.

Is this even possible in FastAPI? Is it possible in Flask using send_file()?
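For reference, here is a minimal sketch of what a FileResponse-based route could look like if the write is coalesced to a single part file first. The temporary directory, glob pattern, and download filename are illustrative assumptions rather than code from the original question, and this approach still writes to the server’s disk:

import glob
import tempfile

from fastapi import APIRouter
from starlette.responses import FileResponse

router = APIRouter()

@router.get("/applications")
def applications():
    # `df` is assumed to be the PySpark DataFrame loaded elsewhere in the module.
    # Spark always writes a *directory* of part files; coalescing to one
    # partition leaves exactly one .parquet part file inside it.
    out_dir = tempfile.mkdtemp()
    df.coalesce(1).write.mode("overwrite").parquet(out_dir, compression="snappy")

    # Locate that single part file and serve it with a friendlier download name.
    part_file = glob.glob(f"{out_dir}/part-*.parquet")[0]
    return FileResponse(part_file, media_type="application/octet-stream", filename="result.parquet")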

Here’s the code I have so far. Note that I’ve tried a few other things, like the commented-out code below, to no avail.

import tempfile

from fastapi import APIRouter
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from starlette.responses import FileResponse


router = APIRouter()
sc = SparkContext('local')
spark = SparkSession(sc)

df = spark.read.parquet('gs://my-bucket/sample-data/my.parquet')

@router.get("/applications")
def applications():
    df.write.parquet("temp.parquet", compression="snappy")
    return FileResponse("part-some-compressed-file.snappy.parquet")
    # with tempfile.TemporaryFile() as f:
    #     f.write(df.rdd.saveAsPickleFile("temp.parquet"))
    #     return FileResponse("test.parquet")

Thanks!

Edit:
I tried using the answers and info provided here, but I can’t quite get it working.

Asked By: RNHTTR


Answers:

I was able to solve the issue, but it’s far from elegant. If anyone can provide a solution that doesn’t write to disk, I would greatly appreciate it and will select your answer as the correct one.

I was able to serialize the DataFrame with df.rdd.saveAsPickleFile(), zip the resulting directory, and pass the archive to a Python client, which extracts it to disk and then uses SparkContext.pickleFile() to rebuild the DataFrame. Far from ideal, I think.

API:

import shutil
import tempfile

from fastapi import APIRouter
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from starlette.responses import FileResponse


router = APIRouter()
sc = SparkContext('local')
spark = SparkSession(sc)

df = spark.read.parquet('gs://my-bucket/my-file.parquet')

@router.get("/applications")
def applications():
    # NamedTemporaryFile().close() is just a trick to reserve a unique path;
    # the file is deleted on close, leaving the name free for saveAsPickleFile()
    temp_parquet = tempfile.NamedTemporaryFile()
    temp_parquet.close()
    df.rdd.saveAsPickleFile(temp_parquet.name)

    # zip the pickle directory so it can be served as a single file
    shutil.make_archive('test', 'zip', temp_parquet.name)

    return FileResponse('test.zip')

Client:

import io
import zipfile

import requests

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

# download the zipped pickle directory and extract it locally
response = requests.get("http://0.0.0.0:5000/applications")
file_like_object = io.BytesIO(response.content)
with zipfile.ZipFile(file_like_object) as z:
    z.extractall('temp.data')

# rebuild the DataFrame from the extracted pickle files
rdd = sc.pickleFile("temp.data")
df = spark.createDataFrame(rdd)

print(df.head())
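For completeness, the in-memory route hinted at in the original question (writing to a BytesIO buffer) can also work; this is only a sketch under assumptions not stated above: the result must fit in the driver’s memory via toPandas(), and pyarrow must be installed for the pandas Parquet writer.

import io

from fastapi import APIRouter
from starlette.responses import StreamingResponse

router = APIRouter()

@router.get("/applications")
def applications():
    # `df` is assumed to be the already-loaded PySpark DataFrame.
    # Collect it to the driver as pandas and write Parquet bytes into memory (no disk).
    buffer = io.BytesIO()
    df.toPandas().to_parquet(buffer, engine="pyarrow")
    buffer.seek(0)

    # Stream the buffer back so the client downloads a single .parquet file.
    return StreamingResponse(
        buffer,
        media_type="application/octet-stream",
        headers={"Content-Disposition": 'attachment; filename="result.parquet"'},
    )

On the client side, the response body can then be read back with pandas.read_parquet(io.BytesIO(response.content)), with no Spark or zip handling involved.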
Answered By: RNHTTR