Access to Spark from Flask app

Question:

I wrote a simple Flask app to pass some data to Spark. The script works in an IPython Notebook, but not when I try to run it on its own server. I don’t think that the Spark context is running within the script. How do I get Spark working in the following example?

from flask import Flask, request
from pyspark import SparkConf, SparkContext

app = Flask(__name__)

conf = SparkConf()
conf.setMaster("local")
conf.setAppName("SparkContext1")
conf.set("spark.executor.memory", "1g")
sc = SparkContext(conf=conf)

@app.route('/accessFunction', methods=['POST'])
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)

In the IPython Notebook I don’t define the SparkContext because it is configured automatically. I don’t remember exactly how I set that up; I followed a few blog posts.

On the Linux server I have set the .py file to always be running, and I installed the latest Spark by following up to step 5 of this guide.

Edit:

Following the advice from davidism, I have instead resorted to simple programs of increasing complexity to localise the error.

First I created a .py file with just the script from the answer below (after adjusting the paths appropriately):

import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)

This returns “Successfully imported Spark Modules”. However, the next .py file I made raises an exception:

from pyspark import SparkContext
sc = SparkContext('local')
rdd = sc.parallelize([0])
print(rdd.count())

This raises the exception:

“Java gateway process exited before sending the driver its port number”

Searching around for similar problems, I found this page, but when I run that code nothing happens: no output on the console and no error messages. Similarly, this did not help either; I get the same Java gateway exception as above. I have also installed Anaconda, as I heard it may help unite Python and Java, but again no success…
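
One more thing I am considering but have not been able to verify: that error reportedly appears when PySpark cannot find a working Java or Spark installation outside of spark-submit, so a sketch like the following might be worth trying (both paths are assumptions; the Java path in particular is hypothetical and must be adjusted):

import os

# Assumption: adjust both paths to the actual installation locations.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # hypothetical Java location
os.environ["SPARK_HOME"] = "/home/ubuntu/spark-1.5.0-bin-hadoop2.6"

from pyspark import SparkContext

sc = SparkContext('local')
print(sc.parallelize([0]).count())  # should print 1 if the gateway starts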

Any suggestions about what to try next? I am at a loss.

Asked By: Matt


Answers:

Modify your .py file as shown in the linked guide ‘Using IPython Notebook with Spark’, second point. Instead of sys.path.insert, use sys.path.append. Try inserting this snippet:

import sys
try:
    sys.path.append("your/spark/home/python")
    from pyspark import context
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
Answered By: szentesmarci

Okay, so I’m going to answer my own question in the hope that someone out there won’t suffer the same days of frustration! It turns out it was a combination of missing code and a bad setup.

Editing the code:
I did indeed need to initialise a Spark context by adding the following to the preamble of my code:

from pyspark import SparkContext
sc = SparkContext('local')

So the full code will be:

from pyspark import SparkContext
sc = SparkContext('local')

from flask import Flask, request
app = Flask(__name__)

@app.route('/whateverYouWant', methods=['POST']) # can set first param to '/'
def toyFunction():
    posted_data = sc.parallelize([request.get_data()])
    return str(posted_data.collect()[0])

if __name__ == '__main__':
    app.run(port=8080)    # note: port set to 8080!

Editing the setup:
It is essential that the file (yourfilename.py) is in the correct directory, namely it must be saved to the folder /home/ubuntu/spark-1.5.0-bin-hadoop2.6.

Then issue the following command within the directory:

./bin/spark-submit yourfilename.py

which initiates the service at 10.0.0.XX:8080/whateverYouWant/.

Note that the port must be set to 8080 or 8081: by default Spark only allows the web UI on these ports, for the master and worker respectively.

You can test the service with a REST client, or by opening a new terminal and sending POST requests with cURL:

curl --data "DATA YOU WANT TO POST" http://10.0.0.XX:8080/whateverYouWant/
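
Equivalently, here is a short test-client sketch in Python, assuming the requests package is installed and the service is reachable at the address above:

# Assumption: `requests` is installed and the app is running at 10.0.0.XX:8080.
import requests

resp = requests.post("http://10.0.0.XX:8080/whateverYouWant/",
                     data="DATA YOU WANT TO POST")
print(resp.status_code, resp.text)
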
Answered By: Matt

I was able to fix this problem by adding the location of PySpark and py4j to the path in my flaskapp.wsgi file. Here’s the full content:

import sys
sys.path.insert(0, '/var/www/html/flaskapp')
sys.path.insert(1, '/usr/local/spark-2.0.2-bin-hadoop2.7/python')
sys.path.insert(2, '/usr/local/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip')

from flaskapp import app as application
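
Note that the py4j zip name is version-specific, so it has to match the Spark release that is installed. Here is a slightly more robust sketch, assuming the standard Spark directory layout, that discovers the zip instead of hard-coding its version:

import glob
import sys

sys.path.insert(0, '/var/www/html/flaskapp')

# Assumption: standard Spark layout; pick up whichever py4j version ships with it.
spark_python = '/usr/local/spark-2.0.2-bin-hadoop2.7/python'
sys.path.insert(1, spark_python)
sys.path.insert(2, glob.glob(spark_python + '/lib/py4j-*-src.zip')[0])

from flaskapp import app as application
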
Answered By: xvladus1