Unable to stream API response in Flask application deployed to Google Cloud
Question:
I’m developing a little testing website using the OpenAI API. I’m trying to stream GPT’s response, just like how it’s done on https://chat.openai.com/chat. This works just fine when running my Flask application on a local development server, but when I deploy this app to Google Cloud, the response is given in one go, instead of being streamed. I have tried disabling buffering according to https://cloud.google.com/appengine/docs/flexible/how-requests-are-handled?tab=python#x-accel-buffering, but that didn’t resolve the issue. I suspect that my issue lies in how I’m configuring my app on Google Cloud (or the lack thereof).
This is what I currently have. It works when running the application locally.
main.py
@app.route('/stream_response', methods=['POST'])
def stream_response():
    prompt_text = request.form['prompt']

    def generate():
        for chunk in gpt_model.get_response(prompt_text, stream=True):
            for choice in chunk['choices']:
                dictionary: dict = choice['delta']
                if 'content' in dictionary:
                    yield dictionary['content']

    response = Response(generate(), content_type='text/html')
    response.headers['X-Accel-Buffering'] = 'no'
    return response
prompt.html
<script>
    function streamResponse() {
        var promptText = document.getElementById("prompt").value;
        var xhr = new XMLHttpRequest();
        xhr.open("POST", "/stream_response", true);
        xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
        xhr.onprogress = function () {
            document.getElementById("response-container").style.display = "block";
            document.getElementById("response").innerHTML = xhr.responseText;
            console.log(xhr.responseText);
        };
        xhr.send("prompt=" + encodeURIComponent(promptText));
    }
</script>
Google Cloud app.yaml
runtime: python310
handlers:
- url: /.*
  script: auto
Google Cloud deployment process
gcloud app deploy
Answers:
Your app.yaml indicates that you're deploying to the Google App Engine (GAE) Standard Environment. GAE Standard doesn't support streaming responses. The documentation says:
App Engine does not support streaming responses where data is sent in incremental chunks to the client while a request is being processed. All data from your code is collected as described above and sent as a single HTTP response.
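To make the quoted behavior concrete, here is a minimal sketch of the difference between a front end that forwards each chunk and one that collects everything first. This is only a toy model, not App Engine's actual implementation, and the names `generate`, `streaming_front_end`, and `buffering_front_end` are hypothetical.

```python
from typing import Iterable, Iterator, List


def generate() -> Iterator[str]:
    """Stand-in for the Flask generator: yields the reply piece by piece."""
    yield from ["Hel", "lo, ", "world"]


def streaming_front_end(chunks: Iterable[str]) -> List[str]:
    """Forwards each chunk as its own write (what the local dev server does)."""
    return [chunk for chunk in chunks]


def buffering_front_end(chunks: Iterable[str]) -> List[str]:
    """Drains the generator first, then emits one combined write."""
    return ["".join(chunks)]
```

In the buffering case the client receives a single write, so the `xhr.onprogress` handler only ever sees the complete text, which matches what you observe after deployment.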
Chat UIs also often use WebSockets. GAE Standard doesn't support WebSockets, but GAE Flex does (see docs).
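If you want to confirm where the buffering happens, one option is to time when response bytes arrive at a client outside the browser. The sketch below uses only the Python standard library. The helper names `chunk_arrival_times` and `looks_streamed`, the 0.5 second threshold, and the URL in the usage comment are assumptions for illustration, not part of your app.

```python
import http.client
import time
from urllib.parse import urlencode, urlsplit


def chunk_arrival_times(url, prompt, read_size=64):
    """POST the prompt and return a list of (seconds_since_request, data) reads."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=60)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=60)
    body = urlencode({"prompt": prompt})
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    start = time.monotonic()
    conn.request("POST", parts.path or "/", body=body, headers=headers)
    resp = conn.getresponse()
    arrivals = []
    while True:
        # read1 returns whatever is available instead of waiting for the full body
        data = resp.read1(read_size)
        if not data:
            break
        arrivals.append((time.monotonic() - start, data))
    conn.close()
    return arrivals


def looks_streamed(arrivals, min_spread=0.5):
    """Heuristic: True if reads are spread over at least min_spread seconds."""
    if len(arrivals) < 2:
        return False
    return arrivals[-1][0] - arrivals[0][0] >= min_spread


# Hypothetical usage (replace the URL with your own deployment):
# arrivals = chunk_arrival_times("https://YOUR-APP.example/stream_response", "Hello")
# print(looks_streamed(arrivals))
```

Against a local development server the reads should be spread out over the generation time. Against a deployment that buffers, they should all arrive in a burst at the end.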