Unable to stream API response in Flask application on Google Cloud application

Question:

I’m developing a little testing website using the OpenAI API. I’m trying to stream GPT’s response, just like how it’s done on https://chat.openai.com/chat. This works just fine when running my Flask application on a local development server, but when I deploy this app to Google Cloud, the response is given in one go, instead of being streamed. I have tried disabling buffering according to https://cloud.google.com/appengine/docs/flexible/how-requests-are-handled?tab=python#x-accel-buffering, but that didn’t resolve the issue. I suspect that my issue lies in how I’m configuring my app on Google Cloud (or the lack thereof).

This is what I’ve got going on currently, this works when running the application locally.

main.py

@app.route('/stream_response', methods=['POST'])
def stream_response():
    prompt_text = request.form['prompt']

    def generate():
        for chunk in gpt_model.get_response(prompt_text, stream=True):
            for choice in chunk['choices']:
                dictionary: dict = choice['delta']
                if 'content' in dictionary:
                    yield dictionary['content']
    
    response = Response(generate(), content_type='text/html')
    response.headers['X-Accel-Buffering'] = 'no'
    return response

prompt.html

<script>
    function streamResponse() {
        var promptText = document.getElementById("prompt").value;
        var xhr = new XMLHttpRequest();
        xhr.open("POST", "/stream_response", true);
        xhr.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
        xhr.onprogress = function () {
            document.getElementById("response-container").style.display = "block";
            document.getElementById("response").innerHTML = xhr.responseText;
            console.log(xhr.responseText)
        };
        xhr.send("prompt=" + encodeURIComponent(promptText));
    }
</script>

Google Cloud app.yaml

runtime: python310

handlers:
- url: /.*
  script: auto

Google Cloud deployment process

gcloud app deploy
Asked By: Milan Dierick

||

Answers:

From your app.yaml, it means you’re deploying to Google App Engine (GAE) Standard Environment. GAE doesn’t support streaming – see doc where it says

App Engine does not support streaming responses where data is sent in incremental chunks to the client while a request is being processed. All data from your code is collected as described above and sent as a single HTTP response.

Chat UIs also usually require sockets. GAE Standard doesn’t support web sockets but GAE Flex does (see docs)

Answered By: NoCommandLine