Job graph too large to submit to Google Cloud Dataflow

Question:

I am trying to run a job on Dataflow, and whenever I try to submit it to run with DataflowRunner, I receive the following errors from the service:

{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Request payload size exceeds the limit: x bytes.",
    "reason" : "badRequest"
  } ],
  "message" : "Request payload size exceeds the limit: x bytes.",
  "status" : "INVALID_ARGUMENT"
}
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "(3754670dbaa1cc6b): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "reason" : "badRequest",
    "debugInfo" : "detail: "(3754670dbaa1cc6b): CreateJob fails due to Spanner error: New value exceeds the maximum size limit for this column in this database: Jobs.CloudWorkflowJob, size: 17278017, limit: 10485760."n"
  } ],
  "message" : "(3754670dbaa1cc6b): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
  "status" : "INVALID_ARGUMENT"
}

How can I change my job to be smaller, or increase the job size limit?

Asked By: Pablo


Answers:

There is a workaround for this issue that will allow you to increase the size of your job graph up to 100 MB: specify the experiment --experiments=upload_graph.

The experiment activates a new submission path, which uploads the job file to GCS and creates the job via an HTTP request that does not contain the job graph itself, only a reference to it.

This has the shortcoming that the UI may not be able to show the job properly, as it relies on API requests to share the job.
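
For reference, here is a minimal sketch of how the experiment can be enabled with the Beam Java SDK. This assumes the usual Dataflow options setup; the project/region values in the comment are placeholders, and passing --experiments=upload_graph on the command line works just as well as setting it programmatically.

import java.util.Collections;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UploadGraphExample {
  public static void main(String[] args) {
    // Typically launched with flags such as:
    //   --runner=DataflowRunner --project=my-project --region=us-central1 --experiments=upload_graph
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Alternatively, enable the experiment programmatically
    // (note this replaces any experiments passed on the command line).
    options.setExperiments(Collections.singletonList("upload_graph"));

    Pipeline pipeline = Pipeline.create(options);
    // ... build the rest of the pipeline as usual ...
    pipeline.run();
  }
}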


An extra note: It is still good practice to reduce the size of your job graph.

An important tip: anonymous DoFns / lambda functions can end up with a very large context in their closure, so I recommend looking into any closures in your code and making sure they are not pulling very large objects into the serialized job graph.

Perhaps avoiding anonymous lambdas/DoFns will help, as the context will then remain part of the enclosing class rather than being pulled into the serialized objects.
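
To illustrate the point, here is a small sketch using the Beam Java SDK. The first transform captures a large map in an anonymous DoFn, so the whole map ends up serialized into the job graph; the second uses a named DoFn that rebuilds the data on the worker. The loadReferenceData helper is a placeholder and not part of any real API.

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ClosureExample {

  // Problematic: the anonymous DoFn captures hugeLookup, so the whole map is
  // Java-serialized into the job graph that gets submitted to the service.
  static PCollection<String> withCapturedMap(
      PCollection<String> keys, Map<String, String> hugeLookup) {
    return keys.apply(
        "LookupCaptured",
        ParDo.of(
            new DoFn<String, String>() {
              @ProcessElement
              public void process(@Element String key, OutputReceiver<String> out) {
                out.output(hugeLookup.getOrDefault(key, "unknown"));
              }
            }));
  }

  // Better: a named DoFn that rebuilds the map on the worker in @Setup, so the
  // serialized DoFn stays small and the data never enters the job graph.
  static class LookupFn extends DoFn<String, String> {
    private transient Map<String, String> lookup;

    @Setup
    public void setup() {
      lookup = loadReferenceData();
    }

    @ProcessElement
    public void process(@Element String key, OutputReceiver<String> out) {
      out.output(lookup.getOrDefault(key, "unknown"));
    }
  }

  static PCollection<String> withSetupLoadedMap(PCollection<String> keys) {
    return keys.apply("LookupInSetup", ParDo.of(new LookupFn()));
  }

  // Placeholder for however the reference data is actually loaded on the
  // worker (e.g. from GCS or a database); assumed for illustration only.
  static Map<String, String> loadReferenceData() {
    return new HashMap<>();
  }
}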

Answered By: Pablo

I tried some of the suggestions above first, but it looks like the upload_graph experiment mentioned in the other answer got removed in later versions.

What solved the issue for me when adding a map with 90k+ entries to an existing pipeline was to introduce this map as a MapSideInput. The general approach is documented in this example from the Scio framework I used to interact with Beam:

https://spotify.github.io/scio/examples/RefreshingSideInputExample.scala.html

There was no noticeable performance impact from this approach.
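
The linked example is Scala/Scio. For readers on the Beam Java SDK, a rough equivalent of the same idea is to build the map at execution time and pass it into the DoFn as a View.asMap() side input, so the entries never appear in the submitted graph. This is only a sketch under assumed types (String keys and values) and a hypothetical GCS path:

import java.util.Map;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MapSideInputExample {

  // Builds the lookup map at execution time (here from a hypothetical
  // "key,value" CSV in GCS) and exposes it as a map-shaped side input,
  // so the 90k+ entries never appear in the submitted job graph.
  static PCollection<String> enrich(PCollection<String> keys) {

    PCollectionView<Map<String, String>> lookupView =
        keys.getPipeline()
            .apply("ReadLookup", TextIO.read().from("gs://my-bucket/lookup.csv"))
            .apply(
                "ParseLookup",
                MapElements.into(
                        TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                    .via(line -> {
                      String[] parts = line.split(",", 2);
                      return KV.of(parts[0], parts[1]);
                    }))
            .apply("AsMapView", View.asMap());

    return keys.apply(
        "LookupViaSideInput",
        ParDo.of(
                new DoFn<String, String>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    Map<String, String> lookup = c.sideInput(lookupView);
                    c.output(lookup.getOrDefault(c.element(), "unknown"));
                  }
                })
            .withSideInputs(lookupView));
  }
}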

Answered By: markus