Document AI process document fails with invalid argument when processing docs from GCS

Question:

I am getting an error very similar to the one below, but I am not in the EU:
Document AI: google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument

When I use the raw_document and process a local pdf file, it works fine. However, when I specify a pdf file on a GCS location, it fails.

Error message:

the processor name: projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97
the form process request: name: "projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97"
inline_document {
  uri: "gs://xxxx/temp/test1.pdf"
}

Traceback (most recent call last):
  File "C:\Python39\lib\site-packages\google\api_core\grpc_helpers.py", line 66, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "C:\Python39\lib\site-packages\grpc\_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "C:\Python39\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "Request contains an invalid argument."
        debug_error_string = "{"created":"@1647296055.582000000","description":"Error received from peer ipv4:142.250.80.74:443","file":"src/core/lib/surface/call.cc","file_line":1070,"grpc_message":"Request contains an invalid argument.","grpc_status":3}"
>

Code:

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    print(f'the processor name: {name}')

    # document = {"uri": gcs_path, "mime_type": "application/pdf"}
    document = {"uri": gcs_path}
    inline_document = documentai.Document()
    inline_document.uri = gcs_path
    # inline_document.mime_type = "application/pdf"

    # Configure the process request
    # request = {"name": name, "inline_document": document}
    request = documentai.ProcessRequest(
        inline_document=inline_document,
        name=name
    )

    print(f'the form process request: {request}')

    result = client.process_document(request=request)

I do not believe I have permission issues on the bucket since the same set up works fine for a document classification process on the same bucket.

Asked By: sacoder

||

Answers:

This is a known issue for Document AI, and it is already reported in this issue tracker. Unfortunately, the only workarounds for now are to either:

  1. Download your file, read it as bytes, and use process_document() with raw_document. See Document AI local processing for the sample code.
  2. Use batch_process_documents(), since by default it only accepts files from GCS. This is the option to pick if you don’t want the extra step of downloading the file.
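
Option 1 could look roughly like the sketch below. It is only an illustration: the helper names `split_gcs_uri` and `process_gcs_pdf` are made up for this example, the `application/pdf` MIME type and the regional endpoint format are assumptions (the question does not show how `opts` was built), and it assumes the `google-cloud-storage` and `google-cloud-documentai` packages are installed.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/file.pdf' into (bucket, blob_name). Hypothetical helper."""
    parsed = urlparse(gcs_uri)
    if parsed.scheme != "gs" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Not a valid GCS URI: {gcs_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def process_gcs_pdf(project_id: str, location: str, processor_id: str, gcs_uri: str):
    """Download the file from GCS and send its bytes as a raw_document."""
    # Imported lazily so the URI helper above works without the cloud SDKs.
    from google.cloud import documentai, storage

    # Step 1: fetch the file contents from the bucket as bytes.
    bucket_name, blob_name = split_gcs_uri(gcs_uri)
    content = storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()

    # Step 2: call the processor with the bytes instead of a GCS URI.
    # Assumption: the usual "<location>-documentai.googleapis.com" endpoint.
    opts = {"api_endpoint": f"{location}-documentai.googleapis.com"}
    client = documentai.DocumentProcessorServiceClient(client_options=opts)
    name = client.processor_path(project_id, location, processor_id)

    # Assumption: the input is a PDF; adjust mime_type for other file types.
    raw_document = documentai.RawDocument(content=content, mime_type="application/pdf")
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    return client.process_document(request=request).document
```

With the values from the question, this would be called as `process_gcs_pdf(project_id, "us", processor_id, "gs://xxxx/temp/test1.pdf")`. The whole file is held in memory, so this only suits documents within the online processing size limits.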
Answered By: Ricco D

This is still an issue 5 months later. Something not mentioned in the accepted answer (and I could be wrong, but it seems to me) is that batch processes can only write their results to GCS, so you’ll still incur the extra step of downloading something from a bucket: the input document under Option 1, or the result under Option 2. On top of that, you’ll have to clean up the bucket if you don’t want the results left there. So in many circumstances Option 2 won’t offer much of an advantage, other than that the result download will probably be smaller than the input file download.
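
To make the extra download and cleanup steps concrete, here is a rough sketch of what Option 2 involves. The function names `build_batch_request` and `batch_process_and_fetch` are hypothetical, the PDF MIME type, the 600-second timeout, and the endpoint format are assumptions, and it assumes the output prefix is dedicated to this job, because every object under it gets deleted.

```python
from typing import Any, Dict


def build_batch_request(name: str, input_uri: str, output_uri: str) -> Dict[str, Any]:
    """Build a batch request as a plain dict. Hypothetical helper; assumes a PDF input."""
    return {
        "name": name,
        "input_documents": {
            "gcs_documents": {
                "documents": [{"gcs_uri": input_uri, "mime_type": "application/pdf"}]
            }
        },
        "document_output_config": {"gcs_output_config": {"gcs_uri": output_uri}},
    }


def batch_process_and_fetch(name, input_uri, output_bucket, output_prefix, location="us"):
    """Run a batch job, read the JSON results back from GCS, then clean up."""
    # Imported lazily so the request builder works without the cloud SDKs.
    from google.cloud import documentai, storage

    # Assumption: the usual "<location>-documentai.googleapis.com" endpoint.
    client = documentai.DocumentProcessorServiceClient(
        client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
    )
    output_uri = f"gs://{output_bucket}/{output_prefix}"
    operation = client.batch_process_documents(
        request=build_batch_request(name, input_uri, output_uri)
    )
    operation.result(timeout=600)  # block until the batch job finishes

    documents = []
    for blob in storage.Client().list_blobs(output_bucket, prefix=output_prefix):
        # The extra download step: results only exist as JSON files in the bucket.
        if blob.name.endswith(".json"):
            documents.append(
                documentai.Document.from_json(
                    blob.download_as_bytes(), ignore_unknown_fields=True
                )
            )
        # The cleanup step: deletes every object under the prefix, JSON or not.
        blob.delete()
    return documents
```

A large input can be split across several output JSON files, which is why the function returns a list instead of a single document.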

I’m using the client library in a Python Cloud Function and I’m affected by this issue. I’m implementing Option 1 because it seems simplest, and I’m holding out for the fix. I also considered using the Workflows client library to fire a Workflow that runs a Document AI process, or calling the Document AI REST API directly, but it’s all very suboptimal.

Answered By: InternetDenizen