Running large pipelines on GCP

Question:

I have a one-off pipeline that runs locally, and I want to scale it in the cloud.

  1. The script reads data from a large (30 TB), static S3 bucket made up of PDFs.
  2. I pass these PDFs through a ThreadPool to a Docker container, which gives me an output.
  3. I save the output to a file.

I can only test it locally on a small fraction of this dataset. The whole pipeline would take a couple of days to run on a MacBook Pro.

I’ve been trying to replicate this on GCP – which I am still discovering.

  • Using Cloud Functions doesn’t work well because of its maximum timeout.
  • A full Cloud Composer architecture seems like overkill for a very straightforward pipeline that doesn’t require Airflow.
  • I’d like to avoid rewriting this as an Apache Beam pipeline for Dataflow.

What is the best way to run such a Python data processing pipeline with a container on GCP?

Asked By: Matthieu

||

Answers:

I would suggest looking at two other options that should meet your requirements: Google Kubernetes Engine and Google Compute Engine.

Google Kubernetes Engine (GKE) provides a managed environment for deploying, managing, and scaling containerized applications on Google infrastructure. A GKE environment consists of multiple machines (specifically, Compute Engine instances) grouped together to form a cluster. GKE manages the cluster’s infrastructure for you, so you do not have to configure and monitor it yourself, while still giving you a complete Kubernetes experience. Please refer to the documentation to learn how to deploy an app in a container image to a GKE cluster.
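
To make the GKE option more concrete, here is a minimal sketch of how the work might be expressed as a Kubernetes Job in Indexed completion mode, where each pod handles one slice of the PDFs. The Job name, image name, and the NUM_SHARDS environment variable are hypothetical and only there for illustration; the manifest fields themselves (completions, parallelism, completionMode) are standard batch/v1 Job fields.

```python
import json


def build_indexed_job(name, image, num_shards, parallelism):
    """Return a batch/v1 Job manifest as a plain dict.

    The Job runs `num_shards` pods in total, at most `parallelism` at a time.
    In Indexed mode Kubernetes sets JOB_COMPLETION_INDEX (0..num_shards-1)
    in each pod, which the worker can use to pick its slice of the input.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": num_shards,
            "parallelism": parallelism,
            "completionMode": "Indexed",
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            # NUM_SHARDS is a hypothetical variable that the
                            # worker code would have to read itself.
                            "env": [
                                {"name": "NUM_SHARDS", "value": str(num_shards)}
                            ],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical names; replace with your own project and image.
    manifest = build_indexed_job(
        "pdf-pipeline", "gcr.io/my-project/pdf-worker:latest", 100, 10
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML. Whether this is worth the extra cluster setup for a one-off run is a judgment call.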

Google Compute Engine (GCE) is an infrastructure as a service (IaaS) offering that lets you create and run scalable, flexible virtual machines on Google’s infrastructure. It is a good fit when throughput, stability, pricing, backups, and security matter to you. Please refer to the documentation to learn how to create and start a virtual machine.
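
If one VM turns out to be too slow, a simple way to spread a static dataset over several VMs is to assign each object key to a worker deterministically. The sketch below assumes you already have the list of object keys and that each VM knows its own index and the total number of workers; the function names are hypothetical.

```python
import hashlib


def shard_of(key, num_shards):
    """Map an object key to a shard number in [0, num_shards).

    A cryptographic hash is used instead of Python's built-in hash(),
    because hash() of a str is randomized per process and would give
    different assignments on different machines.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def keys_for_worker(keys, worker_index, num_workers):
    """Return the subset of keys that this worker is responsible for."""
    return [k for k in keys if shard_of(k, num_workers) == worker_index]
```

For example, VM number 2 out of 4 would call keys_for_worker(all_keys, 2, 4) and process only that subset. Because the bucket is static, every VM computes the same assignment without any coordination.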

As this question is mostly about architectural guidance, you may also want to reach out to Google Cloud sales.

Answered By: Sandeep Vokkareni

Thanks to the useful comments in the original post, I explored other alternatives on GCP.

Using a VM on Compute Engine worked perfectly. The overhead was much lower than I expected, and the setup went smoothly.
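
For readers who want a starting point, here is a rough sketch of what the thread-pool part of such a pipeline might look like on a VM. It is not the exact script from the question. It assumes the PDFs have already been downloaded to a local directory and that the container accepts the path of a PDF as its only argument and writes its result to stdout; the image name and paths are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def docker_command(image, input_dir, pdf_name):
    """Build the `docker run` command for one PDF.

    Assumption: the container takes the PDF path as its single argument.
    The input directory is mounted read-only at /data inside the container.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{input_dir}:/data:ro",
        image,
        f"/data/{pdf_name}",
    ]


def run_container(image, input_dir, pdf_name):
    """Run the container for one PDF and return its stdout."""
    result = subprocess.run(
        docker_command(image, input_dir, pdf_name),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def process_all(pdf_names, worker, max_workers=8):
    """Run `worker(name)` for every PDF in a thread pool.

    Returns (results, failures) so that one bad PDF does not stop the run.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, name): name for name in pdf_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # record the error and keep going
                failures[name] = exc
    return results, failures
```

A call might look like process_all(names, lambda n: run_container("my-image", "/mnt/pdfs", n)). For a dataset of this size you would probably feed the pool in batches and write each output to disk as it completes, rather than holding every result in memory.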

Answered By: Matthieu