Workflows
Cloud Workflows lets you orchestrate and automate Google Cloud and HTTP-based API services with serverless workflows.
There are many kinds of tools to implement different types of workflows.
Google Cloud’s first general-purpose workflow orchestration tool was Cloud Composer. Based on Apache Airflow, Cloud Composer is great for data engineering pipelines like ETL orchestration, big data processing, or machine learning workflows, and integrates well with data products like BigQuery or Dataflow. Cloud Composer is a natural choice if your workflow needs to run a series of jobs in a data warehouse or big data cluster, and save results to a storage bucket.
If you want to process events or chain APIs in a serverless way, or have workloads that are bursty or latency-sensitive, it may be better to use Workflows.
https://cloud.google.com/blog/products/application-development/get-to-know-google-cloud-workflows
https://medium.com/google-cloud/gcp-cloud-workflows-orchestrate-in-declarative-way-3cfacda25028
Google Cloud provides services supporting both Orchestration and Choreography approaches.
Pub/Sub and Eventarc are both suited for choreography of event-driven services, whereas Workflows is suited for centrally orchestrated services.
Workflows is a service to orchestrate not only Google Cloud services, such as Cloud Functions and Cloud Run, but also external services. Should there be a central orchestrator controlling all interactions between services or should each service work independently and only interact through events? This is the central question in the Orchestration vs Choreography debate.
In Orchestration, a central service defines and controls the flow of communication between services. With centralization, it becomes easier to change and monitor the flow and apply consistent timeout and error policies.
In Choreography, each service registers for and emits events as it needs. There’s usually a central event broker to pass messages around, but it does not define or direct the flow of communication. This allows services to be truly independent, at the expense of a flow and policies that are harder to trace and manage.
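The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the service functions (`reserve_stock`, `charge_payment`), the topic names, and the in-memory `Broker` class are hypothetical stand-ins, not part of Workflows, Pub/Sub, or Eventarc.

```python
from collections import defaultdict


def reserve_stock(order):
    # Hypothetical service: marks the order as reserved.
    return {**order, "reserved": True}


def charge_payment(order):
    # Hypothetical service: marks the order as charged.
    return {**order, "charged": True}


def orchestrate(order):
    """Orchestration: one central function owns the sequence of calls."""
    for step in (reserve_stock, charge_payment):
        order = step(order)
    return order


class Broker:
    """Choreography: the broker only routes events; it defines no flow."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)


def choreograph(order):
    """Each service reacts to one event and emits the next one."""
    broker = Broker()
    result = {}

    def on_order_placed(event):
        broker.publish("stock.reserved", reserve_stock(event))

    def on_stock_reserved(event):
        broker.publish("payment.charged", charge_payment(event))

    broker.subscribe("order.placed", on_order_placed)
    broker.subscribe("stock.reserved", on_stock_reserved)
    broker.subscribe("payment.charged", result.update)
    broker.publish("order.placed", order)
    return result
```

Both functions produce the same final order, but in `orchestrate` the flow is readable in one place, while in `choreograph` it only emerges from the subscriptions.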
Workflows publishes connectors that make it easier to call other Google Cloud APIs from within a workflow, helping you integrate your workflows with other Google Cloud products. For example, you can use connectors to publish Pub/Sub messages, read or write data to a Firestore database, or retrieve authentication keys from Secret Manager.
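As a rough sketch of what a connector call looks like, the Python snippet below builds a small workflow definition and prints it as JSON (Workflows accepts definitions in YAML or JSON). The step names, the helper `build_publish_workflow`, and the project and topic values are placeholders chosen for illustration; check the connector reference linked below for the exact arguments before relying on it.

```python
import json


def build_publish_workflow(project, topic):
    """Return a list of workflow steps that publishes one Pub/Sub message.

    The project and topic are placeholders supplied by the caller.
    """
    return [
        {
            "publish_message": {
                # Pub/Sub connector method, as listed in the connector reference.
                "call": "googleapis.pubsub.v1.projects.topics.publish",
                "args": {
                    "topic": f"projects/{project}/topics/{topic}",
                    "body": {
                        "messages": [
                            # Pub/Sub expects base64-encoded message data.
                            {"data": '${base64.encode(text.encode("hello"))}'}
                        ]
                    },
                },
                "result": "publish_result",
            }
        },
        {"done": {"return": "${publish_result}"}},
    ]


definition = build_publish_workflow("my-project", "my-topic")
print(json.dumps(definition, indent=2))
```

The printed JSON could then be saved to a file and deployed as a workflow source, assuming the project and topic exist and the workflow's service account may publish to the topic.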
https://cloud.google.com/workflows/docs/reference/googleapis
https://cloud.google.com/workflows/docs/reference/syntax/syntax-cheat-sheet
https://cloud.google.com/workflows/docs/connectors-samples
https://github.com/GoogleCloudPlatform/workflows-samples/tree/main/src/connectors
https://github.com/GoogleCloudPlatform/workflows-demos/tree/master/connector-compute
https://cloud.google.com/blog/topics/developers-practitioners/introducing-workflows-callbacks
https://medium.com/google-cloud/replicate-data-from-bigquery-to-cloud-sql-2b23a08c52b1
https://cloud.google.com/workflows/docs/execute-parallel-steps
https://medium.com/google-cloud/workflows-tipsn-tricks-d7196eb5098d
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow.
The Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing workflows. A Workflow Template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs.
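To make the "graph of jobs" idea concrete, here is a minimal sketch that models a template as a mapping from each job to the jobs it depends on, and derives a valid execution order with the standard library. The job names and the `execution_order` helper are hypothetical; this does not call the Dataproc API.

```python
from graphlib import TopologicalSorter

# Hypothetical template: each job id maps to the ids of the jobs it depends on.
template = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"clean", "aggregate"},
}


def execution_order(jobs):
    """Return job ids in an order that respects every dependency."""
    return list(TopologicalSorter(jobs).static_order())


order = execution_order(template)
print(order)
```

A real scheduler would also run independent jobs in parallel; the sketch only shows how dependencies constrain the order.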
Dataflow is a managed service for executing a wide variety of data processing patterns.
Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on containers. AI Hub has support for Kubeflow Pipelines.
A survey of pipeline & workflow tools.
Orchestration often refers to the automated configuration, coordination, and management of computer systems and services.
In the context of service-oriented architectures, orchestration can range from executing a single service at a specific time and day, to a more sophisticated approach of automating and monitoring multiple services over longer periods of time, with the ability to react and handle failures as they crop up. In the data engineering context, orchestration is central to coordinating the services and workflows that prepare, ingest, and transform data. It can go beyond data processing and also involve a workflow to train a machine learning (ML) model from the data.
Google Cloud Platform offers a number of tools and services for orchestration:
- Cloud Scheduler for schedule-driven single-service orchestration
- Workflows for complex multi-service orchestration
- Cloud Composer for orchestration of your data workloads
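The ability to react to and handle failures, mentioned above, often comes down to a retry policy applied by the orchestrator. The sketch below shows one common form, retries with exponential backoff, in plain Python. The function name `run_with_retries` and its parameters are assumptions for illustration and do not correspond to any Google Cloud API.

```python
import time


def run_with_retries(step, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call step() until it succeeds or max_attempts is reached.

    The delay doubles after each failed attempt (exponential backoff).
    The last failure is re-raised so the caller can handle it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can substitute a function that records the delays instead of waiting.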
https://cloud.google.com/blog/topics/developers-practitioners/implementing-saga-pattern-workflows
https://glaforge.appspot.com/article/sending-an-email-with-sendgrid-from-workflows
https://glaforge.appspot.com/article/load-and-use-json-data-in-your-workflow-from-gcs
Review the comparisons of Airflow, Luigi, Argo, MLflow and Kubeflow
https://github.com/PrefectHQ/prefect
https://github.com/snakemake/snakemake
https://medium.com/@kolban1/gcp-workflows-visual-editor-9876fb1c823f
https://github.com/danielecook/Awesome-Bioinformatics#workflow-managers
https://github.com/pditommaso/awesome-pipeline
https://blog.sellorm.com/2018/06/02/first-steps-with-data-pipelines/
https://the-turing-way.netlify.app/reproducible-research/make.html
https://github.com/kyclark/make-tutorial
https://www.gnu.org/software/parallel/
https://cloud.google.com/workflows/docs/create-workflow-console
https://cloud.google.com/workflows/docs/run/tutorial-cloud-run
https://cloud.google.com/workflows/docs/tutorial-translation-connector
https://medium.com/google-cloud/long-running-job-with-cloud-workflows-38b57bea74a5
https://cloud.google.com/community/tutorials/ml-pipeline-with-workflows
https://medium.com/google-cloud/bigquery-snapshot-dataset-with-cloud-workflow-5175eb8df00b
Perform a large metagenomics sequencing experiment – 96 10X Genomics linked read libraries sequenced across 25 lanes on a HiSeq4000 in GCP.
https://medium.com/google-cloud/cromwell-hello-gcp-833c18df3caf
https://medium.com/google-cloud/worklows-state-management-with-firestore-99237f08c5c5
https://medium.com/google-cloud/executing-commands-gcloud-kubectl-from-workflows-ad6b85eaf39c