Datasets for Google Cloud: Introducing our new reference architecture

Submitted by
Style Pass
2021-05-24 20:30:06

We are so excited by the announcement of Datasets for Google Cloud. In this blog post, I'd like to share more details about the new reference architecture that we built for a more streamlined data onboarding process for the Google Cloud Public Datasets Program.

For us, data onboarding isn't only about pulling, transforming, and storing data from pre-existing sources into their desired destinations. It's also about making the resulting data easier to analyze, and providing a better experience for the developers who build and maintain data pipelines. The developer experience plays an increasingly vital role in the productivity of data engineering teams as they scale their efforts to hundreds or even thousands of data pipelines.

Our team uses Cloud Composer to manage and monitor data pipelines in a centralized and standardized way. Every data pipeline is represented as a directed acyclic graph (DAG), and every node (also known as a task) in a DAG is represented by an Apache Airflow operator. Each operator performs a single action, ranging from simple ones such as transferring data to and from Cloud Storage, to more complex ones such as using a Google Kubernetes Engine cluster to apply custom data transforms on large datasets. Because data engineers can monitor the state of each DAG execution and visualize it as a graph of operations, pipelines become easier to understand and maintain.
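To make the DAG idea concrete, here is a minimal sketch in plain Python, with no Airflow dependency. The task names and the `execution_order` helper are hypothetical and only illustrate how dependencies between tasks determine a valid run order; in a real Cloud Composer environment, Airflow's scheduler does this work and each task would be an operator instance rather than a string.

```python
from collections import deque

# Hypothetical pipeline: task name -> names of the tasks it depends on.
# Plain strings stand in for Airflow operators here.
PIPELINE = {
    "download_source_file": [],
    "transform_csv": ["download_source_file"],
    "upload_to_gcs": ["transform_csv"],
    "load_to_bigquery": ["upload_to_gcs"],
}


def execution_order(dependencies):
    """Return task names in an order that respects dependencies (Kahn's algorithm).

    Raises ValueError if the graph contains a cycle, i.e. is not a valid DAG.
    """
    # Number of upstream tasks each task is still waiting on.
    remaining = {task: len(upstream) for task, upstream in dependencies.items()}

    # Invert the mapping: task -> tasks that depend on it.
    downstream = {task: [] for task in dependencies}
    for task, upstream in dependencies.items():
        for parent in upstream:
            downstream[parent].append(task)

    # Tasks with no unmet dependencies can run immediately.
    ready = deque(task for task, count in remaining.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in downstream[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any task left unscheduled is part of a cycle.
    if len(order) != len(dependencies):
        raise ValueError("cycle detected: not a valid DAG")
    return order


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

The cycle check is what the "acyclic" in DAG buys you: if two tasks wait on each other, no valid order exists and the pipeline definition is rejected up front instead of hanging at run time.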

There are many components of a Cloud Composer environment that engineers must constantly manage to keep its pipelines operating like well-oiled machines: writing DAGs in a consistent and predictable manner; declaring, setting, and importing Airflow variables; and actuating other cloud resources that every pipeline relies on. Our new reference architecture aims to simplify all of this work by using YAML configuration files to unify control of these components.
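As a rough illustration of what "one config file controls everything" could look like, the sketch below uses a Python dict in place of a parsed YAML file. The section names (`dag`, `variables`, `resources`), the field names, and both helper functions are assumptions made for this example; they are not the schema or tooling of the actual reference architecture.

```python
# Hypothetical pipeline config, written as the dict a YAML parser would return.
# The schema is an assumption for illustration only.
CONFIG = {
    "dag": {
        "dag_id": "example_dataset_daily",
        "schedule": "@daily",
        "tasks": [
            {"id": "download", "operator": "BashOperator", "depends_on": []},
            {
                "id": "load",
                "operator": "GCSToBigQueryOperator",
                "depends_on": ["download"],
            },
        ],
    },
    # Values that would be set as Airflow variables.
    "variables": {"source_url": "https://example.com/data.csv"},
    # Cloud resources the pipeline relies on.
    "resources": [
        {"type": "bigquery_dataset", "name": "example_dataset"},
        {"type": "storage_bucket", "name": "example-bucket"},
    ],
}

REQUIRED_SECTIONS = ("dag", "variables", "resources")


def validate(config):
    """Return a list of problems; an empty list means the config is consistent."""
    problems = [
        f"missing section: {section}"
        for section in REQUIRED_SECTIONS
        if section not in config
    ]
    if problems:
        return problems

    tasks = config["dag"]["tasks"]
    task_ids = [task["id"] for task in tasks]
    if len(task_ids) != len(set(task_ids)):
        problems.append("duplicate task ids")

    # Every dependency must point at a task declared in the same config.
    for task in tasks:
        for dep in task.get("depends_on", []):
            if dep not in task_ids:
                problems.append(
                    f"task {task['id']!r} depends on unknown task {dep!r}"
                )
    return problems


def dependency_lines(config):
    """Render dependencies using Airflow's `upstream >> downstream` syntax."""
    return [
        f"{dep} >> {task['id']}"
        for task in config["dag"]["tasks"]
        for dep in task.get("depends_on", [])
    ]


if __name__ == "__main__":
    print(validate(CONFIG))
    print("\n".join(dependency_lines(CONFIG)))
```

The point of the design is that DAG structure, variables, and resources are declared side by side, so a single validation pass can catch inconsistencies before any DAG file is generated or any resource is created.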
