Databricks offers Jobs and Pipelines for configuring orchestration. The complete pipeline for my build is two simple steps. The Python script that moves Postgres data to raw storage lives outside Databricks and is run as a batch process for now. The staged and analytics transforms are multiple PySpark scripts in Databricks configured as a DAG.
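To make the first step concrete, here is a minimal sketch of what a Postgres-to-raw-storage batch script can look like. The table name, paths, and batch size are illustrative placeholders, not my actual configuration, and the helpers assume local CSV output; the commented section shows where a psycopg2 cursor would plug in.

```python
# Sketch of the extraction step: stream rows out of an OLTP source and
# land them in date-partitioned raw storage. Names and paths are hypothetical.
import csv
import os
from datetime import date


def raw_path(base_dir, table, run_date):
    """Build a date-partitioned path for a table's raw extract."""
    return os.path.join(base_dir, table, f"{run_date.isoformat()}.csv")


def write_batches(rows, path, columns, batch_size=10_000):
    """Stream rows to CSV in batches so large tables don't exhaust memory."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    written = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:
                writer.writerows(batch)
                written += len(batch)
                batch = []
        if batch:
            writer.writerows(batch)
            written += len(batch)
    return written


if __name__ == "__main__":
    # In the real batch script, rows would come from a database cursor, e.g.:
    #   import psycopg2
    #   conn = psycopg2.connect("dbname=shop")  # connection string is hypothetical
    #   cur = conn.cursor()
    #   cur.execute("SELECT id, amount FROM orders")
    #   write_batches(cur, raw_path("raw", "orders", date.today()), ["id", "amount"])
    pass
```

Batching the writes keeps memory flat regardless of table size, which matters once the OLTP tables grow past what fits comfortably in a single fetch.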
Databricks orchestration is basic and straightforward. Once connected to the data warehouse it feels like cheating, it's so easy. The Databricks platform offers a text editor for writing scripts in multiple languages, and includes Git and GitHub integration for version control. The 'Jobs & Pipelines' module makes it simple to run batch scripts as DAGs. I host my PySpark transform scripts on GitHub for visibility. The data pipeline I built resides in two places: a batch script on my machine that moves OLTP data out, and a set of transform scripts orchestrated within Databricks. I am considering moving the extraction script into the platform as well, grouping all of this into one 'Pipeline' that I can schedule altogether.
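For reference, a two-task DAG like the staged-then-analytics flow described above can be expressed as a Databricks job definition roughly like the sketch below. The job name, file paths, and cluster key are placeholders, not my actual setup; `depends_on` is what makes the analytics task wait for the staged task.

```json
{
  "name": "analytics_transforms",
  "tasks": [
    {
      "task_key": "staged",
      "spark_python_task": { "python_file": "transforms/staged.py" },
      "job_cluster_key": "shared_cluster"
    },
    {
      "task_key": "analytics",
      "depends_on": [{ "task_key": "staged" }],
      "spark_python_task": { "python_file": "transforms/analytics.py" },
      "job_cluster_key": "shared_cluster"
    }
  ]
}
```

The Jobs UI builds the same structure through clicks, which is part of why the orchestration feels so easy; the JSON form is handy once you want the job definition itself under version control.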
