The decision to make tradeoffs was hard. I wanted a production-ready design, but I also wanted to follow through to completion as soon as possible. I settled on a design that pulled from the source database while minimizing compute: by fetching only the latest updates instead of rewriting the raw data on each run, it matched the performance of a scaled-up system. Another tradeoff was foregoing techniques for managing slowly changing dimensions. Over the course of this project, I learned how important SCD handling is in a system built to scale out into the future with sensitive analytics.
Databricks is popular, and choosing it over Snowflake is not a bad call. With compute built in, I was able to focus on transformations with PySpark, and the orchestration is as good as or better than anything else available. I started with dbt and decided against it: connecting it to Athena proved that I don't understand dbt yet. Learning dbt should start with a connection to Redshift, DuckDB, or BigQuery, not Glue and Athena. So add Databricks and PySpark to my toolbelt. Add Athena and Glue as well. Save dbt and Snowflake for another week.
I learned modern data engineering techniques and found many good books and resources along the way. The project was engaging. Data engineering piques my interest, and I am looking forward to new opportunities working with data.
