Designing an S3 Data Lake for Analytics
The Plan

A data lake is a good fit for hosting analytics data: it can hold large volumes, and storage is cheap at scale. There are other benefits too. A data lake is independent of any single query engine, so you can use whatever you like: Athena, Databricks, Snowflake, dbt, and so on.

The Move

I began by hooking up Athena and AWS Glue to read from my raw S3 bucket. I considered a local dbt project with Athena as the SQL engine for transformations, but the configuration proved too difficult. Instead, I went with Databricks on top of Spark, where I could easily define the orchestration layer in one place.

The architecture I chose for the data pipeline is this: Postgres to raw S3 storage via a Python batch program. The script filters on updated_at to fetch and upload only the newest records; reading only recent changes and writing in append mode avoids unnecessary compute. Databricks orchestrates the transformation jobs through my medallion layers, running Spark via PySpark.
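The incremental step can be sketched as a small pure function: given the last sync watermark and a batch of source rows, keep only rows updated since that watermark and advance the watermark. This is a minimal illustration, not the actual script; in the real job the rows would come from a Postgres query filtering on updated_at, and the kept rows would be written to the raw S3 prefix in append mode.

```python
from datetime import datetime

def incremental_batch(rows, last_sync):
    """Return (new_rows, new_watermark) for an updated_at-based sync.

    rows      -- dicts with at least an "updated_at" datetime key
    last_sync -- watermark from the previous run
    """
    # Keep only records changed since the last successful sync.
    new_rows = [r for r in rows if r["updated_at"] > last_sync]
    if not new_rows:
        # Nothing new: the watermark stays where it was.
        return [], last_sync
    # Advance the watermark to the newest record we saw.
    watermark = max(r["updated_at"] for r in new_rows)
    return new_rows, watermark
```

Persisting the watermark between runs (in a state file or a small control table) is what makes the batch job restartable without re-uploading the whole table.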

Storage Design

The S3 bucket is partitioned by medallion layer: raw, staged, and analytics. The folder structure that makes sense is this:
s3://media-retail-314159/raw/<table>/<date>/*.parquet
s3://media-retail-314159/staged/<table>/*.parquet
s3://media-retail-314159/analytics/<table>/*.parquet
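Building these object keys in one place keeps the layout consistent across the batch uploader and the Databricks jobs. A minimal sketch, where the helper names are illustrative (only the bucket name and prefixes come from the layout above; note that the raw layer alone carries a date partition):

```python
BUCKET = "media-retail-314159"

def raw_key(table: str, date: str, filename: str) -> str:
    # Raw layer is additionally partitioned by load date.
    return f"s3://{BUCKET}/raw/{table}/{date}/{filename}"

def staged_key(table: str, filename: str) -> str:
    return f"s3://{BUCKET}/staged/{table}/{filename}"

def analytics_key(table: str, filename: str) -> str:
    return f"s3://{BUCKET}/analytics/{table}/{filename}"
```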