From OLTP Database to Analytics Data Pipeline
The Plan

I started with an audio-media sales database that I wanted to get in shape for analytics. Analytics tables are designed primarily to support business reporting, so the OLTP schema needed reshaping. I wrote a Python program to move my AWS RDS Postgres database into raw storage. I could have used AWS Glue to export the database to S3 with my eyes closed, but I wanted to reason through the logic myself for the fun of it.

Incremental Extraction

The Postgres-to-S3 batch script extracts data and sends parquet files to the raw data bucket. To support incremental uploads in the pipeline, I added an updated_at field to each table, so the program loads only the records updated since the previous run. In earlier iterations of the ETL script I tried other ways to batch-load data, including loading invoices by year, paging with a LIMIT-OFFSET query, and tracking the last seen primary key. The current script loads records by the updated_at field and appends new data, which makes raw storage a running history of transactions.
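A minimal sketch of that watermark approach, assuming pandas and a live DB-API connection; the table name and everything beyond the updated_at column are illustrative, not the script's actual code:

```python
from datetime import datetime, timezone


def build_incremental_query(table: str, watermark: datetime) -> str:
    """Select only the rows touched since the previous run's watermark."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE updated_at > '{watermark.isoformat()}' "
        f"ORDER BY updated_at"
    )


def extract_since(conn, table: str, watermark: datetime):
    """Run the incremental query; the new watermark is the max updated_at
    in the returned frame (pandas imported here to keep the sketch light)."""
    import pandas as pd

    df = pd.read_sql(build_incremental_query(table, watermark), conn)
    new_watermark = df["updated_at"].max() if not df.empty else watermark
    return df, new_watermark
```

Persisting the watermark between runs (a small state file or table) is what lets each run append only new changes, so the raw layer accumulates a history instead of duplicating rows.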

The transform layer in the batch ETL script is minimal; its main job is to validate schema expectations before loading. Design for failure and validate often in data engineering. The load module uses the AWS SDK for Python, boto3: it serializes the dataframe to a .parquet file before calling put_object.
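A sketch of what that validate-then-load step could look like, assuming pandas with a parquet engine such as pyarrow; the expected column set here is illustrative, not the script's real schema:

```python
import io

import pandas as pd

# Illustrative schema expectation for one table; the real script would keep
# one such set per extracted table.
EXPECTED_COLUMNS = {"invoice_id", "customer_id", "total", "updated_at"}


def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast if the extract is missing expected columns."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"schema check failed, missing columns: {missing}")
    return df


def load(df: pd.DataFrame, bucket: str, key: str) -> None:
    """Serialize the frame to parquet in memory and upload via put_object."""
    import boto3  # imported here so validation can run without AWS deps

    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
```

Validating before the upload means a schema drift in the source database fails the run loudly instead of quietly corrupting the raw layer.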

Data Warehouse on S3

The bucket separates raw, staged, and analytics layers. Like I said, the raw layer saves data changes in append mode. The staged layer mirrors the current state of the database and is stored as Delta tables over parquet.

The raw tables on S3: s3://media-retail/raw/(table_name)/load_date=(date)/*.parquet.
The staged tables in delta format on S3: s3://media-retail/staged/(table_name)/*.parquet.
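Writing the staged layer as Delta could look like the sketch below, using the deltalake (delta-rs) package; the helper names and part-file naming are my own, and mode="overwrite" reflects the staged layer mirroring current state rather than appending:

```python
def raw_key(table: str, load_date: str) -> str:
    """Append-only raw layer key, partitioned by load date (file name illustrative)."""
    return f"raw/{table}/load_date={load_date}/part-0000.parquet"


def staged_uri(bucket: str, table: str) -> str:
    """Delta table location for the staged layer."""
    return f"s3://{bucket}/staged/{table}"


def stage(df, bucket: str, table: str) -> None:
    """Overwrite the staged Delta table so it tracks current database state."""
    from deltalake import write_deltalake  # delta-rs Python bindings

    write_deltalake(staged_uri(bucket, table), df, mode="overwrite")
```

Keeping the path builders separate from the write makes the raw/staged layout easy to test without touching S3.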

Finish Line

After the data was loaded into S3, I pointed AWS Athena and AWS Glue at the raw data and began querying the dataset. My sales dataset was now in S3, and recent updates were landing in raw storage as expected. I was ready for dimensional modeling next.
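Once Glue has crawled the raw prefix into a catalog database, submitting a query from Python is a single boto3 call. A hedged sketch: the freshness query, the database name, and the output location are my assumptions, not the actual setup:

```python
def freshness_sql(table: str) -> str:
    """Sanity query: when did the newest record land for this table?"""
    return f"SELECT max(updated_at) AS latest FROM {table}"


def run_athena_query(sql: str, database: str, output_s3: str) -> str:
    """Submit a query through Athena and return its execution id."""
    import boto3  # imported here so the SQL helper is testable without AWS

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Athena runs asynchronously, so a real script would poll get_query_execution for the SUCCEEDED state before reading results.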