Data Engineering · 20 March 2019

Building a Data Platform with Redshift, Looker, and dbt

In 2018-19, TouchNote needed to move from ad-hoc SQL queries to a proper analytics platform. Here's how we built our data stack with Redshift, Looker, and dbt — and what we'd do differently.

Redshift · Looker · dbt · StitchData · data-platform

The Analytics Maturity Problem

Most growing businesses go through the same analytics evolution: spreadsheets → ad-hoc SQL queries → a shared database everyone queries directly → chaos. At TouchNote, we were firmly in the "shared database, everyone queries differently" stage. Different people were building different queries for the same metrics and getting different answers. Decisions were being made on inconsistent numbers.

The solution was a dedicated analytics platform — separate from the operational database, with clean data models, shared metric definitions, and tooling for non-engineers to explore data.

The Stack

StitchData for ETL: Stitch is a managed ETL service that connects to your operational databases (MySQL, PostgreSQL) and SaaS tools (Stripe, Salesforce, etc.), and syncs the data to your warehouse. We used it to pull operational data from MySQL into Redshift on a schedule. The alternative — building custom ETL pipelines — was significantly more engineering effort for equivalent reliability.

Amazon Redshift as the data warehouse: Columnar storage optimised for analytical queries, tight AWS ecosystem integration, and reasonable cost at our scale. Alternatives we considered: BigQuery (excellent, but tighter Google Cloud tie-in), Snowflake (excellent, we'd use this today), Athena (good for ad-hoc S3 queries, less suitable as a primary warehouse).

dbt for data transformation: dbt (data build tool) transformed our raw replicated data into clean, analytics-ready models. dbt is SQL-first — transformations are just SELECT statements with a thin YAML configuration layer. The killer feature is the DAG (directed acyclic graph) — dbt understands the dependencies between your models and runs them in the right order.
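To make the DAG idea concrete, here's a minimal sketch of a dbt model (the model and column names are hypothetical, not from our actual project). Referencing another model via `ref()` is what declares the dependency — dbt reads these references and runs upstream models first:

```sql
-- models/mart_orders.sql (hypothetical model)
-- {{ ref('stg_orders') }} tells dbt this model depends on stg_orders,
-- so dbt builds stg_orders before this one — no manual ordering needed.
select
    customer_id,
    count(*)                        as order_count,
    sum(total_cents) / 100.0        as revenue
from {{ ref('stg_orders') }}
group by customer_id
```

At compile time dbt swaps `ref()` for the fully qualified table name, so the same SQL runs unchanged across dev and production schemas.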

Looker as the BI layer: Looker's LookML modelling layer was transformational. Rather than everyone writing SQL directly, analysts defined metrics, dimensions, and measures in LookML — and business users could explore data through the Looker UI without writing SQL. "What was the subscriber count by region last quarter?" became a click-through operation rather than a SQL exercise.
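As a sketch of what that modelling layer looks like (view, table, and field names here are hypothetical), a LookML view defines dimensions and measures once, and every exploration reuses them:

```lookml
# A LookML view over a dbt mart model (names are illustrative)
view: subscribers {
  sql_table_name: analytics.mart_subscribers ;;

  dimension: region {
    type: string
    sql: ${TABLE}.region ;;
  }

  measure: subscriber_count {
    type: count_distinct
    sql: ${TABLE}.subscriber_id ;;
  }
}
```

Because the `subscriber_count` measure is defined once, "subscriber count by region" means the same thing for everyone who queries it.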

The dbt Workflow

The workflow we established:

  1. Stitch syncs raw data to raw_* schemas in Redshift
  2. dbt staging models (stg_*) lightly clean raw data (rename columns, cast types, filter deleted records)
  3. dbt mart models (mart_*) join and aggregate staged data into analytics-ready facts and dimensions
  4. Looker explores define which mart models to query and how
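A staging model in this layout is deliberately boring — a light rename/cast/filter over one raw table. A sketch (source and column names are hypothetical):

```sql
-- models/staging/stg_orders.sql (hypothetical)
-- One-to-one with the raw Stitch-synced table:
-- rename columns, cast types, filter soft-deleted rows.
select
    id                              as order_id,
    user_id                         as customer_id,
    cast(amount as decimal(10, 2))  as total,
    created_at::timestamp           as ordered_at
from {{ source('raw', 'orders') }}
where deleted_at is null
```

The `source()` call assumes the raw schema is declared in a sources YAML file; like `ref()`, it makes the dependency visible to dbt's DAG.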

The staging/mart separation is important. Staging models are one-to-one with source tables — they do minimal transformation. Mart models are where business logic lives. When a business rule changes (how we count a "subscriber", for example), you change it in one mart model, and all Looker explorations consuming that model pick up the change automatically.
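For example, the "what counts as a subscriber" rule could live in a single mart model, so changing the rule is a one-file edit. A sketch with hypothetical names and an illustrative business rule:

```sql
-- models/marts/mart_subscribers.sql (hypothetical)
-- The business definition of "subscriber" lives here and only here:
-- an active, non-cancelled subscription.
select
    customer_id,
    region,
    min(subscribed_at) as first_subscribed_at
from {{ ref('stg_subscriptions') }}
where status = 'active'
  and cancelled_at is null
group by customer_id, region
```

If the definition later changes (say, trials start counting), only this model's `where` clause changes, and every downstream Looker explore inherits the new numbers.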

What We'd Do Differently

Data contracts from the start. We had no formal agreement with engineering about which operational database columns were "stable" vs implementation details. When engineers renamed columns or changed data types, Stitch syncs would break silently. Establish data contracts early.
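One lightweight form of data contract is dbt's schema tests on sources: declare the columns you depend on, and `dbt test` fails loudly when an upstream rename or dropped column breaks them. A sketch (schema and column names are hypothetical):

```yaml
# models/staging/sources.yml (hypothetical)
version: 2
sources:
  - name: raw
    schema: raw_mysql
    tables:
      - name: orders
        columns:
          - name: id
            tests: [unique, not_null]
          - name: user_id
            tests: [not_null]
```

This doesn't stop engineers renaming columns, but it turns a silent sync break into a failing test with a named culprit.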

Freshness monitoring. Data in the warehouse should have a known freshness — how old can the data be before it's unacceptably stale? We added Looker freshness alerts late; they should have been day-one requirements.
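dbt can also enforce freshness at the source level: declare how stale is acceptable, and the source freshness command warns or errors when data exceeds it. A sketch with illustrative thresholds — `_sdc_batched_at` is the load-timestamp column Stitch adds to synced tables:

```yaml
# Freshness thresholds on a Stitch-synced source (illustrative values)
sources:
  - name: raw
    loaded_at_field: _sdc_batched_at
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```

Running this check on a schedule would have given us warehouse-level staleness alerts from day one, independent of the Looker layer.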

Snowflake over Redshift. With hindsight, Snowflake's separation of storage and compute, automatic query optimisation, and zero-maintenance profile would have saved us meaningful operational effort.