What Makes an Application Data-Intensive?
A data-intensive application is one where data management (storing it, querying it, maintaining consistency under concurrent access and failure) is the primary engineering challenge. Contrast this with compute-intensive systems, where the bottleneck is parallelising large computations. As distributed cloud infrastructure has become accessible to startups, data-intensive challenges that once only large enterprises faced are now common at every scale.
The Standard Building Blocks
Most data-intensive applications are assembled from the same set of components. Understanding each component's purpose, and its limits, is what allows you to choose and combine them well.
- Databases: store data so it can be retrieved later (e.g. PostgreSQL, MySQL, MongoDB, DynamoDB)
- Caches: remember expensive results to speed up reads (e.g. Redis, Memcached, CDN edge caches)
- Search indexes: filter and search data by keyword or attribute (e.g. Elasticsearch, Solr, Typesense)
- Stream processing: handle events and data changes as they occur (e.g. Kafka, Kinesis, Flink, Pub/Sub)
- Batch processing: periodically crunch large amounts of accumulated data (e.g. Spark, Hadoop MapReduce, dbt)
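How two of these components combine can be sketched with the classic cache-aside read path. This is a toy illustration: plain dicts stand in for the database and the cache (hypothetical stand-ins for PostgreSQL and Redis).

```python
# Cache-aside: check the cache first, fall back to the database on a miss,
# then populate the cache so the next read is fast.
# The two dicts below are hypothetical stand-ins for a database and a cache.

database = {"user:1": {"name": "Ada", "plan": "pro"}}  # slow, durable store
cache = {}                                             # fast, volatile store

def get_user(key):
    if key in cache:            # cache hit: no database round-trip
        return cache[key]
    value = database[key]       # cache miss: read from the database
    cache[key] = value          # populate the cache for future reads
    return value

first = get_user("user:1")   # miss: served from the database
second = get_user("user:1")  # hit: served from the cache
```

The pattern's weakness, invalidation when the underlying record changes, is exactly the kind of derived-data consistency problem discussed later in the chapter.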
Operational vs Analytical Systems: OLTP and OLAP
Online Transaction Processing (OLTP) optimises for low-latency point queries and small writes: the access pattern of live user-facing services. Online Analytical Processing (OLAP) optimises for high-throughput scans with aggregation: the access pattern of business intelligence. The same SQL database can run both workloads at small scale, but as data volume and query complexity grow, the optimisations required are fundamentally incompatible: row-oriented storage and B-tree indexes serve OLTP; columnar storage and vectorised execution serve OLAP.
| Property | Operational (OLTP) | Analytical (OLAP) |
|---|---|---|
| Main read pattern | Point queries: fetch individual records by key | Aggregate over millions of records (sum, count, avg) |
| Main write pattern | Create / update / delete individual records | Bulk import (ETL) or event stream ingestion |
| Human user example | End user of a web or mobile application | Internal analyst running reports for management |
| Machine use example | Checking if an action is authorised | Detecting fraud or abuse patterns |
| Type of queries | Fixed set, predefined by application code | Analyst can write arbitrary SQL at runtime |
| Data represents | Latest state at a point in time | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
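The row-versus-column distinction behind this table can be shown in miniature: a row layout makes point lookups natural, while a column layout makes aggregation a single pass over one array. A toy sketch, not a real storage engine:

```python
# Row-oriented layout: one record per entry, natural for OLTP point queries.
rows = [
    {"id": 1, "region": "EU", "amount": 120},
    {"id": 2, "region": "US", "amount": 80},
    {"id": 3, "region": "EU", "amount": 200},
]

# Column-oriented layout: one array per attribute, natural for OLAP scans.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120, 80, 200],
}

# OLTP-style point query: fetch a single record by key.
order = next(r for r in rows if r["id"] == 2)

# OLAP-style aggregation: sum one column without touching the others.
total = sum(columns["amount"])
```

Real columnar engines add compression and vectorised execution on top, but the layout difference is the root of the incompatibility.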
Product Analytics / Real-Time Analytics
A third category sits between OLTP and OLAP: product analytics (also called real-time analytics). Systems like Pinot, Druid, and ClickHouse ingest data in real time and serve aggregate queries at low latency, enabling user-facing features like live dashboards, personalisation, and fraud scoring. Unlike traditional OLAP systems (which batch-ingest and optimise for throughput), these systems trade some throughput for sub-second query latency.
The Data Warehouse & ETL Pipeline
The data warehouse solves four problems with querying OLTP systems directly: data silos (data spread across many operational DBs), schema mismatch (OLTP schemas optimised for writes, not analytics), performance impact (expensive analytic scans slow down OLTP operations), and security (analysts shouldn't have direct access to production databases). The ETL pipeline extracts from operational sources, transforms into an analytics schema, and loads into a read-optimised warehouse. ELT flips the order: load the raw data first, then transform inside the warehouse using its compute.
Operational DBs (OLTP sources: Sales, Inventory, Geo…) → ETL Pipeline (Extract → Transform → Load; e.g. Fivetran, dbt, Spark) → Data Warehouse (OLAP target: Snowflake, BigQuery…)
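The Extract → Transform → Load steps can be sketched end to end in a few lines. This is a toy in-memory version with invented records; real pipelines use tools like Fivetran, dbt, or Spark:

```python
# Extract: pull raw records from an operational source (hypothetical sales DB).
def extract():
    return [
        {"order_id": 1, "sku": "A", "qty": 2, "unit_price": 10.0},
        {"order_id": 2, "sku": "A", "qty": 1, "unit_price": 10.0},
        {"order_id": 3, "sku": "B", "qty": 5, "unit_price": 3.0},
    ]

# Transform: reshape write-optimised order rows into an analytics-friendly
# schema -- here, a per-SKU revenue summary.
def transform(orders):
    revenue = {}
    for o in orders:
        revenue[o["sku"]] = revenue.get(o["sku"], 0.0) + o["qty"] * o["unit_price"]
    return [{"sku": sku, "revenue": total} for sku, total in sorted(revenue.items())]

# Load: write the transformed rows into the warehouse (a list stands in).
warehouse = []
def load(rows):
    warehouse.extend(rows)

load(transform(extract()))
```

In ELT the `transform` step would instead run inside the warehouse, typically as SQL over the raw loaded rows.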
From Data Warehouse to Data Lake
A data lake is a centralised repository of raw files (Parquet, Avro, JSON, images, sensor blobs) stored in object storage without a forced schema. Data scientists prefer this because ML feature engineering, NLP pipelines, and computer vision tasks are difficult to express in SQL and relational schemas. The data lake often sits between operational systems and the data warehouse: raw data lands there first (ELT), then selective portions are transformed and promoted into the warehouse. The "sushi principle": preserve the raw data; don't discard information in transformation.
- Data Warehouse: relational model, structured schema, SQL queries. Optimised for BI and reporting. Columnar storage (Parquet internally). Business analysts as primary users.
- Data Lake: schema-on-read; raw files (Avro/Parquet/JSON/blobs) in object storage. Cheap. Data scientists and ML engineers as primary users. No imposed data model.
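The lake-then-warehouse (ELT) flow can be sketched with the local filesystem standing in for object storage: raw JSON events land unmodified, and a later pass imposes a schema on a selected subset. File names and event fields are illustrative:

```python
import json
import tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp()) / "lake"   # stand-in for an S3 bucket
lake.mkdir(parents=True)

# Load first: raw events land in the lake exactly as produced
# (sushi principle: keep the raw data, discard nothing).
events = [
    {"user": "u1", "event": "click", "page": "/home"},
    {"user": "u2", "event": "purchase", "amount": 42.0},
    {"user": "u1", "event": "purchase", "amount": 10.0},
]
for i, e in enumerate(events):
    (lake / f"event-{i}.json").write_text(json.dumps(e))

# Transform later: a selective, schema-imposing pass promotes only what the
# warehouse needs -- here, purchase totals per user. The raw files remain.
purchases = {}
for path in sorted(lake.glob("*.json")):
    e = json.loads(path.read_text())
    if e["event"] == "purchase":
        purchases[e["user"]] = purchases.get(e["user"], 0.0) + e["amount"]
```

Because the raw files are never modified, a different downstream consumer (say, a clickstream model) can later read the same lake with a different schema-on-read.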
Systems of Record vs Derived Data
This distinction cuts through architectural ambiguity. The system of record holds normalised, canonical data, with each fact represented exactly once. Any discrepancy between derived and source data is resolved in favour of the source. Derived data systems (indexes, caches, search indexes, materialized views, data warehouses, trained models) are redundant by nature: they duplicate information in a form optimised for specific read patterns. They can be regenerated from the source. A database is neither inherently a system of record nor derived; it depends entirely on how it is used in the architecture.
System of record: where data first enters the system. Normalised: each fact stored once. Writes go here first. If any other system disagrees with this one, this one wins. Examples: the primary PostgreSQL database for an e-commerce app, the Kafka event log.
Derived data: read-optimised copies of the source. Redundant by design, and regeneratable if lost. Includes: search indexes (Elasticsearch), caches (Redis), data warehouses (Snowflake), materialized views, and ML models trained on the data. Each is optimised for a specific query pattern.
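The "regeneratable from the source" property can be demonstrated directly: build a derived index, lose it, and rebuild it losslessly from the system of record. A toy sketch; real systems reindex from the database or replay an event log:

```python
# System of record: canonical and normalised, each fact stored once.
source_of_truth = [
    {"id": 1, "title": "Intro to OLTP"},
    {"id": 2, "title": "Intro to OLAP"},
]

# Derived data: a keyword index optimised for one read pattern (search).
def build_index(records):
    index = {}
    for r in records:
        for word in r["title"].lower().split():
            index.setdefault(word, set()).add(r["id"])
    return index

index = build_index(source_of_truth)
hits = index["intro"]                 # fast keyword lookup over the derived copy

index = None                          # the derived copy is lost...
index = build_index(source_of_truth)  # ...and rebuilt from the source of truth
```

The reverse does not hold: losing the source of truth is unrecoverable, which is why writes go there first.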
Cloud vs Self-Hosting
The make-vs-buy decision for infrastructure is a spectrum: in-house (bespoke) → self-hosted off-the-shelf (MySQL on EC2) → cloud managed service (RDS) → cloud-native SaaS (Aurora Serverless, Snowflake). Core tension: cloud services reduce operational burden but remove control; self-hosting preserves control but requires staffing. The cost comparison is workload-dependent: stable, predictable loads often favour self-hosting at scale; variable, bursty workloads (especially analytics) strongly favour cloud elasticity.
For Cloud
- No upfront hardware cost: pay for what you use
- Elastic scaling: up during peak, down during idle
- Provider gains operational expertise from serving many customers
- Ideal for variable workloads (e.g. analytics queries with bursty compute)
- Faster to adopt unfamiliar systems: the vendor handles operations
For Self-Hosting
- Cheaper for predictable, stable workloads at scale
- Full control: tune, patch, and customise the system
- No vendor lock-in: switch freely between versions
- Access to server logs, OS metrics, full observability
- Required for latency-critical or compliance-sensitive workloads
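The "cheaper for stable workloads" claim is just arithmetic over utilisation. A back-of-the-envelope comparison, with all prices invented for illustration (not real vendor pricing):

```python
# Hypothetical monthly figures -- illustrative only, not real vendor pricing.
cloud_price_per_instance_hour = 0.50   # pay only while an instance runs
self_hosted_monthly_cost = 1200.0      # amortised hardware + ops, paid regardless
hours_per_month = 730

def cloud_cost(utilisation, instances=1):
    """Cloud spend scales with actual usage; idle time costs nothing."""
    return cloud_price_per_instance_hour * hours_per_month * utilisation * instances

# A bursty analytics workload busy 10% of the time strongly favours cloud:
bursty = cloud_cost(0.10)   # far below the fixed self-hosted cost

# For always-on workloads, find how many instances make cloud spend equal
# the self-hosted fixed cost; beyond that point, self-hosting is cheaper.
break_even_instances = self_hosted_monthly_cost / cloud_cost(1.0)
```

Under these made-up numbers the break-even is a little over three always-on instances; the real crossover depends entirely on your pricing, utilisation, and staffing costs.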
Cloud-Native Architecture: Disaggregated Storage and Compute
Cloud-native systems disaggregate storage and compute. Storage is object storage (S3, GCS, Azure Blob): durable, cheap, scale-independent. Compute is ephemeral instances: provision on demand, release when idle. This enables independent scaling of each dimension and eliminates the cost of idle compute. The trade-off is network I/O: every read from compute to storage crosses the network. Cloud-native databases mitigate this with columnar formats (Parquet), compression, and predicate pushdown. Local disks become caches, not primary storage.
| Category | Self-Hosted | Cloud-Native |
|---|---|---|
| Operational / OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora, Azure SQL DB Hyperscale, Google Cloud Spanner |
| Analytical / OLAP | Teradata, ClickHouse, Spark | Snowflake, Google BigQuery, Azure Synapse Analytics |
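The mitigation strategies above (columnar layout, predicate pushdown) amount to transferring only the bytes a query actually needs. A toy sketch of the idea, with a dict of column arrays standing in for Parquet files in remote object storage:

```python
# "Remote" object storage holds data column-by-column (a stand-in for
# Parquet files on S3); each array is one column of the same table.
remote_columns = {
    "region": ["EU", "US", "EU", "APAC"],
    "amount": [120, 80, 200, 50],
    "notes":  ["...", "...", "...", "..."],  # wide column most queries skip
}

def scan(needed_columns, predicate):
    """Read only the requested columns and filter before returning rows:
    column pruning plus predicate pushdown keep network transfer small."""
    cols = {name: remote_columns[name] for name in needed_columns}
    n = len(next(iter(cols.values())))
    rows = [{name: cols[name][i] for name in cols} for i in range(n)]
    return [row for row in rows if predicate(row)]

# Equivalent of: SELECT amount FROM sales WHERE region = 'EU'
# -- the "notes" column is never fetched across the network.
eu_rows = scan(["region", "amount"], lambda r: r["region"] == "EU")
eu_total = sum(r["amount"] for r in eu_rows)
```

Real engines push the predicate further still, skipping whole row groups using per-chunk min/max statistics before reading any values.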
Mental Models
Every data system design decision involves giving something up. Consistency costs availability; cloud convenience costs control; analytics performance costs storage overhead. The job is choosing the best available trade-off for your context.
OLTP systems serve individual records by key at low latency to end users. OLAP systems scan millions of records to aggregate statistics for analysts. Optimising one workload actively harms the other; that's why we separate them.
Operational databases are normalised for write performance; analytics queries need denormalised, aggregated, history-preserving schemas. ETL bridges the gap by extracting, transforming, and loading operational data into a warehouse optimised for OLAP.
A data warehouse imposes a relational model. A data lake stores raw files in any format (Avro, Parquet, JSON, images, sensor readings). It trades queryability for flexibility and cost β object storage is cheap; Parquet files can be read by Spark, pandas, or BigQuery alike.
The system of record holds the canonical truth, written once, normalised. Everything else (indexes, caches, materialized views, the data warehouse, the ML model) is derived. If derived data is lost, it can be rebuilt from the source of truth. This distinction clarifies where to put writes and what can be regenerated.
Running MySQL on an EC2 instance is self-hosted on cloud infrastructure. A cloud-native system (like Snowflake or Aurora Serverless) is designed from scratch to exploit disaggregated storage (S3), managed networking, and multi-tenant scheduling. The difference shows up in elasticity, failure recovery, and cost efficiency at scale.
In traditional systems, the database server owns its disk. Cloud-native analytics separate these: compute clusters (Redshift, BigQuery, Snowflake) query data that lives in object storage. Scale compute independently of storage; pay for query time, not idle disk. The cost is network I/O, but fast networks make this tractable.
A single database can handle small data volumes comfortably. As workloads grow, specialised systems (columnar stores for analytics, LSM-tree databases for high write throughput, time-series databases for sensor data) outperform general-purpose solutions. Knowing when to specialise is a core data systems skill.
Part of the DDIA series. These notes distil Designing Data-Intensive Applications (2nd ed.) by Martin Kleppmann, the definitive guide to the trade-offs behind databases, messaging systems, and distributed architectures. Chapter 1 establishes the conceptual vocabulary used throughout the entire book.