What Makes an Application Data-Intensive?
A data-intensive application is one where data management (storing it, querying it, maintaining consistency under concurrent access and failure) is the primary engineering challenge. Contrast this with compute-intensive systems, where the bottleneck is parallelising large computations. As distributed cloud infrastructure has become accessible to startups, data-intensive challenges that once only large enterprises faced are now common at every scale.
The Standard Building Blocks
Most data-intensive applications are assembled from the same set of components. Understanding each component's purpose, and its limits, is what allows you to choose and combine them well.
- Databases: store data so it can be retrieved later (e.g. PostgreSQL, MySQL, MongoDB, DynamoDB)
- Caches: remember expensive results to speed up reads (e.g. Redis, Memcached, CDN edge caches)
- Search indexes: filter and search data by keyword or attribute (e.g. Elasticsearch, Solr, Typesense)
- Stream processing: handle events and data changes as they occur (e.g. Kafka, Kinesis, Flink, Pub/Sub)
- Batch processing: periodically crunch large amounts of accumulated data (e.g. Spark, Hadoop MapReduce, dbt)
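How two of these components combine can be sketched with the classic cache-aside read path. This is a toy illustration: plain dicts stand in for the database and the cache (hypothetical stand-ins for PostgreSQL and Redis).

```python
# Cache-aside: check the cache first, fall back to the database on a miss,
# then populate the cache so the next read is fast.
# The two dicts below are hypothetical stand-ins for a database and a cache.

database = {"user:1": {"name": "Ada", "plan": "pro"}}  # slow, durable store
cache = {}                                             # fast, volatile store

def get_user(key):
    if key in cache:            # cache hit: no database round-trip
        return cache[key]
    value = database[key]       # cache miss: read from the database
    cache[key] = value          # populate the cache for future reads
    return value

first = get_user("user:1")   # miss: served from the database
second = get_user("user:1")  # hit: served from the cache
```

The pattern's weakness, invalidation when the underlying record changes, is exactly the kind of derived-data consistency problem discussed later in the chapter.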
Operational vs Analytical Systems: OLTP and OLAP
Online Transaction Processing (OLTP) optimises for low-latency point queries and small writes: the access pattern of live user-facing services. Online Analytical Processing (OLAP) optimises for high-throughput scans with aggregation: the access pattern of business intelligence. The same SQL database can run both workloads at small scale, but as data volume and query complexity grow, the optimisations required are fundamentally incompatible: row-oriented storage and B-tree indexes serve OLTP; columnar storage and vectorised execution serve OLAP.
| Property | Operational (OLTP) | Analytical (OLAP) |
|---|---|---|
| Main read pattern | Point queries: fetch individual records by key | Aggregate over millions of records (sum, count, avg) |
| Main write pattern | Create / update / delete individual records | Bulk import (ETL) or event stream ingestion |
| Human user example | End user of a web or mobile application | Internal analyst running reports for management |
| Machine use example | Checking if an action is authorised | Detecting fraud or abuse patterns |
| Type of queries | Fixed set, predefined by application code | Analyst can write arbitrary SQL at runtime |
| Data represents | Latest state at a point in time | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
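The row-versus-column distinction behind this table can be shown in miniature: a row layout makes point lookups natural, while a column layout makes aggregation a single pass over one array. A toy sketch, not a real storage engine:

```python
# Row-oriented layout: one record per entry, natural for OLTP point queries.
rows = [
    {"id": 1, "region": "EU", "amount": 120},
    {"id": 2, "region": "US", "amount": 80},
    {"id": 3, "region": "EU", "amount": 200},
]

# Column-oriented layout: one array per attribute, natural for OLAP scans.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120, 80, 200],
}

# OLTP-style point query: fetch a single record by key.
order = next(r for r in rows if r["id"] == 2)

# OLAP-style aggregation: sum one column without touching the others.
total = sum(columns["amount"])
```

Real columnar engines add compression and vectorised execution on top, but the layout difference is the root of the incompatibility.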
Product Analytics / Real-Time Analytics
A third category sits between OLTP and OLAP: product analytics (also called real-time analytics). Systems like Pinot, Druid, and ClickHouse ingest data in real time and serve aggregate queries at low latency, enabling user-facing features like live dashboards, personalisation, and fraud scoring. Unlike traditional OLAP systems (which batch-ingest and optimise for throughput), these systems trade some throughput for sub-second query latency.
The Data Warehouse & ETL Pipeline
The data warehouse solves four problems with querying OLTP systems directly: data silos (data spread across many operational DBs), schema mismatch (OLTP schemas optimised for writes, not analytics), performance impact (expensive analytic scans slow down OLTP operations), and security (analysts shouldn't have direct access to production databases). The ETL pipeline extracts from operational sources, transforms into an analytics schema, and loads into a read-optimised warehouse. ELT flips the order: load the raw data first, then transform inside the warehouse using its compute.
Operational DBs (OLTP sources: Sales, Inventory, Geo…) → ETL Pipeline (Extract → Transform → Load; e.g. Fivetran, dbt, Spark) → Data Warehouse (OLAP target: Snowflake, BigQuery…)
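The Extract → Transform → Load steps can be sketched end to end in a few lines. This is a toy in-memory version with invented records; real pipelines use tools like Fivetran, dbt, or Spark:

```python
# Extract: pull raw records from an operational source (hypothetical sales DB).
def extract():
    return [
        {"order_id": 1, "sku": "A", "qty": 2, "unit_price": 10.0},
        {"order_id": 2, "sku": "A", "qty": 1, "unit_price": 10.0},
        {"order_id": 3, "sku": "B", "qty": 5, "unit_price": 3.0},
    ]

# Transform: reshape write-optimised order rows into an analytics-friendly
# schema -- here, a per-SKU revenue summary.
def transform(orders):
    revenue = {}
    for o in orders:
        revenue[o["sku"]] = revenue.get(o["sku"], 0.0) + o["qty"] * o["unit_price"]
    return [{"sku": sku, "revenue": total} for sku, total in sorted(revenue.items())]

# Load: write the transformed rows into the warehouse (a list stands in).
warehouse = []
def load(rows):
    warehouse.extend(rows)

load(transform(extract()))
```

In ELT the `transform` step would instead run inside the warehouse, typically as SQL over the raw loaded rows.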
From Data Warehouse to Data Lake
A data lake is a centralised repository of raw files (Parquet, Avro, JSON, images, sensor blobs) stored in object storage without a forced schema. Data scientists prefer this because ML feature engineering, NLP pipelines, and computer vision tasks are difficult to express in SQL and relational schemas. The data lake often sits between operational systems and the data warehouse: raw data lands there first (ELT), then selective portions are transformed and promoted into the warehouse. The "sushi principle": preserve the raw data; don't discard information in transformation.
- Data Warehouse: relational model, structured schema, SQL queries. Optimised for BI and reporting. Columnar storage (Parquet internally). Business analysts as primary users.
- Data Lake: schema-on-read; raw files (Avro/Parquet/JSON/blobs) in object storage. Cheap. Data scientists and ML engineers as primary users. No imposed data model.
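The lake-then-warehouse (ELT) flow can be sketched with the local filesystem standing in for object storage: raw JSON events land unmodified, and a later pass imposes a schema on a selected subset. File names and event fields are illustrative:

```python
import json
import tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp()) / "lake"   # stand-in for an S3 bucket
lake.mkdir(parents=True)

# Load first: raw events land in the lake exactly as produced
# (sushi principle: keep the raw data, discard nothing).
events = [
    {"user": "u1", "event": "click", "page": "/home"},
    {"user": "u2", "event": "purchase", "amount": 42.0},
    {"user": "u1", "event": "purchase", "amount": 10.0},
]
for i, e in enumerate(events):
    (lake / f"event-{i}.json").write_text(json.dumps(e))

# Transform later: a selective, schema-imposing pass promotes only what the
# warehouse needs -- here, purchase totals per user. The raw files remain.
purchases = {}
for path in sorted(lake.glob("*.json")):
    e = json.loads(path.read_text())
    if e["event"] == "purchase":
        purchases[e["user"]] = purchases.get(e["user"], 0.0) + e["amount"]
```

Because the raw files are never modified, a different downstream consumer (say, a clickstream model) can later read the same lake with a different schema-on-read.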
Systems of Record vs Derived Data
This distinction cuts through architectural ambiguity. The system of record holds normalised, canonical data, with each fact represented exactly once. Any discrepancy between derived and source data is resolved in favour of the source. Derived data systems (indexes, caches, search indexes, materialized views, data warehouses, trained models) are redundant by nature: they duplicate information in a form optimised for specific read patterns. They can be regenerated from the source. A database is neither inherently a system of record nor derived; it depends entirely on how it is used in the architecture.
System of record: where data first enters the system. Normalised: each fact stored once. Writes go here first. If any other system disagrees with this one, this one wins. Examples: the primary PostgreSQL database for an e-commerce app, the Kafka event log.
Derived data: read-optimised copies of the source. Redundant by design, and regeneratable if lost. Includes: search indexes (Elasticsearch), caches (Redis), data warehouses (Snowflake), materialized views, and ML models trained on the data. Each is optimised for a specific query pattern.
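The "regeneratable from the source" property can be demonstrated directly: build a derived index, lose it, and rebuild it losslessly from the system of record. A toy sketch; real systems reindex from the database or replay an event log:

```python
# System of record: canonical and normalised, each fact stored once.
source_of_truth = [
    {"id": 1, "title": "Intro to OLTP"},
    {"id": 2, "title": "Intro to OLAP"},
]

# Derived data: a keyword index optimised for one read pattern (search).
def build_index(records):
    index = {}
    for r in records:
        for word in r["title"].lower().split():
            index.setdefault(word, set()).add(r["id"])
    return index

index = build_index(source_of_truth)
hits = index["intro"]                 # fast keyword lookup over the derived copy

index = None                          # the derived copy is lost...
index = build_index(source_of_truth)  # ...and rebuilt from the source of truth
```

The reverse does not hold: losing the source of truth is unrecoverable, which is why writes go there first.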
Cloud vs Self-Hosting
The make-vs-buy decision for infrastructure is a spectrum: in-house (bespoke) → self-hosted off-the-shelf (MySQL on EC2) → cloud managed service (RDS) → cloud-native SaaS (Aurora Serverless, Snowflake). Core tension: cloud services reduce operational burden but remove control; self-hosting preserves control but requires staffing. The cost comparison is workload-dependent: stable, predictable loads often favour self-hosting at scale; variable, bursty workloads (especially analytics) strongly favour cloud elasticity.
For Cloud
- No upfront hardware cost: pay for what you use
- Elastic scaling: up during peak, down during idle
- Provider gains operational expertise from serving many customers
- Ideal for variable workloads (e.g. analytics queries with bursty compute)
- Faster to adopt unfamiliar systems: the vendor handles operations
For Self-Hosting
- Cheaper for predictable, stable workloads at scale
- Full control: tune, patch, and customise the system
- No vendor lock-in: switch freely between versions
- Access to server logs, OS metrics, full observability
- Required for latency-critical or compliance-sensitive workloads
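The "cheaper for stable workloads" claim is just arithmetic over utilisation. A back-of-the-envelope comparison, with all prices invented for illustration (not real vendor pricing):

```python
# Hypothetical monthly figures -- illustrative only, not real vendor pricing.
cloud_price_per_instance_hour = 0.50   # pay only while an instance runs
self_hosted_monthly_cost = 1200.0      # amortised hardware + ops, paid regardless
hours_per_month = 730

def cloud_cost(utilisation, instances=1):
    """Cloud spend scales with actual usage; idle time costs nothing."""
    return cloud_price_per_instance_hour * hours_per_month * utilisation * instances

# A bursty analytics workload busy 10% of the time strongly favours cloud:
bursty = cloud_cost(0.10)   # far below the fixed self-hosted cost

# For always-on workloads, find how many instances make cloud spend equal
# the self-hosted fixed cost; beyond that point, self-hosting is cheaper.
break_even_instances = self_hosted_monthly_cost / cloud_cost(1.0)
```

Under these made-up numbers the break-even is a little over three always-on instances; the real crossover depends entirely on your pricing, utilisation, and staffing costs.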
Cloud-Native Architecture: Disaggregated Storage and Compute
Cloud-native systems disaggregate storage and compute. Storage is object storage (S3, GCS, Azure Blob): durable, cheap, scale-independent. Compute is ephemeral instances: provision on demand, release when idle. This enables independent scaling of each dimension and eliminates the cost of idle compute. The trade-off is network I/O: every read from compute to storage crosses the network. Cloud-native databases mitigate this with columnar formats (Parquet), compression, and predicate pushdown. Local disks become caches, not primary storage.
| Category | Self-Hosted | Cloud-Native |
|---|---|---|
| Operational / OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora, Azure SQL DB Hyperscale, Google Cloud Spanner |
| Analytical / OLAP | Teradata, ClickHouse, Spark | Snowflake, Google BigQuery, Azure Synapse Analytics |
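The mitigation strategies above (columnar layout, predicate pushdown) amount to transferring only the bytes a query actually needs. A toy sketch of the idea, with a dict of column arrays standing in for Parquet files in remote object storage:

```python
# "Remote" object storage holds data column-by-column (a stand-in for
# Parquet files on S3); each array is one column of the same table.
remote_columns = {
    "region": ["EU", "US", "EU", "APAC"],
    "amount": [120, 80, 200, 50],
    "notes":  ["...", "...", "...", "..."],  # wide column most queries skip
}

def scan(needed_columns, predicate):
    """Read only the requested columns and filter before returning rows:
    column pruning plus predicate pushdown keep network transfer small."""
    cols = {name: remote_columns[name] for name in needed_columns}
    n = len(next(iter(cols.values())))
    rows = [{name: cols[name][i] for name in cols} for i in range(n)]
    return [row for row in rows if predicate(row)]

# Equivalent of: SELECT amount FROM sales WHERE region = 'EU'
# -- the "notes" column is never fetched across the network.
eu_rows = scan(["region", "amount"], lambda r: r["region"] == "EU")
eu_total = sum(r["amount"] for r in eu_rows)
```

Real engines push the predicate further still, skipping whole row groups using per-chunk min/max statistics before reading any values.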
Mental Models
Every data system design decision involves giving something up. Consistency costs availability; cloud convenience costs control; analytics performance costs storage overhead. The job is choosing the best available trade-off for your context.
OLTP systems serve individual records by key at low latency to end users. OLAP systems scan millions of records to aggregate statistics for analysts. Optimising one workload actively harms the other; that's why we separate them.
Operational databases are normalised for write performance; analytics queries need denormalised, aggregated, history-preserving schemas. ETL bridges the gap by extracting, transforming, and loading operational data into a warehouse optimised for OLAP.
A data warehouse imposes a relational model. A data lake stores raw files in any format (Avro, Parquet, JSON, images, sensor readings). It trades queryability for flexibility and cost β object storage is cheap; Parquet files can be read by Spark, pandas, or BigQuery alike.
The system of record holds the canonical truth, written once, normalised. Everything else (indexes, caches, materialized views, the data warehouse, the ML model) is derived. If derived data is lost, it can be rebuilt from the source of truth. This distinction clarifies where to put writes and what can be regenerated.
Running MySQL on an EC2 instance is self-hosted on cloud infrastructure. A cloud-native system (like Snowflake or Aurora Serverless) is designed from scratch to exploit disaggregated storage (S3), managed networking, and multi-tenant scheduling. The difference shows up in elasticity, failure recovery, and cost efficiency at scale.
In traditional systems, the database server owns its disk. Cloud-native analytics separate these: compute clusters (Redshift, BigQuery, Snowflake) query data that lives in object storage. Scale compute independently of storage; pay for query time, not idle disk. The cost is network I/O, but fast networks make this tractable.
A single database can handle small data volumes comfortably. As workloads grow, specialised systems (columnar stores for analytics, LSM-tree databases for high write throughput, time-series databases for sensor data) outperform general-purpose solutions. Knowing when to specialise is a core data systems skill.
Part of the DDIA series. These notes distil Designing Data-Intensive Applications (2nd ed.) by Martin Kleppmann, the definitive guide to the trade-offs behind databases, messaging systems, and distributed architectures. Chapter 1 establishes the conceptual vocabulary used throughout the entire book.