DDIA · Ch.7 · Data & Systems · 24 min read

Sharding: Splitting a Dataset Across Many Machines

Chapter 7 of DDIA 2nd ed. distilled: sharding as the horizontal scaling tool of last resort; the four sharding algorithms (key-range, hash/fixed-shards, hash-range, consistent hashing); skewed workloads and hot-key mitigation; automatic vs. manual rebalancing trade-offs; the three request routing approaches (any-node, routing tier, client-side); and the two secondary index strategies (local/document-partitioned vs. global/term-partitioned) with their read/write cost trade-offs.

Replication vs. Sharding: Two Different Problems

Distributed databases spread data across nodes in two complementary ways. Replication (Ch.6) keeps copies of the same data on multiple nodes, solving availability and read scalability. Sharding (this chapter) splits the dataset itself into smaller subsets called shards (also called partitions, ranges, tablets, vnodes, or token ranges, depending on the database), solving write scalability and storage capacity.

The two are usually combined: each shard is replicated across multiple nodes for fault tolerance, and each node may act as the leader for some shards and a follower for others. But the choice of sharding scheme is mostly independent of the replication scheme.

Sharding is a last resort. A single modern machine can handle a very large workload. Sharding adds permanent operational complexity: you must choose a partition key (hard to change later), handle cross-shard operations (expensive), and manage rebalancing. Use replication for read throughput and availability. Reach for sharding only when data volume or write throughput genuinely exceeds single-node capacity.

The Goal: Avoiding Skew and Hot Spots

The entire point of sharding is to spread data and query load evenly across nodes. If the distribution is uneven, with some shards holding much more data or receiving much more traffic than others, the distribution is called skewed. In the extreme case, all load ends up on a single shard, leaving the other nodes idle. A shard with disproportionately high load is called a hot shard or hot spot; a single key with very high traffic is called a hot key.

The sharding algorithm determines the partition key of a record and maps it to a shard. A good algorithm distributes both data and requests evenly, and must support rebalancing (redistributing shards as nodes are added or removed) without moving more data than necessary.

Sharding for Multitenancy

Beyond scalability, sharding is commonly used in SaaS products to isolate tenants (customers). Each tenant gets their own shard or set of shards. This provides resource isolation (one tenant's expensive query doesn't affect others), permission isolation (access-control bugs are contained), cell-based architecture (fault isolation between tenant groups), per-tenant backup/restore, regulatory compliance (GDPR right-to-erasure maps to a simple shard delete), data residency (a tenant's shard can be pinned to a specific region), and gradual schema rollout (migrate one tenant at a time).

The main challenges: if a single tenant grows too large for one node, you need to shard within that tenant; and if you have many small tenants, creating a separate shard for each incurs too much overhead, so you'll need to group small tenants together, which makes moving them as they grow more complex.

Four Sharding Algorithms

| Scheme | How it works | Range scans | Hot spots | Used in |
|---|---|---|---|---|
| Key-range sharding | Assign contiguous ranges of partition keys to each shard (like encyclopedia volumes A–Ba, Bb–Ca, …) | ✓ Efficient: all records in a range are co-located | ❌ Risk: writes to sequential keys (e.g. timestamps) all hit the same shard | Bigtable, HBase, CockroachDB, RethinkDB, MongoDB, YugabyteDB, FoundationDB |
| Hash sharding (fixed shards) | Hash the partition key → assign the hash to one of a fixed number of pre-created shards; move whole shards between nodes | ❌ Inefficient: range queries must hit all shards (scatter/gather) | ✓ Distributes load uniformly across nodes | Citus (PostgreSQL), Riak, Elasticsearch, Couchbase |
| Hash-range sharding | Hash the key → assign ranges of hash values to shards (not individual hashes); shards split when too large | ~ Efficient within one partition key; inefficient across partition keys | ✓ Uniform distribution via hashing | YugabyteDB, DynamoDB, MongoDB (optional), Cassandra/ScyllaDB variant |
| Consistent hashing | Map both keys and nodes onto a ring; each key is owned by the next node clockwise. Variants: highest random weight (rendezvous), jump consistent hash | ❌ No meaningful range ordering | ✓ Minimal key movement when nodes are added/removed | Cassandra/ScyllaDB (variant), CDNs, some distributed caches |

Key-Range Sharding

Key-range sharding assigns a contiguous range of partition keys to each shard, like volumes of a paper encyclopedia (A–Bk, Bl–Cz, …). The boundaries between ranges are chosen to distribute data evenly, not alphabetically. Within each shard, keys are stored in sorted order (in a B-tree or SSTable), making range scans efficient.

The classic trap: if you use a timestamp as the partition key for time-series data, all current writes go to the "this month" shard while the others sit idle. Fix this by prepending the sensor or device ID to the key, giving (sensor_id, timestamp): writes spread across sensors, and range queries over time require one query per sensor but stay efficient.
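
A minimal sketch of this composite-key layout, with an in-memory sorted list standing in for a shard's sorted storage (sensor names and timestamps are illustrative):

```python
import bisect

# Sorted storage within one shard, keyed by (sensor_id, timestamp).
records = sorted([
    ("sensor-a", 1700000000), ("sensor-a", 1700000060),
    ("sensor-b", 1700000000), ("sensor-b", 1700000060),
])

def scan_sensor(sensor_id, t_start, t_end):
    """Contiguous range scan: keys sort by (sensor_id, timestamp)."""
    lo = bisect.bisect_left(records, (sensor_id, t_start))
    hi = bisect.bisect_right(records, (sensor_id, t_end))
    return records[lo:hi]

# A time-range query now needs one scan per sensor, but each scan is a
# cheap contiguous read instead of hammering a single "current time" shard.
for sensor in ("sensor-a", "sensor-b"):
    print(scan_sensor(sensor, 1700000000, 1700000060))
```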

Rebalancing: a shard split is triggered when a shard exceeds a configured size (HBase default: 10 GB) or a write-throughput threshold. The shard is split into two sub-ranges, each of which can be moved to a different node. This is expensive: it requires rewriting all the shard's data into new files, similar to a compaction, and splitting a heavily loaded shard can exacerbate the load it is already under.
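
A toy illustration of a median split, assuming the shard's keys can be enumerated in sorted order (real systems derive the split point from file metadata rather than sorting every key):

```python
def split_shard(shard_keys, lower, upper):
    """Split the range [lower, upper) at the shard's median key."""
    keys = sorted(shard_keys)
    midpoint = keys[len(keys) // 2]          # new boundary between the halves
    return (lower, midpoint), (midpoint, upper)

left, right = split_shard(["apple", "berry", "cherry", "damson"], "a", "e")
print(left, right)   # ('a', 'cherry') ('cherry', 'e')
```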

Hash Sharding with Fixed Shard Count

Hash sharding applies a hash function to the partition key and uses the hash value to assign the key to a shard. A good hash function distributes keys uniformly regardless of the structure of the input: timestamps that are very close together will hash to completely different values.

Don't use hash(key) % N. When the number of nodes N changes, almost every key has to move to a different node, causing a massive, disruptive data migration. Instead, create many more shards than nodes upfront (e.g., 1,000 shards for 10 nodes = 100 shards per node). Keys are assigned to shards by hash(key) % 1000; the system separately tracks which shard lives on which node. When a node is added, only whole shards are migrated: no key changes its shard, the shard just changes its host node.
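
A sketch of the fixed-shard scheme under these assumptions, using MD5 as an arbitrary stand-in hash and an illustrative placement rule:

```python
import hashlib

NUM_SHARDS = 1000   # fixed when the cluster is created; never changes

def shard_of(key: str) -> int:
    """hash(key) mod the FIXED shard count, so a key's shard never changes."""
    digest = hashlib.md5(key.encode()).digest()   # MD5 as an arbitrary stand-in
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The shard -> node placement is tracked separately (e.g. in a coordination
# service) and is the only thing that changes during rebalancing.
placement = {shard: f"node-{shard % 10}" for shard in range(NUM_SHARDS)}

def node_for(key: str) -> str:
    return placement[shard_of(key)]

print(shard_of("user:42"), node_for("user:42"))

# Add an 11th node: migrate some whole shards onto it. No key ever changes
# shard; only shards change their host node.
for shard in range(0, NUM_SHARDS, 11):
    placement[shard] = "node-10"
print(shard_of("user:42"), node_for("user:42"))
```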

Hash-Range Sharding

The practical best-of-both-worlds for many production systems. Rather than assigning each key directly to a shard by hash, the entire hash space (e.g., 0 to 2³² - 1) is divided into contiguous ranges of hash values, and each shard owns one range. This combines the uniform distribution of hashing with the adaptive splitting of key-range sharding: a shard can be split when it gets too large or too hot, adding a new boundary mid-range without affecting other shards. DynamoDB and YugabyteDB use this approach. Cassandra and ScyllaDB use a variant where each node is assigned several hash ranges with random boundaries, smoothing out imbalances.
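
A sketch of hash-range lookup and splitting, assuming a 32-bit hash space and made-up range boundaries:

```python
import bisect, hashlib

HASH_SPACE = 2**32

def h(key: str) -> int:
    """32-bit hash of the key (MD5 prefix as an arbitrary stand-in)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

# Shard i owns the hash range [boundaries[i], boundaries[i+1]); the last
# shard runs to the top of the hash space. Boundaries are illustrative.
boundaries = [0, HASH_SPACE // 4, HASH_SPACE // 2, 3 * HASH_SPACE // 4]

def shard_of(key: str) -> int:
    return bisect.bisect_right(boundaries, h(key)) - 1

print(shard_of("user:42"))

# Splitting a hot shard inserts ONE new boundary mid-range; every other
# shard's key assignment is untouched.
hot = shard_of("user:42")
upper = boundaries[hot + 1] if hot + 1 < len(boundaries) else HASH_SPACE
bisect.insort(boundaries, (boundaries[hot] + upper) // 2)
print(boundaries)
```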

Consistent Hashing

A consistent hashing algorithm maps both keys and nodes onto a conceptual ring; each key is owned by the nearest node clockwise. The defining property: when a node is added or removed, only ~1/N of keys move (where N is the number of nodes), versus ~(N-1)/N for naive modulo hashing. Several variants exist, such as highest random weight (rendezvous hashing) and jump consistent hash, each with different performance characteristics. Cassandra's ring-based sharding is related but not identical: it splits existing ranges rather than assigning individual keys to the new node.
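
A minimal ring sketch, with the common refinement of many virtual points per node to smooth the distribution; node and key names are illustrative:

```python
import bisect, hashlib

def h(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class Ring:
    """Keys and nodes hash onto one circle; a key belongs to the first
    node point at or after the key's position, wrapping around."""
    def __init__(self, nodes, vnodes=100):
        # Many virtual points per node smooth out the distribution.
        self.points = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]

ring3 = Ring(["node-a", "node-b", "node-c"])
ring4 = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(10_000)]
moved = sum(ring3.node_for(k) != ring4.node_for(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # ~25%, vs ~75% under hash % N
```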

Note: "consistent" here refers to the key-to-shard assignment staying stable as the cluster changes. It has nothing to do with replica consistency (Ch.6) or ACID consistency (Ch.8).

Skewed Workloads and Hot-Key Mitigation

Even with a good hash function, if some keys receive far more traffic than others (a celebrity's user ID on a social network, a viral post ID), you still get a hot shard. The hash distributes different keys evenly, but not different access frequencies.

| Technique | How it works | Trade-off | Best for |
|---|---|---|---|
| Key prefix/suffix randomisation | Append a 2-digit random number to the hot key, splitting it into 100 virtual keys distributed across shards | Reads must now query all 100 virtual keys and aggregate results; bookkeeping is needed to track which keys are "split" hot keys | Known hot keys (celebrity posts, viral content) with read-heavy or write-heavy load |
| Composite key prefix | Prefix the hot attribute (e.g. timestamp) with another attribute (e.g. sensor_id) so the sort key is (sensor_id, timestamp) instead of (timestamp) | Range scans over time now require one query per sensor_id instead of a single range query | IoT/time-series data where the "natural" key ordering creates hot write shards |
| Dedicated hot-key shard | Place the hot key in a shard by itself, potentially on a dedicated high-capacity node | Manual management; requires monitoring to detect hot keys dynamically | A small number of predictably hot keys (e.g. a platform's own system account) |
| Heat management / adaptive capacity | The system automatically detects hot shards and splits them, migrates them to less-loaded nodes, or allocates more resources | Expensive operation; splitting under high load can cascade into overload | Cloud-native databases with built-in auto-scaling (DynamoDB adaptive capacity) |

Hot-key management is harder than it looks because load patterns change over time: a post that goes viral today will cool down in two days, and the optimal strategy for a hot write key may differ from that for a hot read key. Cloud databases like DynamoDB address this with adaptive capacity, automatically detecting hot shards and provisioning more resources for them.
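
A sketch of the first technique, key suffix randomisation, with a plain dict standing in for the sharded store; the two-digit `#NN` suffix format is illustrative:

```python
import random

SPLIT_FACTOR = 100   # the hot key fans out into 100 virtual keys

def write_hot(store: dict, key: str, value) -> None:
    """Each write picks one random virtual key, spreading load across shards."""
    suffix = random.randrange(SPLIT_FACTOR)
    store.setdefault(f"{key}#{suffix:02d}", []).append(value)

def read_hot(store: dict, key: str) -> list:
    """Reads pay the price: gather ALL virtual keys and merge the results."""
    merged = []
    for suffix in range(SPLIT_FACTOR):
        merged.extend(store.get(f"{key}#{suffix:02d}", []))
    return merged

store = {}   # stands in for the sharded key-value store
for i in range(1000):
    write_hot(store, "post:viral", f"like-{i}")
print(len(read_hot(store, "post:viral")))   # 1000, gathered from up to 100 keys
```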

Rebalancing: Automatic or Manual?

When nodes are added or removed, shards must be redistributed; this is rebalancing. It involves moving data across the network, which is expensive and can impact production traffic. The key question: should rebalancing happen automatically or require human approval?

Fully automatic rebalancing is convenient for routine maintenance and enables auto-scaling (DynamoDB can add or remove capacity in minutes). But it can be dangerous in combination with automatic failure detection: if a node is overloaded and responding slowly, other nodes may conclude it has failed and trigger a rebalance that moves shards away from it. The data movement adds even more load to the network, potentially causing a cascading failure in which other nodes also become overloaded and are also incorrectly assumed to have failed.

Manual rebalancing (or human-in-the-loop) is slower but more predictable. Couchbase and Riak suggest a rebalancing plan automatically but require an administrator to commit it. Many teams also rebalance manually and pre-emptively before known traffic spikes (Cyber Monday, a major sporting event) to avoid triggering an expensive automatic rebalance under maximum load.

Request Routing: Getting to the Right Node

Once data is sharded, every request must be routed to the node that holds the relevant shard. This is the request routing problem (also called service discovery for stateful data). It has an extra difficulty compared to stateless service discovery: a request for a key can only be handled by a node that holds a replica of the shard containing that key; you can't send it to just any node.

01 · Any-node routing (gossip)

The client contacts any node via a round-robin load balancer. If that node owns the shard, it handles the request directly; otherwise it forwards the request to the correct node and relays the response. Each node must know the full shard-to-node mapping.

Consistency: Weaker. Gossip-based propagation means nodes may briefly disagree.

Used in: Riak (gossip protocol)
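
A toy model of this forwarding behaviour; the placement rule and class names are made up for illustration:

```python
class Node:
    def __init__(self, name, cluster):
        self.name, self.cluster, self.data = name, cluster, {}

    def handle_get(self, key):
        owner = self.cluster.owner_of(key)   # every node knows the full mapping
        if owner is self:
            return self.data.get(key)        # we own the shard: answer directly
        return owner.handle_get(key)         # otherwise forward, then relay

class Cluster:
    def __init__(self, names):
        self.nodes = [Node(n, self) for n in names]

    def owner_of(self, key):
        return self.nodes[hash(key) % len(self.nodes)]   # toy placement rule

cluster = Cluster(["a", "b", "c"])
cluster.owner_of("user:42").data["user:42"] = "Ada"
print(cluster.nodes[0].handle_get("user:42"))   # any node can take the request
```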

02 · Routing tier (dedicated coordinator)

All requests go through a shard-aware routing tier first. The tier determines which node to forward to, but does not process the request itself. Clients only need to know the routing tier address.

Consistency: Strong. The routing tier subscribes to a coordination service (ZooKeeper/etcd) for authoritative shard assignments.

Used in: HBase (ZooKeeper), SolrCloud (ZooKeeper), MongoDB (mongos + config server)

03 · Client-side routing (shard-aware driver)

Clients maintain a local cache of the shard assignment and connect directly to the correct node. The client library subscribes to metadata changes.

Consistency: Strong if kept in sync. The client fetches a fresh mapping from the coordination service.

Used in: Kafka (clients), YugabyteDB, TiDB, ScyllaDB (Raft-based coordination built in)

All routing approaches face the same core metadata problem: who decides which shard lives on which node, and how does every participant learn about changes? The production answer is almost always a coordination service. ZooKeeper and etcd maintain the authoritative shard-to-node mapping using a consensus algorithm (Raft or Paxos), providing fault tolerance against split-brain. Nodes register themselves in ZooKeeper; routing tiers and shard-aware clients subscribe for updates. Modern databases such as Kafka, YugabyteDB, TiDB, and ScyllaDB embed Raft directly, eliminating the external ZooKeeper dependency.
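
A sketch of how a shard-aware client might cache the mapping and recover when it goes stale; the metadata service and error type here are stand-ins for a ZooKeeper/etcd-style service, not any real API:

```python
import zlib

class WrongShardError(Exception):
    """Raised by a node that no longer hosts the requested shard."""

def shard_of(key: str, n: int) -> int:
    return zlib.crc32(key.encode()) % n   # deterministic toy shard function

class FakeNode:
    def __init__(self):
        self.shards = {}                        # shard id -> {key: value}
    def read(self, shard, key):
        if shard not in self.shards:
            raise WrongShardError               # the shard has moved elsewhere
        return self.shards[shard].get(key)

class FakeMetadataService:
    """Stands in for ZooKeeper/etcd: holds the authoritative shard -> node map."""
    def __init__(self, shard_map):
        self.shard_map = shard_map
    def fetch_shard_map(self):
        return dict(self.shard_map)

class ShardAwareClient:
    def __init__(self, metadata):
        self.metadata = metadata
        self.cache = metadata.fetch_shard_map()   # local snapshot; may go stale

    def get(self, key):
        shard = shard_of(key, len(self.cache))
        try:
            return self.cache[shard].read(shard, key)
        except WrongShardError:
            self.cache = self.metadata.fetch_shard_map()   # refresh, retry once
            return self.cache[shard].read(shard, key)

a, b = FakeNode(), FakeNode()
key, home = "user:42", shard_of("user:42", 2)
a.shards[home] = {key: "Ada"}                 # shard starts on node a
meta = FakeMetadataService({home: a, 1 - home: b})
client = ShardAwareClient(meta)               # client caches the mapping
b.shards[home] = a.shards.pop(home)           # shard migrates to node b...
meta.shard_map[home] = b                      # ...and the authoritative map updates
print(client.get(key))   # stale cache -> WrongShardError -> refresh -> "Ada"
```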

Sharding and Secondary Indexes

Sharding works cleanly for key-value access: you know the partition key, you find the shard, you fetch the record. Secondary indexes break this model. A secondary index doesn't identify a record uniquely; it maps a value to a list of primary-key IDs (a postings list). Index entries don't map neatly to shards based on the primary key.

| Approach | How it works | Read cost | Write cost | Used in |
|---|---|---|---|---|
| Local secondary index (document-partitioned) | Each shard maintains its own secondary index covering only the records in that shard; a write only updates the index on the shard owning that record | ❌ Scatter/gather: a query must be sent to ALL shards and the results merged; prone to tail latency amplification | ✓ Single shard: a write only touches the shard owning the primary record | MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB |
| Global secondary index (term-partitioned) | One index covers all shards but is itself sharded by the indexed term/value (not by primary key); "color:red" entries from all shards appear in one place | ✓ Single shard for the term: efficient single-term queries | ❌ Multiple shards: writing one record may update index entries on several shards; requires a distributed transaction or async replication | CockroachDB, TiDB, YugabyteDB; DynamoDB (async global secondary indexes) |

Local Secondary Indexes in Depth

Each shard maintains its own local index covering only the records it owns. A write to a record only updates the index on that record's shard. Reads that use the secondary index without knowing the partition key must query every shard and merge the results: a scatter/gather query. The query is sent to all shards in parallel, but you still have to wait for the slowest shard to respond (tail latency amplification). Adding more shards stores more data but doesn't increase secondary-index query throughput, because every shard still has to process every scatter/gather query.
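
A scatter/gather sketch over three in-memory stand-in shards, each holding a local postings list (terms and document IDs are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Three in-memory "shards", each indexing only its own records.
shards = [
    {"red": [1, 5]},
    {"red": [7], "blue": [8]},
    {"blue": [12]},
]

def query_shard(local_index, term):
    return local_index.get(term, [])

def scatter_gather(term):
    # Query every shard in parallel; latency is still bounded by the
    # slowest shard to respond (tail latency amplification).
    with ThreadPoolExecutor() as pool:
        results = pool.map(query_shard, shards, [term] * len(shards))
    return [doc for postings in results for doc in postings]

print(scatter_gather("red"))   # [1, 5, 7] after merging all shards' answers
```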

Global Secondary Indexes in Depth

A global index covers records from all shards, but is itself sharded by the indexed term. For a car-listing database, a global index on color might put colors a–r in shard 0 and s–z in shard 1. A query for color=red hits exactly one index shard, which is efficient. But writing a new car listing might update index entries on several shards (one for color, one for make, one for location). This requires either a distributed transaction (expensive, adds latency) or asynchronous updates (fast, but reads from the global index may see stale data). DynamoDB's global secondary indexes are updated asynchronously, making them similar in behaviour to replication lag.
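
A sketch of the car-listing example with two term-partitioned index shards; the a–r / s–z split and the field names follow the prose above, everything else is illustrative:

```python
# Two term-partitioned global index shards: values starting a-r live on
# shard 0, s-z on shard 1.
index_shards = [dict(), dict()]

def index_shard_for(value: str) -> dict:
    return index_shards[0] if value[0] <= "r" else index_shards[1]

def index_record(doc_id: int, attributes: dict) -> None:
    """Writing ONE record touches an index shard per indexed value; a real
    system needs a distributed transaction or async updates here."""
    for field, value in attributes.items():
        index_shard_for(value).setdefault(f"{field}:{value}", []).append(doc_id)

index_record(306, {"color": "red", "make": "toyota"})
index_record(768, {"color": "red", "make": "subaru"})

print(index_shards[0]["color:red"])    # [306, 768]: one shard serves the term
print(index_shards[1]["make:toyota"])  # [306]: same record, different index shard
```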

Data warehouses and partitioning: BigQuery, Snowflake, and Delta Lake all use partitioning for query performance, not just for scalability. In BigQuery, the partition key determines which partition a record resides in, while "cluster columns" determine sort order within partitions. Snowflake uses "micro-partitions" automatically, with optional cluster keys. Delta Lake supports both manual and automatic partition assignment. Clustering data improves both range-scan performance and compression efficiency (sorted data of similar type compresses better).

Six Mental Models for Sharding

01 · Sharding is a last resort, not a default

A single modern machine can handle a very large workload: tens of thousands of IOPS, hundreds of gigabytes of RAM, terabytes of SSD. Read-throughput problems are solved by replication (Ch.6), not sharding. Shard only when your data volume or write throughput genuinely exceeds single-node capacity. Sharding adds permanent complexity: choosing a partition key, handling cross-shard operations, managing rebalancing. Get the most out of a single node first.

02 · The partition key decision is permanent (and painful to change)

All records with the same partition key live in the same shard. Once you've chosen the partition key, changing it requires resharding, an expensive operation that rewrites all data, consumes disk space, and in some systems requires downtime. Choose a key with high cardinality (many distinct values), even distribution under your real workload, and alignment with your most common access pattern (single-record lookup vs. range scan).

03 · Hash sharding trades range scans for uniform load

Hashing destroys the sort order of keys, making range queries expensive (scatter/gather across all shards). This is acceptable for key-value workloads where you always look up by exact key. For time-series, logs, or ordered data where you frequently scan a range, key-range or hash-range sharding is better. Cassandra and DynamoDB's hash-range sharding is the practical compromise: hash for uniform distribution, sort within the shard for range scan efficiency.

04 · Local indexes are fast to write, slow to read; global indexes are the opposite

A local secondary index only covers one shard: updates are fast (single shard), but queries that don't know the partition key must scatter to all shards. A global secondary index covers all shards: reads are fast (single index shard), but writes must update multiple index shards, requiring distributed coordination. Neither is free. Choose based on your read/write ratio and whether reads usually include the partition key.

05 · Rebalancing is expensive: minimise data movement

Hash modulo N is naive: adding one node moves ~(N-1)/N of all keys. Fixed shard count solves this: only whole shards move, and only a 1/N fraction of shards. Hash-range sharding further limits movement by only splitting the affected range boundary. The right approach depends on how often you add/remove nodes and how much data movement your network can absorb without impacting production traffic.

06 · Coordination services solve the routing metadata problem

Every routing approach requires answering "which shard is on which node?" This metadata must be consistent, fault-tolerant, and kept up to date. ZooKeeper and etcd solve this with consensus algorithms (Raft/Paxos): they maintain the authoritative shard assignment and notify routing tiers when it changes. Modern databases (Kafka, YugabyteDB, TiDB, ScyllaDB) embed Raft directly rather than depending on an external ZooKeeper cluster.