Database Sharding Deep Dive | Archicise

Introduction to Database Sharding

Sharding is a database architecture pattern that horizontally partitions data across multiple database instances. Each partition is called a "shard" and contains a subset of the total data.

Why Shard?

Single database servers have limits:

Storage: Maximum disk capacity
Throughput: Maximum queries per second
Memory: Limited RAM for indexes and caching

Sharding overcomes these limits by distributing data and load across multiple servers.

Sharding vs Replication

Replication copies the same data to multiple nodes (for redundancy and read scaling).

Sharding splits data across nodes (for write scaling and storage capacity).

These techniques are often combined: each shard has replicas for redundancy.

Sharding Strategies

Range-Based Sharding

Data is partitioned based on ranges of the shard key:

Shard 1: user_id 1-1,000,000
Shard 2: user_id 1,000,001-2,000,000
etc.

Pros: Simple, efficient range queries Cons: Can lead to hotspots if data isn't uniformly distributed

Hash-Based Sharding

A hash function determines which shard stores each record:

shard_number = hash(shard_key) % number_of_shards

Pros: Even data distribution Cons: Range queries require hitting all shards, resharding is complex

Directory-Based Sharding

A lookup table maps keys to shards:

| Key Range | Shard | |-----------|-------| | A-M | 1 | | N-Z | 2 |

Pros: Flexible, easy to move data Cons: Lookup table becomes a single point of failure

Geographic Sharding

Data is partitioned by geographic region:

US users → US shard
EU users → EU shard

Pros: Lower latency, data sovereignty compliance Cons: Users who travel may have higher latency

Choosing a Shard Key

The shard key is the most critical decision in sharding. Consider:

Cardinality

High cardinality (many unique values) allows even distribution. User ID is better than country code.

Access Patterns

Choose a key that matches your query patterns. If you always query by user_id, shard by user_id.

Data Distribution

Ensure data is evenly distributed. Avoid keys that create hotspots (e.g., timestamp for time-series data with mostly recent queries).

Growth Pattern

Consider how data will grow. Avoid keys that require frequent resharding.

Common Shard Key Examples

E-commerce: customer_id or order_id
Social Media: user_id
Multi-tenant SaaS: tenant_id
Gaming: player_id or game_id

Challenges of Sharding

Cross-Shard Queries

Queries spanning multiple shards are complex:

-- This query needs to hit all shards
SELECT * FROM users WHERE age > 25

Solutions:

Denormalize data to avoid cross-shard joins
Use scatter-gather pattern (query all shards, merge results)
Maintain global indexes

Cross-Shard Transactions

ACID transactions across shards are difficult. Options:

Two-Phase Commit (2PC): Coordinates across shards but slow
Saga Pattern: Break into local transactions with compensating actions
Avoid: Design to minimize cross-shard transactions

Resharding

Adding or removing shards requires data migration:

Plan for resharding from the start
Use consistent hashing to minimize data movement
Implement online migration to avoid downtime

Operational Complexity

Each shard needs:

Monitoring
Backups
Failover handling
Schema migrations

Implementing Sharding

Application-Level Sharding

Application code determines which shard to use:

def get_shard(user_id):
    return user_id % NUM_SHARDS

def get_user(user_id):
    shard = get_shard(user_id)
    connection = get_connection(shard)
    return connection.query("SELECT * FROM users WHERE id = ?", user_id)

Proxy-Based Sharding

A proxy layer handles routing:

Vitess (for MySQL)
Citus (for PostgreSQL)
ProxySQL

Database-Native Sharding

Some databases have built-in sharding:

MongoDB
CockroachDB
TiDB

Best Practices

Start simple: Exhaust vertical scaling and replication first
Choose the shard key carefully: It's hard to change later
Plan for resharding: Use consistent hashing, design for growth
Monitor shard balance: Watch for hotspots
Test thoroughly: Simulate failures and high load

Conclusion

Sharding is a powerful technique for scaling databases beyond single-server limits. While it adds complexity, careful planning and the right tooling can make it manageable. Always consider whether sharding is truly necessary before implementing it.