Introduction to Database Sharding
Sharding is a database architecture pattern that horizontally partitions data across multiple database instances. Each partition is called a "shard" and contains a subset of the total data.
Why Shard?
Single database servers have limits:
- Storage: Maximum disk capacity
- Throughput: Maximum queries per second
- Memory: Limited RAM for indexes and caching
Sharding overcomes these limits by distributing data and load across multiple servers.
Sharding vs Replication
Replication copies the same data to multiple nodes (for redundancy and read scaling).
Sharding splits data across nodes (for write scaling and storage capacity).
These techniques are often combined: each shard has replicas for redundancy.
Sharding Strategies
Range-Based Sharding
Data is partitioned based on ranges of the shard key:
- Shard 1: user_id 1-1,000,000
- Shard 2: user_id 1,000,001-2,000,000
- etc.
Pros: Simple, efficient range queries Cons: Can lead to hotspots if data isn't uniformly distributed
Hash-Based Sharding
A hash function determines which shard stores each record:
shard_number = hash(shard_key) % number_of_shards
Pros: Even data distribution Cons: Range queries require hitting all shards, resharding is complex
Directory-Based Sharding
A lookup table maps keys to shards:
| Key Range | Shard | |-----------|-------| | A-M | 1 | | N-Z | 2 |
Pros: Flexible, easy to move data Cons: Lookup table becomes a single point of failure
Geographic Sharding
Data is partitioned by geographic region:
- US users → US shard
- EU users → EU shard
Pros: Lower latency, data sovereignty compliance Cons: Users who travel may have higher latency
Choosing a Shard Key
The shard key is the most critical decision in sharding. Consider:
Cardinality
High cardinality (many unique values) allows even distribution. User ID is better than country code.
Access Patterns
Choose a key that matches your query patterns. If you always query by user_id, shard by user_id.
Data Distribution
Ensure data is evenly distributed. Avoid keys that create hotspots (e.g., timestamp for time-series data with mostly recent queries).
Growth Pattern
Consider how data will grow. Avoid keys that require frequent resharding.
Common Shard Key Examples
- E-commerce: customer_id or order_id
- Social Media: user_id
- Multi-tenant SaaS: tenant_id
- Gaming: player_id or game_id
Challenges of Sharding
Cross-Shard Queries
Queries spanning multiple shards are complex:
-- This query needs to hit all shards
SELECT * FROM users WHERE age > 25
Solutions:
- Denormalize data to avoid cross-shard joins
- Use scatter-gather pattern (query all shards, merge results)
- Maintain global indexes
Cross-Shard Transactions
ACID transactions across shards are difficult. Options:
- Two-Phase Commit (2PC): Coordinates across shards but slow
- Saga Pattern: Break into local transactions with compensating actions
- Avoid: Design to minimize cross-shard transactions
Resharding
Adding or removing shards requires data migration:
- Plan for resharding from the start
- Use consistent hashing to minimize data movement
- Implement online migration to avoid downtime
Operational Complexity
Each shard needs:
- Monitoring
- Backups
- Failover handling
- Schema migrations
Implementing Sharding
Application-Level Sharding
Application code determines which shard to use:
def get_shard(user_id):
return user_id % NUM_SHARDS
def get_user(user_id):
shard = get_shard(user_id)
connection = get_connection(shard)
return connection.query("SELECT * FROM users WHERE id = ?", user_id)
Proxy-Based Sharding
A proxy layer handles routing:
- Vitess (for MySQL)
- Citus (for PostgreSQL)
- ProxySQL
Database-Native Sharding
Some databases have built-in sharding:
- MongoDB
- CockroachDB
- TiDB
Best Practices
- Start simple: Exhaust vertical scaling and replication first
- Choose the shard key carefully: It's hard to change later
- Plan for resharding: Use consistent hashing, design for growth
- Monitor shard balance: Watch for hotspots
- Test thoroughly: Simulate failures and high load
Conclusion
Sharding is a powerful technique for scaling databases beyond single-server limits. While it adds complexity, careful planning and the right tooling can make it manageable. Always consider whether sharding is truly necessary before implementing it.