Back to Study
Article18 min read

Database Sharding Deep Dive

Master horizontal partitioning, shard key selection, and managing cross-shard queries in large-scale databases.

DatabasesShardingScalability

Introduction to Database Sharding

Sharding is a database architecture pattern that horizontally partitions data across multiple database instances. Each partition is called a "shard" and contains a subset of the total data.

Why Shard?

Single database servers have limits:

  • Storage: Maximum disk capacity
  • Throughput: Maximum queries per second
  • Memory: Limited RAM for indexes and caching

Sharding overcomes these limits by distributing data and load across multiple servers.

Sharding vs Replication

Replication copies the same data to multiple nodes (for redundancy and read scaling).

Sharding splits data across nodes (for write scaling and storage capacity).

These techniques are often combined: each shard has replicas for redundancy.

Sharding Strategies

Range-Based Sharding

Data is partitioned based on ranges of the shard key:

  • Shard 1: user_id 1-1,000,000
  • Shard 2: user_id 1,000,001-2,000,000
  • etc.

Pros: Simple, efficient range queries Cons: Can lead to hotspots if data isn't uniformly distributed

Hash-Based Sharding

A hash function determines which shard stores each record:

shard_number = hash(shard_key) % number_of_shards

Pros: Even data distribution Cons: Range queries require hitting all shards, resharding is complex

Directory-Based Sharding

A lookup table maps keys to shards:

| Key Range | Shard | |-----------|-------| | A-M | 1 | | N-Z | 2 |

Pros: Flexible, easy to move data Cons: Lookup table becomes a single point of failure

Geographic Sharding

Data is partitioned by geographic region:

  • US users → US shard
  • EU users → EU shard

Pros: Lower latency, data sovereignty compliance Cons: Users who travel may have higher latency

Choosing a Shard Key

The shard key is the most critical decision in sharding. Consider:

Cardinality

High cardinality (many unique values) allows even distribution. User ID is better than country code.

Access Patterns

Choose a key that matches your query patterns. If you always query by user_id, shard by user_id.

Data Distribution

Ensure data is evenly distributed. Avoid keys that create hotspots (e.g., timestamp for time-series data with mostly recent queries).

Growth Pattern

Consider how data will grow. Avoid keys that require frequent resharding.

Common Shard Key Examples

  • E-commerce: customer_id or order_id
  • Social Media: user_id
  • Multi-tenant SaaS: tenant_id
  • Gaming: player_id or game_id

Challenges of Sharding

Cross-Shard Queries

Queries spanning multiple shards are complex:

-- This query needs to hit all shards
SELECT * FROM users WHERE age > 25

Solutions:

  • Denormalize data to avoid cross-shard joins
  • Use scatter-gather pattern (query all shards, merge results)
  • Maintain global indexes

Cross-Shard Transactions

ACID transactions across shards are difficult. Options:

  • Two-Phase Commit (2PC): Coordinates across shards but slow
  • Saga Pattern: Break into local transactions with compensating actions
  • Avoid: Design to minimize cross-shard transactions

Resharding

Adding or removing shards requires data migration:

  • Plan for resharding from the start
  • Use consistent hashing to minimize data movement
  • Implement online migration to avoid downtime

Operational Complexity

Each shard needs:

  • Monitoring
  • Backups
  • Failover handling
  • Schema migrations

Implementing Sharding

Application-Level Sharding

Application code determines which shard to use:

def get_shard(user_id):
    return user_id % NUM_SHARDS

def get_user(user_id):
    shard = get_shard(user_id)
    connection = get_connection(shard)
    return connection.query("SELECT * FROM users WHERE id = ?", user_id)

Proxy-Based Sharding

A proxy layer handles routing:

  • Vitess (for MySQL)
  • Citus (for PostgreSQL)
  • ProxySQL

Database-Native Sharding

Some databases have built-in sharding:

  • MongoDB
  • CockroachDB
  • TiDB

Best Practices

  1. Start simple: Exhaust vertical scaling and replication first
  2. Choose the shard key carefully: It's hard to change later
  3. Plan for resharding: Use consistent hashing, design for growth
  4. Monitor shard balance: Watch for hotspots
  5. Test thoroughly: Simulate failures and high load

Conclusion

Sharding is a powerful technique for scaling databases beyond single-server limits. While it adds complexity, careful planning and the right tooling can make it manageable. Always consider whether sharding is truly necessary before implementing it.