Redis Cluster vs CRDT Rate Limiting
Once a rate limit must hold across many nodes, you face one architectural fork: keep a single authoritative counter in Redis (Cluster mode), or let every node hold a partial counter and reconcile them with a CRDT or gossip protocol. This decision belongs to the Distributed Algorithm Sync parent topic, and it is fundamentally the CAP-theorem choice applied to a counter: under a network partition you can keep the count exact (and stall, sacrificing availability) or keep serving (and drift, sacrificing consistency). You cannot have both.
The problem in concrete numbers
Suppose you enforce 10,000 requests/minute per tenant across 8 gateway nodes in two regions. A centralized counter in Redis Cluster mode keeps one number per tenant, so the limit is exact β but every decision is a cross-node (sometimes cross-region) round-trip of roughly 0.3β1 ms intra-region and 30β80 ms cross-region, and a partition that isolates the Redis primary stalls or fails every check routed to it. A CRDT g-counter keeps a per-node sub-count and merges asynchronously every, say, 200 ms; no request ever waits on the network, and a partition only means each side counts independently. The price is over-admission: during a 5-second partition with traffic split 4/4 across nodes, each side can independently approach the full 10,000, so a tenant briefly sees up to ~2Γ the advertised limit.
Decision matrix
| Criterion | Redis Cluster (centralized) | CRDT / gossip counters |
|---|---|---|
| Consistency model | Strong per key (single slot owner) | Eventual; converges after merge |
| Accuracy under partition | Exact on majority side; minority stalls | Over-admits up to ~NΓ across isolated sides |
| Added per-request latency | One round-trip: ~0.3β1 ms intra-region | ~0 (local read); merge is off the request path |
| Availability during partition | Minority side fails or fails-open | Both sides keep serving |
| Throughput ceiling | Redis QPS + slot hotspots | Node-local, scales with fleet |
| Operational complexity | Moderate: run, shard, fail over Redis | High: merge logic, anti-entropy, convergence bugs |
| Recovery behavior | Immediate once primary reachable | Counts merge on heal; transient overshoot clears |
| Best fit | Hard/billing limits, modest node count | Huge fleets, soft limits, partition-prone topologies |
Selection rules:
- Use Redis Cluster mode when the limit is billing- or compliance-critical, when a brief stall is more acceptable than over-admission, and when your node count and round-trip budget make one authoritative counter cheap. This is the right default for most teams.
- Use CRDT/gossip counters when availability outranks exactness (soft abuse limits, fairness shaping), when the fleet is large enough or partition-prone enough that a central store becomes a bottleneck or a liability, and when you can tolerate β and bound β transient overshoot.
- Hybrid: run a generous CRDT/local layer that absorbs bursts and a periodic reconciliation against an authoritative counter in the Redis counter architecture for the hard ceiling. You get availability in the common case and a backstop on the number that matters.
Step-by-step: a CRDT g-counter for distributed rate limiting
A grow-only counter (g-counter) is the simplest CRDT that fits rate limiting within a fixed window: each node owns one slot in a vector, increments only its own slot, and the merged value is the sum of per-node maxima. Increments commute and are idempotent under merge, so order and duplication do not corrupt the count.
- Keep a per-window vector
{nodeId: localCount}
import time, threading
class GCounter:
"""Grow-only CRDT counter keyed by node id, per fixed window."""
def __init__(self, node_id: str, limit: int, window_s: int):
self.node_id = node_id
self.limit = limit
self.window_s = window_s
self._lock = threading.Lock()
self._slots: dict[str, int] = {} # nodeId -> count, this window
self._window = self._current_window()
def _current_window(self) -> int:
return int(time.time()) // self.window_s
def _roll_if_needed(self) -> None:
w = self._current_window()
if w != self._window: # new window: g-counter is reset wholesale
self._slots = {}
self._window = w
def value(self) -> int:
# Effective count = sum of every node's local sub-count.
return sum(self._slots.values())
def try_acquire(self) -> bool:
with self._lock:
self._roll_if_needed()
if self.value() >= self.limit:
return False # already at the global ceiling
self._slots[self.node_id] = self._slots.get(self.node_id, 0) + 1
return True # increment only our own slot
def merge(self, other_slots: dict[str, int], other_window: int) -> None:
# Idempotent, commutative merge: take the per-node maximum.
with self._lock:
self._roll_if_needed()
if other_window != self._window:
return # ignore stale-window gossip
for nid, cnt in other_slots.items():
self._slots[nid] = max(self._slots.get(nid, 0), cnt)
def snapshot(self) -> tuple[dict[str, int], int]:
with self._lock:
return dict(self._slots), self._window
The merge takes the maximum of each nodeβs slot, not the sum, because two gossip messages may both carry the same nodeβs progress β taking the max makes re-delivery harmless. The effective count is then the sum across slots. This is why the counter over-admits during a partition: each isolated side advances its own slots freely, and only on heal does the merged sum reveal the true total.
Gotchas & edge cases
- Window rolls must be deterministic. Both sides must agree on window boundaries or the merge silently drops counts. Anchor windows to a shared clock and watch clock skew β a node whose clock is fast rolls early and undercounts.
- Over-admission is unbounded by node count, not by time. With 8 nodes fully partitioned into singletons, the worst case approaches 8Γ the limit for the partitionβs duration. Reserve headroom or cap the per-node slot at
limit / nodeCountif a hard ceiling matters. - g-counters only grow β they fit fixed windows, not token buckets. A refilling bucket needs a PN-counter or a different CRDT; do not bolt decrement onto a g-counter.
- Redis Cluster keys must hash to one slot. A multi-key Lua script for one limit must use a hash tag (e.g.
rl:{tenant42}:min) or Redis rejects the cross-slot operation. This is the most common Redis Cluster mode rate-limiting bug. - Fail-open vs fail-closed is still a decision on the Redis path. A partition that isolates the primary forces the question; decide deliberately and alert on the fallback.
Verification & testing
Partition the network mid-load and confirm each design behaves as the model predicts: Redis stalls the minority side and stays exact; the CRDT keeps serving and over-admits, then converges on heal.
# Drive 8 workers at the same tenant key for 30s, severing the link at t=10s.
seq 8 | xargs -P8 -I{} sh -c \
'hey -z 30s -c 25 -H "X-Tenant: t42" https://gw-{}.internal/v1/run' \
| grep -E "Requests/sec|Status code distribution"
# In another shell, drop traffic between the two AZ subnets for 5s:
sudo tc qdisc add dev eth0 root netem loss 100% ; sleep 5 ; sudo tc qdisc del dev eth0 root
Track accepted-vs-rejected totals and, for the CRDT, the peak effective count during the split versus after convergence; see Prometheus metrics for rate limiting for the instrumentation.
Frequently Asked Questions
Is CRDT-based rate limiting ever worth the complexity?
Only when availability genuinely outranks exactness and the fleet is large or partition-prone enough that a central counter is a bottleneck or single point of failure. For most teams a single Redis (Cluster mode with replicas) is simpler, accurate, and fast enough; reach for CRDTs for soft limits at very large scale.
How much does a CRDT counter over-admit during a partition?
In the worst case, up to NΓ the limit, where N is the number of mutually isolated partitions, for the duration of the split. You bound it by shortening the gossip interval, reserving headroom (enforce at 80β90% of the true limit), or capping each node's slot at limit / nodeCount.
Does Redis Cluster mode give a global limit out of the box?
Yes, per key β each key lives on one slot owned by one primary, so increments for that key are serialized and exact. The catch is that a multi-key Lua script for one limit must hash to a single slot via a hash tag, and a partition isolating that primary stalls or fails the affected checks.
Why does the g-counter merge use max instead of sum?
Because gossip can deliver the same node's progress more than once. Taking the per-node maximum makes merges idempotent and commutative, so re-delivery and reordering never double-count. The global value is then the sum across nodes' slots.
Related
- Distributed Algorithm Sync β the parent topic on cross-node state coordination, clock skew, and partition tolerance.
- Redis Counter Architecture β building and scaling the authoritative centralized store.
- Redis vs In-Memory Token Bucket β the simpler per-node-versus-global version of this same tradeoff.
- Algorithm Tradeoff Analysis β where distributed complexity sits among the other selection axes.