Redis vs In-Memory Token Bucket

Q: Is the Redis round-trip latency worth it?

For a global limit, yes — 0.2 to 1 ms intra-AZ is negligible next to typical API latency and is the only way to hold one number across many nodes. If the limit is genuinely per-node, an in-memory bucket is correct.

Choosing between a Redis-backed and an in-memory token bucket is the first hard decision you make when you scale rate limiting past a single process. This question sits under the token bucket algorithm guide, and it turns on one variable: whether your limit must hold globally across every node, or only locally per process. Pick wrong and you either multiply your effective limit by the node count under load balancing, or you add a network round-trip to every request for a guarantee you never needed.

The problem in concrete numbers

Say you advertise 100 requests/second per API key and you run 5 application nodes behind a round-robin load balancer. With a purely in-memory bucket, each node refills its own 100 rps reservoir, so a single key can spend up to 500 rps across the fleet — a 5× overshoot that silently breaks the contract you sell to customers. A Redis-backed bucket keeps one authoritative reservoir, so the same key is held to 100 rps no matter which node answers. The cost: every decision becomes a Redis round-trip, typically 0.2–1 ms intra-AZ, added to request latency and to your Redis QPS bill.
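The overshoot can be reproduced without any infrastructure. The sketch below is a deterministic simulation with a fake clock, not a benchmark: it assumes every bucket has capacity 100 and refills at 100 tokens/s, spreads a steady 1,000 rps evenly over 5 nodes, and compares one shared bucket against one bucket per node. The names `take` and `simulate` are illustrative only.

```typescript
interface Bucket { tokens: number; ts: number }

const CAPACITY = 100;      // assumed burst size
const RATE_PER_SEC = 100;  // advertised limit

// Refill from elapsed time, then try to spend one token.
function take(b: Bucket, nowMs: number): boolean {
  const elapsedMs = Math.max(0, nowMs - b.ts);
  b.tokens = Math.min(CAPACITY, b.tokens + (elapsedMs / 1000) * RATE_PER_SEC);
  b.ts = nowMs;
  if (b.tokens < 1) return false;
  b.tokens -= 1;
  return true;
}

// Offer `offeredPerSec` requests/s for `seconds`, round-robin across `nodes`.
// shared = true models one authoritative bucket; false models one bucket per node.
function simulate(nodes: number, shared: boolean, offeredPerSec: number, seconds: number): number {
  const count = shared ? 1 : nodes;
  const buckets: Bucket[] = Array.from({ length: count }, () => ({ tokens: CAPACITY, ts: 0 }));
  let accepted = 0;
  const total = offeredPerSec * seconds;
  for (let i = 0; i < total; i++) {
    const nowMs = (i / offeredPerSec) * 1000; // fake clock, evenly spaced arrivals
    if (take(buckets[i % count], nowMs)) accepted++;
  }
  return accepted;
}

const sharedAccepted = simulate(5, true, 1000, 10);
const perNodeAccepted = simulate(5, false, 1000, 10);
console.log({ sharedAccepted, perNodeAccepted, ratio: perNodeAccepted / sharedAccepted });
```

Over 10 simulated seconds the shared bucket admits roughly 1,100 requests (the initial burst of 100 plus 100 per second), while the per-node buckets together admit about five times as many.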

Decision matrix

Criterion	In-memory bucket	Redis-backed bucket
Accuracy across nodes	Per-process only; N× overshoot	Global, exact within one round-trip
Added per-request latency	~0 (local memory)	~0.2–1 ms intra-AZ; more cross-AZ
Throughput ceiling	CPU-bound, millions/s	Redis QPS + network bound
Failure blast radius	None (no dependency)	Redis outage stalls every limit check
State on deploy/restart	Lost; buckets reset, brief over-allow	Survives; keys persist until their TTL expires
Operational cost	Nil	Redis cluster to run, monitor, scale
Best fit	Single node, or coarse per-node caps	Multi-node global quotas, billing-critical limits

Selection rules:

Use in-memory when you run a single instance, when “per-node” is an acceptable definition of the limit, or as a fast local pre-filter that rejects obvious floods before they reach Redis.
Use Redis when the advertised limit is global, when correctness is billing- or compliance-critical, or when autoscaling makes the node count unpredictable (a per-node cap drifts every time you scale).
Use both (tiered): a generous in-memory bucket absorbs bursts and shields Redis, while the Redis counter architecture holds the authoritative global limit. This is the standard high-traffic pattern.
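One possible way to make these rules explicit is a small selection helper that lives next to your limiter config. This is only a sketch of the rules above: the `Deployment` fields and the `highTraffic` flag are hypothetical names, and the point at which traffic counts as "high" is something you would have to define for your own system.

```typescript
type Tier = "in-memory" | "redis" | "tiered";

interface Deployment {
  nodeCount: number;        // current number of app instances
  autoscaling: boolean;     // node count changes at runtime
  limitIsGlobal: boolean;   // the advertised number must hold across the fleet
  billingCritical: boolean; // billing- or compliance-critical correctness
  highTraffic: boolean;     // enough volume that shielding Redis matters (your call)
}

function chooseLimiter(d: Deployment): Tier {
  // A fixed single instance has nothing to coordinate with.
  if (d.nodeCount === 1 && !d.autoscaling) return "in-memory";
  const needsGlobal = d.limitIsGlobal || d.billingCritical || d.autoscaling;
  // "Per-node" is an acceptable definition of the limit.
  if (!needsGlobal) return "in-memory";
  // Global limit: add the local pre-filter only when it earns its keep.
  return d.highTraffic ? "tiered" : "redis";
}
```

Treating autoscaling as a reason to go global is a judgment call in this sketch; if a drifting per-node cap is acceptable to you, drop that term.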

Step-by-step: a hybrid local + global bucket

The robust production design is a two-tier check: a local bucket that pre-filters traffic and is the only limiter left when Redis is unreachable, and a global Redis bucket that is authoritative. Implement it in order.

Provision a Redis instance reachable from every node with sub-millisecond intra-AZ latency.
Define the limit once in config (capacity, refill_rate) and load it identically on every node.
Add a local per-node bucket sized generously (e.g. 2× the fair per-node share) as a cheap pre-filter.
Add the authoritative Redis bucket via an atomic Lua script (single round-trip, no read-modify-write race).
Decide a fail-open vs fail-closed policy for Redis outages and make it explicit in code.
Emit X-RateLimit-* and Retry-After headers from the authoritative decision.

// Two-tier token bucket: local pre-filter + authoritative Redis bucket.
import Redis from "ioredis";
const redis = new Redis(process.env.REDIS_URL!);

// Atomic refill-and-consume; returns [allowed, remaining]. One round-trip, no race.
const BUCKET_LUA = `
local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])     -- tokens per second
local now      = tonumber(ARGV[3])     -- ms
local cost     = tonumber(ARGV[4])
local b = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(b[1]) or capacity
local ts     = tonumber(b[2]) or now
-- Clamp elapsed time at 0 so a caller whose clock lags cannot subtract tokens.
tokens = math.min(capacity, tokens + math.max(0, now - ts) / 1000 * rate)
local allowed = 0
if tokens >= cost then tokens = tokens - cost; allowed = 1 end
redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', key, math.ceil(capacity / rate * 1000) + 1000)
return { allowed, math.floor(tokens) }`;

const local = new Map<string, { tokens: number; ts: number }>();
function localAllow(key: string, cap: number, rate: number): boolean {
  const now = Date.now();
  const b = local.get(key) ?? { tokens: cap, ts: now };
  b.tokens = Math.min(cap, b.tokens + ((now - b.ts) / 1000) * rate);
  b.ts = now;
  if (b.tokens < 1) { local.set(key, b); return false; } // shed before hitting Redis
  b.tokens -= 1; local.set(key, b); return true;
}

export async function allow(key: string): Promise<{ ok: boolean; remaining: number }> {
  // Local pre-filter at 2x the fair per-node share: 100 rps / 5 nodes = 20 rps,
  // so 40 rps refill with capacity 80.
  if (!localAllow(`local:${key}`, 80, 40)) return { ok: false, remaining: 0 };
  try {
    const [ok, remaining] = (await redis.eval(
      BUCKET_LUA, 1, `rl:${key}`, 100, 100, Date.now(), 1,
    )) as [number, number];
    return { ok: ok === 1, remaining };
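  } catch (err) {
    // Fail-open: Redis is unreachable, so only the local bucket is limiting traffic.
    // Log the error so the fallback path can be alerted on.
    console.error("rate limiter: Redis check failed, failing open", err);
    return { ok: true, remaining: -1 };
  }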
}
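The last step in the list, emitting headers, is not covered by the code above. A minimal sketch of it follows, assuming the `{ ok, remaining }` shape returned by `allow` and the common (non-standardized) `X-RateLimit-Limit` / `X-RateLimit-Remaining` header names. The helper name `rateLimitHeaders` is hypothetical, and the `Retry-After` value is an upper bound, because the script returns only the floored token count.

```typescript
interface Decision { ok: boolean; remaining: number }

function rateLimitHeaders(
  d: Decision,
  capacity: number,
  ratePerSec: number,
): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(capacity),
  };
  // remaining === -1 is the fail-open sentinel: there is no authoritative count to report.
  if (d.remaining >= 0) {
    headers["X-RateLimit-Remaining"] = String(d.remaining);
  }
  if (!d.ok) {
    // One token refills in at most 1 / ratePerSec seconds.
    // Retry-After takes whole seconds, so round up and never send 0.
    headers["Retry-After"] = String(Math.max(1, Math.ceil(1 / ratePerSec)));
  }
  return headers;
}

console.log(rateLimitHeaders({ ok: false, remaining: 0 }, 100, 100));
```

Note that a rejection from the local pre-filter also reports `remaining: 0` even though Redis was never consulted, so you may prefer to tag local rejections separately before building headers from them.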

Gotchas & edge cases

In-memory overshoot is silent. Nothing logs the 5× breach; you only see it as downstream saturation. If you advertise a global number, a per-node bucket is the wrong tool, not a cheaper one.
Fail-open vs fail-closed is a business decision. Failing open during a Redis outage protects availability but lets traffic through; failing closed protects the backend but returns 429s during your incident. Choose deliberately and alert on the fallback path.
Clock skew skews refill. The Lua script above takes now from the caller, and each node passes its own Date.now(), so clock skew between nodes perturbs refill. Prefer redis.call('TIME') inside the script for a single clock source.
Restarts reset in-memory state. A rolling deploy briefly resets local buckets, allowing a short over-allow window. Usually fine; not fine for hard quotas.
Hot keys concentrate load. One abusive key funnels every request to a single Redis slot. A local pre-filter is what saves Redis here.
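For the clock-skew gotcha, the following is a sketch of the same bucket script reading the clock from Redis rather than from the caller. The constant name `BUCKET_LUA_SERVER_TIME` and the helper `timeReplyToMs` are illustrative; the argument list shrinks to capacity, rate, and cost. One caveat to verify against your own version: on Redis releases before 5.0, a script that calls TIME and then writes may need `redis.replicate_commands()` at the top.

```typescript
// Variant of the bucket script that uses Redis server time as the only clock.
// ARGV: capacity, rate (tokens per second), cost.
const BUCKET_LUA_SERVER_TIME = `
local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local cost     = tonumber(ARGV[3])
local t   = redis.call('TIME')   -- { seconds, microseconds }, both as strings
local now = tonumber(t[1]) * 1000 + math.floor(tonumber(t[2]) / 1000)  -- ms
local b = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(b[1]) or capacity
local ts     = tonumber(b[2]) or now
tokens = math.min(capacity, tokens + math.max(0, now - ts) / 1000 * rate)
local allowed = 0
if tokens >= cost then tokens = tokens - cost; allowed = 1 end
redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', key, math.ceil(capacity / rate * 1000) + 1000)
return { allowed, math.floor(tokens) }`;

// The same seconds/microseconds-to-milliseconds conversion in TypeScript,
// useful for unit-testing the arithmetic without a Redis server.
function timeReplyToMs(reply: [string, string]): number {
  return Number(reply[0]) * 1000 + Math.floor(Number(reply[1]) / 1000);
}

console.log(timeReplyToMs(["1700000000", "123456"]));
```

With ioredis the call would then drop the timestamp argument, along the lines of `redis.eval(BUCKET_LUA_SERVER_TIME, 1, key, 100, 100, 1)`.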

Verification & testing

Drive concurrent load from multiple workers and confirm the aggregate accepted rate matches the limit, not a multiple of it.

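# 5 parallel workers, same key, 3 s. If the key was idle beforehand, the Redis
# bucket starts full, so expect roughly 400 accepted in total (100 burst +
# 3 s x 100 rps); a pure in-memory bucket on 5 nodes accepts about 5x that.
# hey prints per-status counts as "[200] N responses"; sum the [200] lines.
seq 5 | xargs -P5 -I{} sh -c \
  'hey -z 3s -c 20 -H "X-API-Key: acct_42" https://api.example.com/v1/search' \
  | grep -E "Requests/sec|\[[0-9]+\]"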

Watch rate_limit_redis_latency_seconds and the accepted-vs-rejected ratio while the test runs; see Prometheus metrics for rate limiting for the instrumentation.

Frequently Asked Questions

Is the Redis round-trip latency worth it?

For a global limit, yes — 0.2–1 ms intra-AZ is negligible next to typical API latency, and it is the only way to hold one number across many nodes. If your limit is genuinely per-node, the round-trip buys you nothing and an in-memory bucket is correct.

Can I avoid Redis and still get a global limit?

Only approximately. Gossip or CRDT-based counters spread state across nodes without a central store, but they trade exactness for availability and are far more complex to operate. For most teams a single Redis (with replicas) is simpler and accurate enough.

How big should the local pre-filter bucket be?

Larger than the fair per-node share so it never rejects legitimate traffic that Redis would allow, but small enough to shed obvious floods. Roughly 1.5–2× the per-node share (global limit ÷ node count) is a sensible default.
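As a worked version of that rule of thumb, here is a small sizing helper. The function name `sizeLocalPrefilter` and the `burstSeconds` parameter are assumptions made for this sketch; the article's own example (refill 40 rps, capacity 80 for 100 rps over 5 nodes) is consistent with a headroom of 2 and a capacity worth 2 seconds of refill, which is what the defaults reproduce.

```typescript
interface LocalBucketSize { capacity: number; ratePerSec: number }

function sizeLocalPrefilter(
  globalRatePerSec: number,
  nodeCount: number,
  headroom = 2,      // 1.5 to 2 is the suggested range
  burstSeconds = 2,  // assumed: capacity holds this many seconds of refill
): LocalBucketSize {
  if (nodeCount < 1) throw new RangeError("nodeCount must be >= 1");
  const fairShare = globalRatePerSec / nodeCount; // global limit / node count
  const ratePerSec = fairShare * headroom;
  return { ratePerSec, capacity: Math.ceil(ratePerSec * burstSeconds) };
}

console.log(sizeLocalPrefilter(100, 5));      // refill 40 rps, capacity 80
console.log(sizeLocalPrefilter(100, 5, 1.5)); // refill 30 rps, capacity 60
```

This sizing assumes the load balancer spreads a key's traffic roughly evenly. With sticky routing a single node can legitimately see more than twice its fair share, in which case the pre-filter would reject requests Redis would have allowed, so the headroom needs to grow accordingly.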

What happens to in-flight limits during a deploy?

In-memory buckets reset to full on restart, so a rolling deploy briefly over-allows. Redis-backed buckets persist through the deploy because state lives outside the process. For hard quotas, keep the authoritative state in Redis.

Token Bucket Implementation — the parent algorithm guide and atomic refill mechanics.
How to Choose Between Token Bucket and Leaky Bucket — sibling decision guide.
Redis Counter Architecture — how the authoritative store is built and scaled.
Distributed Algorithm Sync — clock skew and cross-node consistency.