Leaky Bucket Overflow and Backpressure
A leaky bucket is only as good as its behavior when the queue is full — the steady-state case is trivial, but the overflow case is where you either shed load gracefully or amplify an incident. This guide drills into the overflow and backpressure mechanics that the Leaky Bucket Mechanics reference introduces: which request you drop, which status code you return, how you compute Retry-After from live queue depth, and how you size the queue so it never grows unbounded. Get these wrong and a transient spike becomes a retry storm; get them right and the queue absorbs the spike and signals clients to slow down before they pile on.
The problem in concrete numbers
Picture a webhook-delivery service with a downstream that safely sustains 200 requests/second (drain rate R = 200/s) and a queue sized for 2 seconds of buffer, so capacity = 400. Upstream throws a 1,000 rps burst for 5 seconds. The queue fills in 400 / (1000 − 200) = 0.5 s, then sits pinned at 400 with 800 rps of overflow for the rest of the burst. The questions that decide whether this is a non-event or an outage: which 800 do you drop, what do you tell their clients, and does your queue actually stop at 400 or quietly grow to OOM the process?
Overflow strategies compared
When the queue is full, you must choose what to sacrifice. The four policies below differ in which request dies, whether the client learns about it, and what it costs to operate.
| Strategy | What it does | Client signal | Best for | Risk |
|---|---|---|---|---|
| Reject-newest (tail-drop) | Drop the arriving request; keep the queue intact | 429 + Retry-After |
Most APIs; clients can retry | Newest requests starve under sustained overload |
| Drop-oldest (head-drop) | Evict the front of the queue to admit the newcomer | None to the dropped client (already accepted) | Real-time feeds where stale data is worthless | Silently breaks the “accepted” contract; hard to debug |
| Reject-newest + 503 | Drop the arrival, signal “system overloaded” not “you’re over quota” | 503 + Retry-After |
Capacity overload, not per-client abuse | Clients may treat 503 as a hard outage |
| Dead-letter routing | Move overflow to a DLQ for async reprocessing | 202 Accepted (deferred) |
Webhooks, jobs that must not be lost | Adds infra; not viable for synchronous reads |
Selection rules:
- Default to reject-newest with
429for client-facing APIs: it preserves FIFO fairness for already-queued work and gives the caller an actionable retry hint. - Use
503instead of429when the cause is global capacity overload rather than one client exceeding its share — the semantics differ, and so should client retry behavior. See the FAQ for the exact distinction. - Use dead-letter routing only when losing a request is unacceptable (payment webhooks, billing events) and the work can be processed asynchronously.
- Avoid drop-oldest unless staleness genuinely makes old entries worthless; it violates the implicit promise that an enqueued request will be served.
Backpressure signaling: 429 vs 503 and Retry-After
Backpressure is the explicit signal that tells a client to slow down before it makes things worse. Two decisions matter: the status code and the Retry-After value.
429 Too Many Requestsmeans this caller exceeded its allowance. Pair it withX-RateLimit-*headers so the client knows its budget.503 Service Unavailablemeans the system as a whole is saturated and can’t take the request right now, independent of who’s asking. Use it when overall queue depth — not a per-key counter — is the cause.
Either way, derive Retry-After from live queue depth, not a constant. A fixed Retry-After: 1 sends every rejected client back at the same instant, producing a synchronized retry wave (a thundering herd). The honest estimate is “how long until a drain slot frees up,” which is queue_depth / drain_rate, plus jitter to desynchronize retries.
// Compute Retry-After (seconds) from current queue depth and drain rate.
// Returns an integer >= 1, since Retry-After in seconds form is whole seconds.
function retryAfterSeconds(queueDepth: number, drainRatePerSec: number): number {
// Time for the queue to clear enough for this request to be admitted.
const drainSeconds = queueDepth / drainRatePerSec;
// Add bounded jitter (up to 20%) to desynchronize clients and avoid a retry wave.
const jitter = drainSeconds * Math.random() * 0.2;
return Math.max(1, Math.ceil(drainSeconds + jitter));
}
// Example: depth=400, drain=200/s -> ~2s base, 2-3s after jitter and ceil.
// On overflow:
// res.set('Retry-After', String(retryAfterSeconds(depth, 200)));
// res.status(isGlobalOverload ? 503 : 429).end();
For clients consuming this header, the parsing rules (numeric seconds vs HTTP-date) and browser-side backoff are covered in handling 429 HTTP responses.
Bounded-queue sizing math
The cardinal rule: the queue must be bounded, and the bound must be a deliberate number, not whatever the heap allows. An unbounded queue doesn’t shed load — it converts a traffic spike into a latency spike and then an out-of-memory crash.
Size the queue from the maximum acceptable added latency, because a request at the back of a full queue waits capacity / drain_rate seconds before it drains:
capacity = max_acceptable_latency_seconds × drain_rate
With R = 200/s and a 2-second latency budget, capacity = 400. Memory follows directly: capacity × bytes_per_queued_item. If each queued item holds a 2 KB request envelope, 400 slots cost ~800 KB per queue — multiply by the number of distinct keys you queue per-key. A per-IP queue across 50,000 active IPs at even 50 slots each is 2.5M envelopes; that is why per-key leaky buckets usually store only a counter or request id, deferring the payload elsewhere.
Adaptive drain rate
A fixed drain rate wastes headroom when downstream is healthy and overruns it when downstream is struggling. An adaptive drain ties R to a live downstream signal — p99 latency, error rate, or connection-pool saturation — and backs off automatically.
// Adapt drain rate from downstream health. Call on each drain tick or on a timer.
// Clamps between a floor (never fully stall) and a ceiling (downstream's safe max).
function adaptDrainRate(current: number, downstreamP99Ms: number): number {
const FLOOR = 20; // requests/sec — minimum forward progress
const CEILING = 200; // requests/sec — downstream's measured safe maximum
const TARGET_P99 = 150; // ms — SLO for the downstream call
// Multiplicative decrease when slow, additive increase when healthy (AIMD).
if (downstreamP99Ms > TARGET_P99) {
return Math.max(FLOOR, Math.floor(current * 0.75)); // back off fast
}
return Math.min(CEILING, current + 10); // recover gently
}
This AIMD shape is the same congestion-control logic TCP uses: cut hard on a bad signal, recover slowly on a good one. Pair it with the distributed-state caveats in distributed algorithm sync when the queue and its drain authority span multiple nodes — only one node should own the drain to avoid double-counting.
Operator checklist
- Set an explicit, finite
capacity - Derive
capacityfrom a latency budget (max_latency × drain_rate - Return
429for per-key exhaustion and503 - Compute
Retry-After - Emit
leaky_bucket_overflow_totalandleaky_bucket_queue_depth
Verification & testing
Drive a burst well above the drain rate and confirm three things: the queue stops at capacity, overflow returns the right status with a sane Retry-After, and memory stays flat.
# 1000 rps for 5s against a service draining at 200 rps, capacity 400.
# Expect ~200 rps of 2xx, the rest 429 with Retry-After ~2-3s, flat RSS.
hey -z 5s -q 1000 -c 100 https://api.example.com/v1/webhooks/deliver \
| grep -E "Status code|Requests/sec"
# Watch the bound hold and overflow climb while RSS stays flat:
watch -n1 'curl -s localhost:9464/metrics | grep -E "leaky_bucket_(queue_depth|overflow_total)"'
If queue depth climbs past capacity or RSS grows without bound, your queue isn’t actually bounded — fix that before anything else.
Frequently Asked Questions
Should overflow return 429 or 503?
Return 429 when a specific caller exceeded its allowance, and 503 when the whole system is saturated regardless of who's calling. A full shared queue caused by aggregate load is closer to 503; a per-key queue a single client filled is 429. Both should carry a Retry-After.
How do I compute Retry-After for a full queue?
Use queue_depth / drain_rate — the time until a slot frees — then add up to ~20% jitter and round up to a whole second. A constant value sends every rejected client back simultaneously and triggers a synchronized retry wave.
What's the difference between tail-drop and reject-newest?
They're the same policy: when the queue is full, the arriving (newest) request is dropped and the queued work is preserved. The opposite is drop-oldest (head-drop), which evicts the front of the queue to admit the newcomer — useful only when stale entries are worthless.
How large should the queue be?
Derive it from your latency budget: capacity = max_acceptable_latency_seconds × drain_rate. A request at the back of a full queue waits capacity / drain_rate seconds, so the capacity directly sets your worst-case added latency. Never leave it unbounded.
When is a dead-letter queue worth the extra infrastructure?
When losing an overflowed request is unacceptable and the work can be processed asynchronously — payment webhooks, billing events, audit records. For synchronous reads where the client is waiting, a DLQ adds nothing; reject with 429 and let the client retry.
Related
- Leaky Bucket Mechanics — the parent reference on queue depth, drain rate, and constant egress.
- How to Choose Between Token Bucket and Leaky Bucket — when smoothing beats burst tolerance.
- Handling 429 HTTP Responses — how clients parse Retry-After and back off.
- Distributed Algorithm Sync — single drain authority and consistency across nodes.
- Token Bucket Implementation — the burst-tolerant alternative model.