What is the difference between tail-drop and reject-newest?

They are the same policy: when the queue is full the arriving newest request is dropped and queued work is preserved. The opposite is drop-oldest, which evicts the front of the queue to admit the newcomer, useful only when stale entries are worthless.

Leaky Bucket Overflow and Backpressure

Q: Should overflow return 429 or 503?

Return 429 when a specific caller exceeded its allowance and 503 when the whole system is saturated regardless of who is calling. A full shared queue caused by aggregate load is closer to 503; a per-key queue a single client filled is 429. Both should carry a Retry-After header.

Q: How do I compute Retry-After for a full queue?

Use queue_depth divided by drain_rate, the time until a slot frees, then add up to about 20 percent jitter and round up to a whole second. A constant value sends every rejected client back simultaneously and triggers a synchronized retry wave.

Q: How large should the queue be?

Derive it from your latency budget: capacity equals max acceptable latency in seconds times drain rate. A request at the back of a full queue waits capacity divided by drain rate seconds, so capacity directly sets worst-case added latency. Never leave it unbounded.

A leaky bucket is only as good as its behavior when the queue is full — the steady-state case is trivial, but the overflow case is where you either shed load gracefully or amplify an incident. This guide drills into the overflow and backpressure mechanics that the Leaky Bucket Mechanics reference introduces: which request you drop, which status code you return, how you compute Retry-After from live queue depth, and how you size the queue so it never grows unbounded. Get these wrong and a transient spike becomes a retry storm; get them right and the queue absorbs the spike and signals clients to slow down before they pile on.

The problem in concrete numbers

Picture a webhook-delivery service with a downstream that safely sustains 200 requests/second (drain rate R = 200/s) and a queue sized for 2 seconds of buffer, so capacity = 400. Upstream throws a 1,000 rps burst for 5 seconds. The queue fills in 400 / (1000 − 200) = 0.5 s, then sits pinned at 400 with 800 rps of overflow for the rest of the burst. The questions that decide whether this is a non-event or an outage: which 800 do you drop, what do you tell their clients, and does your queue actually stop at 400 or quietly grow to OOM the process?

Overflow strategies compared

When the queue is full, you must choose what to sacrifice. The four policies below differ in which request dies, whether the client learns about it, and what it costs to operate.

Strategy	What it does	Client signal	Best for	Risk
Reject-newest (tail-drop)	Drop the arriving request; keep the queue intact	`429` + `Retry-After`	Most APIs; clients can retry	Newest requests starve under sustained overload
Drop-oldest (head-drop)	Evict the front of the queue to admit the newcomer	None to the dropped client (already accepted)	Real-time feeds where stale data is worthless	Silently breaks the “accepted” contract; hard to debug
Reject-newest + 503	Drop the arrival, signal “system overloaded” not “you’re over quota”	`503` + `Retry-After`	Capacity overload, not per-client abuse	Clients may treat 503 as a hard outage
Dead-letter routing	Move overflow to a DLQ for async reprocessing	`202 Accepted` (deferred)	Webhooks, jobs that must not be lost	Adds infra; not viable for synchronous reads

Selection rules:

Default to reject-newest with 429 for client-facing APIs: it preserves FIFO fairness for already-queued work and gives the caller an actionable retry hint.
Use 503 instead of 429 when the cause is global capacity overload rather than one client exceeding its share — the semantics differ, and so should client retry behavior. See the FAQ for the exact distinction.
Use dead-letter routing only when losing a request is unacceptable (payment webhooks, billing events) and the work can be processed asynchronously.
Avoid drop-oldest unless staleness genuinely makes old entries worthless; it violates the implicit promise that an enqueued request will be served.

Backpressure signaling: 429 vs 503 and Retry-After

Backpressure is the explicit signal that tells a client to slow down before it makes things worse. Two decisions matter: the status code and the Retry-After value.

429 Too Many Requests means this caller exceeded its allowance. Pair it with X-RateLimit-* headers so the client knows its budget.
503 Service Unavailable means the system as a whole is saturated and can’t take the request right now, independent of who’s asking. Use it when overall queue depth — not a per-key counter — is the cause.

Either way, derive Retry-After from live queue depth, not a constant. A fixed Retry-After: 1 sends every rejected client back at the same instant, producing a synchronized retry wave (a thundering herd). The honest estimate is “how long until a drain slot frees up,” which is queue_depth / drain_rate, plus jitter to desynchronize retries.

// Compute Retry-After (seconds) from current queue depth and drain rate.
// Returns an integer >= 1, since Retry-After in seconds form is whole seconds.
function retryAfterSeconds(queueDepth: number, drainRatePerSec: number): number {
  // Time for the queue to clear enough for this request to be admitted.
  const drainSeconds = queueDepth / drainRatePerSec;
  // Add bounded jitter (up to 20%) to desynchronize clients and avoid a retry wave.
  const jitter = drainSeconds * Math.random() * 0.2;
  return Math.max(1, Math.ceil(drainSeconds + jitter));
}

// Example: depth=400, drain=200/s -> ~2s base, 2-3s after jitter and ceil.
// On overflow:
//   res.set('Retry-After', String(retryAfterSeconds(depth, 200)));
//   res.status(isGlobalOverload ? 503 : 429).end();

For clients consuming this header, the parsing rules (numeric seconds vs HTTP-date) and browser-side backoff are covered in handling 429 HTTP responses.

Bounded-queue sizing math

The cardinal rule: the queue must be bounded, and the bound must be a deliberate number, not whatever the heap allows. An unbounded queue doesn’t shed load — it converts a traffic spike into a latency spike and then an out-of-memory crash.

Size the queue from the maximum acceptable added latency, because a request at the back of a full queue waits capacity / drain_rate seconds before it drains:

capacity = max_acceptable_latency_seconds × drain_rate

With R = 200/s and a 2-second latency budget, capacity = 400. Memory follows directly: capacity × bytes_per_queued_item. If each queued item holds a 2 KB request envelope, 400 slots cost ~800 KB per queue — multiply by the number of distinct keys you queue per-key. A per-IP queue across 50,000 active IPs at even 50 slots each is 2.5M envelopes; that is why per-key leaky buckets usually store only a counter or request id, deferring the payload elsewhere.

Adaptive drain rate

A fixed drain rate wastes headroom when downstream is healthy and overruns it when downstream is struggling. An adaptive drain ties R to a live downstream signal — p99 latency, error rate, or connection-pool saturation — and backs off automatically.

// Adapt drain rate from downstream health. Call on each drain tick or on a timer.
// Clamps between a floor (never fully stall) and a ceiling (downstream's safe max).
function adaptDrainRate(current: number, downstreamP99Ms: number): number {
  const FLOOR = 20;      // requests/sec — minimum forward progress
  const CEILING = 200;   // requests/sec — downstream's measured safe maximum
  const TARGET_P99 = 150; // ms — SLO for the downstream call
  // Multiplicative decrease when slow, additive increase when healthy (AIMD).
  if (downstreamP99Ms > TARGET_P99) {
    return Math.max(FLOOR, Math.floor(current * 0.75)); // back off fast
  }
  return Math.min(CEILING, current + 10);               // recover gently
}

This AIMD shape is the same congestion-control logic TCP uses: cut hard on a bad signal, recover slowly on a good one. Pair it with the distributed-state caveats in distributed algorithm sync when the queue and its drain authority span multiple nodes — only one node should own the drain to avoid double-counting.

Operator checklist

Set an explicit, finite capacity Set an explicit, finite `capacity`; verify the queue rejects at the bound under a load test rather than growing.
Derive capacity from a latency budget (max_latency × drain_rate Derive `capacity` from a latency budget (`max_latency × drain_rate`), not a round number.
Choose an overflow policy per endpoint (reject-newest for reads, dead-letter for must-not-lose writes) and document it.
Return 429 for per-key exhaustion and 503 Return `429` for per-key exhaustion and `503` for global overload; don't conflate them.
Compute Retry-After Compute `Retry-After` from live queue depth and add jitter to prevent synchronized retries.
Emit leaky_bucket_overflow_total and leaky_bucket_queue_depth Emit `leaky_bucket_overflow_total` and `leaky_bucket_queue_depth` metrics and alert when depth exceeds 80% of capacity.
Cap per-key memory: queue request ids or counters, not full payloads, for high-cardinality keys.
If draining across nodes, elect a single drain owner (distributed lock or single consumer).

Verification & testing

Drive a burst well above the drain rate and confirm three things: the queue stops at capacity, overflow returns the right status with a sane Retry-After, and memory stays flat.

# 1000 rps for 5s against a service draining at 200 rps, capacity 400.
# Expect ~200 rps of 2xx, the rest 429 with Retry-After ~2-3s, flat RSS.
hey -z 5s -q 1000 -c 100 https://api.example.com/v1/webhooks/deliver \
  | grep -E "Status code|Requests/sec"

# Watch the bound hold and overflow climb while RSS stays flat:
watch -n1 'curl -s localhost:9464/metrics | grep -E "leaky_bucket_(queue_depth|overflow_total)"'

If queue depth climbs past capacity or RSS grows without bound, your queue isn’t actually bounded — fix that before anything else.

Frequently Asked Questions

Should overflow return 429 or 503?

Return 429 when a specific caller exceeded its allowance, and 503 when the whole system is saturated regardless of who's calling. A full shared queue caused by aggregate load is closer to 503; a per-key queue a single client filled is 429. Both should carry a Retry-After.

How do I compute Retry-After for a full queue?

Use queue_depth / drain_rate — the time until a slot frees — then add up to ~20% jitter and round up to a whole second. A constant value sends every rejected client back simultaneously and triggers a synchronized retry wave.

What's the difference between tail-drop and reject-newest?

They're the same policy: when the queue is full, the arriving (newest) request is dropped and the queued work is preserved. The opposite is drop-oldest (head-drop), which evicts the front of the queue to admit the newcomer — useful only when stale entries are worthless.

How large should the queue be?

Derive it from your latency budget: capacity = max_acceptable_latency_seconds × drain_rate. A request at the back of a full queue waits capacity / drain_rate seconds, so the capacity directly sets your worst-case added latency. Never leave it unbounded.

When is a dead-letter queue worth the extra infrastructure?

When losing an overflowed request is unacceptable and the work can be processed asynchronously — payment webhooks, billing events, audit records. For synchronous reads where the client is waiting, a DLQ adds nothing; reject with 429 and let the client retry.

Leaky Bucket Mechanics — the parent reference on queue depth, drain rate, and constant egress.
How to Choose Between Token Bucket and Leaky Bucket — when smoothing beats burst tolerance.
Handling 429 HTTP Responses — how clients parse Retry-After and back off.
Distributed Algorithm Sync — single drain authority and consistency across nodes.
Token Bucket Implementation — the burst-tolerant alternative model.