Alerting & SLOs for Rate Limiters
The hardest thing about alerting on a rate limiter is that its normal, correct behaviour — returning 429 Too Many Requests and rejecting traffic — is indistinguishable, to a naïve alert, from a serious incident. This guide sits under the Observability & Operations area and exists to resolve that ambiguity: to define when a 429 spike is the system working as designed (shedding abuse) versus the system failing (misconfiguration blocking legitimate users, or worse, the limiter failing open and enforcing nothing). Get the distinction wrong and you either page on every scraper that hits your API, or you sleep through a real outage because the limiter’s metrics looked “busy.”
The answer is not “alert on 429 rate.” It is to alert on the meaning of the 429 rate — sliced by who is being blocked, correlated with store health, and measured against a budget you decided in advance. This guide covers the healthy-vs-broken decision table, an availability SLO and error budget for the limiter path, how to detect the silent fail-open, and multi-window burn-rate alerting that pages on real budget burn without firing on noise.
Healthy 429 vs broken 429
A 429 is the limiter doing exactly what you built it to do. The question is never “are we returning 429s?” — you always are — but “are the right clients being rejected for the right reason?” The decision table below is the core of all limiter alerting; every rule later in this guide is a mechanization of one of its rows.
| Symptom | Likely cause | Action |
|---|---|---|
| 429 spike concentrated in anonymous/free, store healthy, paid tiers unaffected | Abuse, scraping, or a runaway free client being shed correctly | None. This is the limiter working. Do not page. |
| 429 spike hitting pro/enterprise tiers, store healthy | Limit set too tight, bad deploy lowered a quota, or a noisy-neighbour bug | Alert (ticket/Slack). Legitimate paying users are blocked; fix the config. |
| 429 rate climbs fleet-wide across all tiers at once, store healthy | A global limit misconfigured, or a shared quota exhausted by one tenant | Alert. Check the most recent limit/config change. |
| 429 rate drops to ~0 while traffic is steady | Limiter failing open — store unreachable, admitting everything unmetered | Page. No limiting is happening; backend is exposed. |
| Store error rate up, decision latency p99 climbing, 429s erratic | Backing store (Redis) degrading; limiter about to fail open | Page. The dependency is failing; the limiter follows. |
| 429s with Retry-After that clients ignore, retry storms | Client backoff broken, not a limiter fault | Investigate client; limiter is correct. |
| Decision latency p99 climbs but 429 rate and store errors are flat | Limiter hot path slow — store round-trip, GC pause, or connection-pool starvation | Ticket. Degrades every request behind it; chase before it becomes fail-open. |
| 429s spike on one route while sibling routes are flat, store healthy | A route-specific limit was lowered, or one endpoint got a traffic shift | Ticket. Diff the route’s limit config against last deploy. |
| 429 rate normal but ratelimit_active_keys jumps sharply on anonymous | Distributed scraper / credential-stuffing fanning across many keys | Investigate (security). Not a limiter fault: the limiter is shedding correctly. |
| Fail-open counter ticks briefly then self-clears, store recovers | Transient store blip (failover, brief network partition) absorbed correctly | None if within budget — this is what the error budget is for. |
| 429s correlate exactly with deploys, then subside | Counter reset misread, or a config reload momentarily applied stale limits | Ticket. Verify queries use rate(); check config-reload metric. |
The two rows that must page are the fail-open row (429s drop to ~0 while traffic is steady) and the store-degradation row (store errors and decision latency rising). Counterintuitively, silence from the limiter is more dangerous than noise. That is why fail-open detection, below, is the single most important alert in this whole area. Note how many rows resolve to “ticket” or “no action”: the decision table’s real job is to keep the page list short, because every false page erodes trust in the true ones.
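As a rough illustration of how the table mechanizes, here is a minimal Python sketch covering only four of its rows. The `Snapshot` fields and the `triage` function are hypothetical names invented for this example, and the booleans stand in for thresholds you would have to choose yourself. It also ignores transience, which the burn-rate rules later in this guide handle.

```python
from dataclasses import dataclass

# Tier names taken from the decision table; adjust to your own tiers.
PAID_TIERS = {"pro", "enterprise"}


@dataclass
class Snapshot:
    """Hypothetical summary of limiter signals over one evaluation window."""
    blocked_tiers: set[str]           # tiers currently seeing elevated 429s
    store_healthy: bool               # store errors and latency in normal range
    rate_429_dropped_to_zero: bool    # 429 rate fell to ~0 versus its baseline
    traffic_steady: bool              # request volume roughly unchanged


def triage(s: Snapshot) -> str:
    """Return 'page', 'ticket', or 'none' for a subset of the table's rows."""
    # Store-degradation row: the dependency is failing, the limiter follows.
    if not s.store_healthy:
        return "page"
    # Fail-open row: traffic continues but nothing is being rejected.
    if s.rate_429_dropped_to_zero and s.traffic_steady:
        return "page"
    # Paid tiers blocked while the store is healthy: a config problem.
    if s.blocked_tiers & PAID_TIERS:
        return "ticket"
    # Only anonymous/free blocked (or nobody): the limiter is working.
    return "none"
```

The order of the checks matters: store health is evaluated first so that a 429 spike on paid tiers during a store outage is treated as the outage it is, not as a quota misconfiguration.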
An SLO and error budget for the limiter
A rate limiter is on the critical path of every request, so it needs its own service-level objective — separate from the API’s overall SLO. The subtlety: a 429 is not an error against the limiter’s SLO when the client genuinely exceeded its quota. The limiter’s job includes returning 429s. So the SLO is not “99.9% of requests are not 429.”
Define the limiter’s availability as the fraction of requests for which the limiter rendered a correct, timely decision — where the bad events are:
- The limiter could not decide and failed open (admitted traffic it should have evaluated against the store).
- The limiter could not decide and failed closed (returned 429/503 to a client that was under its quota because the store was down).
- The decision exceeded the latency objective (e.g. took longer than 5 ms), degrading every request behind it.
A reasonable starting SLO: 99.95% of requests get a correct limiter decision within 5 ms, measured over a 30-day window. That budget, 0.05% of requests, is equivalent at steady traffic to about 21.6 minutes of “incorrect or slow decisions” per month. Fail-open events, fail-closed-under-quota events, and over-latency decisions all draw down the same budget.
# Good events: requests that got a correct decision within the latency objective.
# Bad events here are fail-open + store errors; over-latency is tracked separately.
# Assumption to verify: if one store failure increments both counters, this sum
# double-counts it and the SLI reads slightly pessimistic.
1 - (
    sum(rate(ratelimit_fail_open_total[30d]))
  + sum(rate(ratelimit_store_errors_total[30d]))
) / sum(rate(ratelimit_requests_total[30d]))
The point of expressing it as a budget is restraint: you do not page on a single fail-open event, you page when fail-opens are burning the budget fast enough that the month’s allowance will be gone before the window resets. That is what burn-rate alerting, below, encodes.
Choosing the SLI: what counts as a “good” decision
An error budget is only as honest as the indicator underneath it, and the limiter’s SLI has a sharp edge: the obvious denominator (all requests) and the obvious numerator (non-429 requests) give you a wrong SLI, because it punishes the limiter for doing its job. A correctly-shed scraper is a successful limiter decision, not a failure. The SLI must count correct, timely decisions, where “correct” means the limiter actually consulted its authority (the store) and rendered the verdict the configuration intended, and “timely” means under the latency objective. Concretely, a request is a good event when the limiter evaluated it against live state within the latency budget — regardless of whether the verdict was allow or deny. It is a bad event only when the mechanism failed: it gave up and admitted without checking (fail-open), it rejected an under-quota client because it could not check (fail-closed-under-quota), or it answered too slowly.
This framing has a consequence worth internalizing: you cannot compute the limiter’s true SLI from the 429 rate alone, because the 429 rate cannot distinguish a correct deny from a fail-closed deny. You need the fail-open and store-error counters — the same ones the metrics guide insists on — as the numerator’s bad events. The SLI is a mechanism-health indicator wearing the clothes of an availability number, which is exactly why it survives the limiter’s job of rejecting traffic.
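The good-event definition above can be sketched per request. This is an illustration only: the `Decision` record and its field names are hypothetical, and a real limiter would more likely export counters than keep per-request records.

```python
from dataclasses import dataclass

LATENCY_OBJECTIVE_MS = 5.0  # the 5 ms objective from the starting SLO


@dataclass
class Decision:
    """Hypothetical per-request record; field names are illustrative."""
    consulted_store: bool  # False means the limiter gave up (failed open or closed)
    verdict: str           # "allow" or "deny"
    latency_ms: float      # time taken to render the decision


def is_good_event(d: Decision) -> bool:
    """Good = evaluated against live state within the latency objective.

    The verdict is deliberately ignored: a correct deny is a success.
    """
    return d.consulted_store and d.latency_ms <= LATENCY_OBJECTIVE_MS


def sli(decisions: list[Decision]) -> float:
    """Fraction of good events; treated as 1.0 for an empty window."""
    if not decisions:
        return 1.0
    return sum(is_good_event(d) for d in decisions) / len(decisions)
```

Note that `verdict` never appears in `is_good_event`. That omission is the whole point: the indicator measures whether the mechanism worked, not which answer it gave.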
Sizing the budget
A target is a business decision, not a default, and the right number depends on what sits behind the limiter. A limiter guarding a billing-critical write path — where a fail-open admits unmetered revenue-affecting traffic — wants a tight objective like 99.99% (about 4.3 minutes of bad decisions per 30 days) because each bad decision is expensive. A limiter in front of a best-effort public read API can live at 99.9% (about 43 minutes per 30 days) because a brief fail-open there protects availability and costs little. The middle ground, 99.95% (~21.6 minutes/30 days), is a sane default for a typical production API. The table below makes the budget concrete so the burn-rate math later has something to divide into:
| SLO target | Bad-decision budget / 30 days | Fits |
|---|---|---|
| 99.9% | ~43 min | Best-effort public reads; fail-open is the safe direction |
| 99.95% | ~21.6 min | Typical production API; balanced default |
| 99.99% | ~4.3 min | Billing-critical or fail-closed paths where each bad decision costs money |
Pick the target deliberately and write it down, because every burn-rate threshold in the next section is derived from it. Changing the SLO silently changes when you get paged.
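The minutes in the table follow directly from the target and the window length. A small sketch of that arithmetic, assuming traffic is roughly uniform so that a fraction of requests maps onto a fraction of time (the function name is made up for this example):

```python
def budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of bad decisions allowed per window, assuming uniform traffic."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    for target in (0.999, 0.9995, 0.9999):
        print(f"{target:.2%} -> {budget_minutes(target):.1f} min")
```

If your traffic is strongly diurnal, the minutes figure is only a mental model: a fail-open at peak spends more request budget than the same duration at 4 a.m.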
Detecting fail-open
Fail-open is the failure that hides. When the store is unreachable and the limiter admits everything, the visible symptoms are all absences: 429s vanish, blocks go to zero, the dashboard looks calm. You cannot alert on the absence of a problem; you must alert on the positive signal that you instrumented for exactly this. That signal is ratelimit_fail_open_total, which — as covered in Metrics & Instrumentation — increments every time the limiter admits a request because it could not reach the store.
# Any sustained fail-open is an incident: the limiter is enforcing nothing.
sum(rate(ratelimit_fail_open_total[5m])) > 0
Pair it with a corroborating signal so a single blip does not page: require both a non-zero fail-open rate and an elevated store-error rate, sustained for a couple of minutes. The exact rule YAML is in Alerting on 429 error rates. If you fail closed instead, the inverse holds: watch for a 429 spike that correlates with store errors, because those 429s are going to clients who were under quota.
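The corroboration logic can be sketched outside of any alerting system. In this hypothetical Python version, both inputs are time-ordered rate samples taken at the same scrape interval, and `store_error_threshold` and `sustain_samples` are placeholders for values you would tune; none of these names come from the child guide's rules.

```python
from typing import Sequence


def should_page_fail_open(
    fail_open_rates: Sequence[float],
    store_error_rates: Sequence[float],
    store_error_threshold: float,
    sustain_samples: int,
) -> bool:
    """True only if the most recent `sustain_samples` samples ALL show
    a non-zero fail-open rate AND a store-error rate above the threshold."""
    if sustain_samples <= 0:
        raise ValueError("sustain_samples must be positive")
    # Not enough history yet: do not page on partial evidence.
    if min(len(fail_open_rates), len(store_error_rates)) < sustain_samples:
        return False
    recent = zip(
        fail_open_rates[-sustain_samples:],
        store_error_rates[-sustain_samples:],
    )
    return all(fo > 0 and se > store_error_threshold for fo, se in recent)
```

With samples every 30 seconds, `sustain_samples=4` approximates "sustained for a couple of minutes". A single failover blip produces one or two non-zero samples and does not satisfy the `all(...)`.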
Multi-window, multi-burn-rate alerting
The standard tool for alerting on an error budget is the multi-window multi-burn-rate alert, drawn from Google’s SRE practice. A single threshold (“alert if fail-open rate > X”) forces an impossible tradeoff: set it sensitive and it pages on transient blips; set it lax and it misses slow burns. Multi-burn-rate solves this by combining a long and a short window at each severity.
The mechanism: define a burn rate as how fast you are consuming the error budget relative to the steady rate that would exactly exhaust it over the SLO window. A burn rate of 1 spends the whole month’s budget in exactly a month; a burn rate of 14.4 would spend it in two days.
| Severity | Burn rate | Long window | Short window | Meaning |
|---|---|---|---|---|
| Page | 14.4 | 1 hour | 5 min | Budget gone in ~2 days at this rate — wake someone |
| Page | 6 | 6 hours | 30 min | Budget gone in ~5 days — sustained serious burn |
| Ticket | 3 | 24 hours | 2 hours | Slow burn; budget gone in ~10 days if unchecked |
| Ticket | 1 | 72 hours | 6 hours | Gentle erosion; investigate during business hours |
The short window gates the long window: the alert fires only when both the long-window burn rate and the short-window burn rate exceed the threshold. The long window ensures the burn is real and sustained; the short window ensures it is still happening now so the alert resolves quickly once you fix it. This is what stops a brief spike from paging and stops a resolved incident from lingering as a stuck alert.
Where do the magic numbers come from? They are not arbitrary: each pairs a budget-consumption fraction with a detection window, following the SRE workbook. The 14.4 figure is “consume 2% of a 30-day budget in 1 hour”: 0.02 × (30 days / 1 hour) = 0.02 × 720 = 14.4. The 6 figure is “consume 5% in 6 hours”: 0.05 × (720 / 6) = 6. The two ticket rows both use 10%: 0.10 × (720 / 24) = 3 and 0.10 × (720 / 72) = 1. Pairing a small fraction with a short window and a larger fraction with a longer window is what lets one severity catch a fast burn quickly and another catch a slow burn before it quietly drains the month. The fractions (2%, 5%, 10%) decide how much budget you are willing to spend before alerting; the windows decide how fast you notice.
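The same arithmetic generalizes to any fraction and window, which is useful if you adopt a different SLO window or want to check a threshold someone else chose. A minimal sketch, with function names invented for this example:

```python
def burn_rate_threshold(
    budget_fraction: float,
    window_hours: float,
    slo_window_hours: float = 30 * 24,
) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent
    within `window_hours`, for an SLO measured over `slo_window_hours`."""
    if window_hours <= 0 or slo_window_hours <= 0:
        raise ValueError("windows must be positive")
    return budget_fraction * slo_window_hours / window_hours


def days_to_exhaustion(burn_rate: float, slo_window_days: float = 30) -> float:
    """Days until the whole budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return slo_window_days / burn_rate
```

Calling `burn_rate_threshold(0.02, 1)` reproduces the 14.4 page threshold, and `days_to_exhaustion(14.4)` gives a little over two days, matching the "Meaning" column of the table.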
Expressed as a query, a burn rate is just the SLI’s bad-event ratio over a window divided by the budget you allotted. With a 99.95% SLO (budget = 0.0005), the page-severity rule looks like this:
# Burn rate = (observed bad-event ratio) / (error budget).
# Fires only when BOTH windows exceed the threshold (short gates long).
(
    sum(rate(ratelimit_fail_open_total[1h]))
  + sum(rate(ratelimit_store_errors_total[1h]))
) / sum(rate(ratelimit_requests_total[1h])) / 0.0005 > 14.4
and
(
    sum(rate(ratelimit_fail_open_total[5m]))
  + sum(rate(ratelimit_store_errors_total[5m]))
) / sum(rate(ratelimit_requests_total[5m])) / 0.0005 > 14.4
The and is the gate: the 1-hour term confirms the burn is sustained, the 5-minute term confirms it is still live. Drop the short term and a resolved incident keeps paging for an hour after you fixed it; drop the long term and a 90-second blip wakes you for nothing. Worth a sanity check on the numbers: a 14.4× burn means you are spending budget 14.4 times faster than the steady rate that would exactly exhaust it over 30 days, so the month’s 21.6 minutes is gone in roughly 30 days / 14.4 ≈ 2 days — which is exactly the “wake someone” threshold the table promises. The concrete Prometheus rule YAML for every severity row, with recording rules to keep the queries cheap, is in the child guide below.
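To see the gate behave on concrete numbers, the rule can be mirrored in a few lines of Python. This is a sketch for reasoning about the logic, not a replacement for the Prometheus rule; `WindowCounts` and its fields are hypothetical and hold counter increases over a window (the ratio of increases equals the ratio of rates).

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Hypothetical counter increases observed over one window."""
    fail_open: float
    store_errors: float
    requests: float


def burn_rate(c: WindowCounts, error_budget: float) -> float:
    """Observed bad-event ratio divided by the error budget."""
    if c.requests == 0:
        return 0.0  # no traffic, nothing burned
    return (c.fail_open + c.store_errors) / c.requests / error_budget


def gate(
    long_window: WindowCounts,
    short_window: WindowCounts,
    error_budget: float = 0.0005,  # 99.95% SLO
    threshold: float = 14.4,       # page-severity burn rate
) -> bool:
    """Fire only when BOTH windows exceed the threshold."""
    return (
        burn_rate(long_window, error_budget) > threshold
        and burn_rate(short_window, error_budget) > threshold
    )
```

Three cases show the two failure modes the gate prevents: a sustained burn (both windows at 20×) fires; a resolved incident (long window still at 20×, short window at 0) does not; and a brief blip (short window at 20×, long window at 0.2×) does not.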
Routing and severity
An alert that pages the wrong person at the wrong urgency trains people to ignore it. Map the decision table to routing:
- Page (P1/P2): fail-open detected, or store-error burn rate high. These mean the limiter is broken or about to be.
- Ticket / Slack (P3): elevated 429s on paid tiers with a healthy store (one route or fleet-wide), slow budget burn, p99 latency creeping. Real but not “wake up.”
- Suppressed / no alert: 429 spikes confined to anonymous/free with a healthy store. This is the limiter succeeding; alerting on it is the classic false positive that erodes trust in the whole alerting system.
Always include the offending route, tier, and current burn rate in the alert annotation so the responder starts triage from the decision table, not from a blank dashboard.
Alert fatigue and the cost of a false page
The decision table’s discipline only holds if you treat every alert as a liability until proven otherwise. Alert fatigue is not a soft “people get annoyed” problem; it is a hard failure mode where a responder, conditioned by a stream of pages that turned out to be healthy shedding, dismisses the one page that was a real fail-open. A limiter is unusually prone to this because its correct behavior, returning 429s, has the exact shape of its failure, so naïve alerts fire constantly and responders correctly learn to treat them as false. The defenses are concrete:
- Multi-window gating (above) is the first line: it is engineered to suppress the transient spikes that make up most false pages.
- Slice before you threshold. An alert on raw fleet 429 rate is almost pure noise. The same alert sliced to tier=~"pro|enterprise" fires only on the population whose 429s actually mean something. Bake the slice into the alert expression, not into the responder’s head.
- Require corroboration for the scary alerts. Fail-open should page on fail_open_rate > 0 AND store_error_rate elevated, sustained; never on a single fail-open sample, which a routine store failover will produce.
- Set for: durations that match the window. A for: 2m on a 5-minute-window rule kills the sub-blip without delaying a real incident meaningfully.
- Run an alert review. Periodically pull every page from the last month and label it actionable or not. A page that was “not actionable” three times running is a rule to tighten or delete, not tolerate.
The governing principle: the page channel is for the two failure rows of the decision table and nothing else. Everything that is “real but not wake-someone” goes to a ticket or Slack queue; everything that is “the limiter succeeding” is suppressed entirely. Each misrouted alert spends trust you cannot quickly earn back.
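The alert review in the list above is easy to automate once pages are labeled. A minimal sketch, assuming you can export the month's pages as chronological `(rule_name, actionable)` pairs; the function name and input shape are hypothetical, not tied to any particular paging tool:

```python
from collections import defaultdict
from typing import Iterable


def rules_to_tighten(
    pages: Iterable[tuple[str, bool]],
    streak: int = 3,
) -> list[str]:
    """Return rules whose most recent `streak` pages were all non-actionable.

    `pages` must be in chronological order, oldest first.
    """
    if streak <= 0:
        raise ValueError("streak must be positive")
    history: dict[str, list[bool]] = defaultdict(list)
    for rule, actionable in pages:
        history[rule].append(actionable)
    return sorted(
        rule
        for rule, labels in history.items()
        if len(labels) >= streak and not any(labels[-streak:])
    )
```

Only the trailing streak counts, so a rule that produced noise earlier in the month but has since paged on something real is not flagged.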
Wiring alerts to runbooks
An alert that fires at 3 a.m. should not require the responder to reconstruct this guide from memory. Every limiter alert should carry a runbook_url annotation pointing to a short, alert-specific procedure — and the procedure should be a direct walk down the relevant decision-table row, not a generic “investigate the limiter.”
# Each alert links to a runbook keyed to its decision-table row.
annotations:
  summary: "Limiter fail-open on {{ $labels.route }} ({{ $labels.tier }})"
  description: "fail_open_rate={{ $value | humanize }}/s; store errors elevated."
  runbook_url: "https://runbooks.internal/ratelimit/fail-open"
  dashboard_url: "https://grafana.internal/d/ratelimit/overview?var-route={{ $labels.route }}"
A good fail-open runbook reads almost verbatim from this guide: confirm ratelimit_fail_open_total is climbing and ratelimit_store_errors_total corroborates; check store (Redis) health directly, because the store (see Redis Counter Architecture) is the dependency that determines fail-open behaviour; decide whether to keep failing open (availability) or flip to fail-closed (protect the backend) per the path’s criticality; and once the store recovers, verify the fail-open rate returns to zero before resolving. The runbook closes the loop the whole area opens: signals feed metrics, metrics feed alerts, alerts feed a procedure, and the procedure points back at the store whose health started it. The exact rule YAML, annotations, and the per-severity runbook stubs are assembled in the child guide below.
Child guides
- Alerting on 429 Error Rates — the concrete Prometheus alerting rules: 429-ratio spike by tier, store-error rate, fail-open detection, and the full multi-window multi-burn-rate rule set with routing and severity.
Related
- Observability & Operations — the parent area covering headers, metrics, and alerting.
- Metrics & Instrumentation — the metrics every alert here queries, including the fail-open counter.
- Grafana Rate Limit Dashboards — the panels you look at once an alert fires.
- Redis Counter Architecture — the store whose health determines fail-open behaviour.