Alerting on 429 Error Rates
This guide is the rule file: the actual Prometheus alerting YAML and PromQL that turn rate-limiter metrics into pages and tickets. It sits under the Alerting & SLOs guide, which explains why a 429 spike can be healthy and how the error budget works; here you write the four rules that matter — 429-ratio spike by tier, store-error rate, fail-open detection, and multi-window multi-burn-rate — and wire their severity and routing. Every rule queries the ratelimit_* series from Prometheus metrics for rate limiting.
The traffic you are alerting on
Take a limiter at 2,000 rps with a 99.95% / 5 ms decision SLO over 30 days — a 0.05% budget, about 21.6 minutes of bad decisions per month. Normal block rate sits at 3–5%, almost all in anonymous/free. The rules below must page when paying users get blocked or the limiter fails open, and stay silent when a scraper is being shed correctly.
| Alert | Fires when | Severity |
|---|---|---|
| RateLimitFailOpen | Fail-open rate > 0 sustained, with store errors | Page |
| RateLimitStoreErrorBurn | Store-error ratio > 1% for 5 m | Page |
| RateLimit429SpikePaid | Block ratio high on pro/enterprise | Page |
| RateLimit429SpikeFree | Block ratio high but only on anonymous/free | Ticket / suppressed |
| RateLimitBudgetBurnFast / Slow | Error budget burning at 14.4× / 3× | Page / Ticket |
Operator checklist
- Confirm ratelimit_requests_total, ratelimit_store_errors_total, and ratelimit_fail_open_total are being exported.
- Write the 429-spike rule split by key_class.
- Set a for: duration on every rule.
- Route by severity label to pager vs Slack; include the route and tier labels.
Step 1 — Recording rules
Recording rules pre-compute the ratios so alert expressions stay readable and cheap. Evaluate them at the scrape interval.
```yaml
groups:
  - name: ratelimit_recording
    interval: 15s
    rules:
      # Fleet block ratio, and the same split by key class.
      - record: ratelimit:block_ratio:5m
        expr: |
          sum(rate(ratelimit_requests_total{decision="blocked"}[5m]))
          / sum(rate(ratelimit_requests_total[5m]))
      - record: ratelimit:block_ratio_by_tier:5m
        expr: |
          sum by (tier) (rate(ratelimit_requests_total{decision="blocked"}[5m]))
          / sum by (tier) (rate(ratelimit_requests_total[5m]))
      # SLO bad-event ratio: fail-open + store errors over total. One window per burn rule.
      - record: ratelimit:slo_error_ratio:5m
        expr: |
          (sum(rate(ratelimit_fail_open_total[5m])) + sum(rate(ratelimit_store_errors_total[5m])))
          / sum(rate(ratelimit_requests_total[5m]))
      - record: ratelimit:slo_error_ratio:1h
        expr: |
          (sum(rate(ratelimit_fail_open_total[1h])) + sum(rate(ratelimit_store_errors_total[1h])))
          / sum(rate(ratelimit_requests_total[1h]))
      - record: ratelimit:slo_error_ratio:6h
        expr: |
          (sum(rate(ratelimit_fail_open_total[6h])) + sum(rate(ratelimit_store_errors_total[6h])))
          / sum(rate(ratelimit_requests_total[6h]))
```
Step 2 — Fail-open detection (the rule that must never be missing)
Fail-open is invisible except through its counter. Page on a sustained non-zero fail-open rate, corroborated by store errors so a single odd increment does not wake anyone.
```yaml
  # Add under the top-level groups: key.
  - name: ratelimit_failopen
    rules:
      - alert: RateLimitFailOpen
        expr: |
          sum(rate(ratelimit_fail_open_total[5m])) > 0
          and
          sum(rate(ratelimit_store_errors_total[5m])) > 0
        for: 2m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter is failing open — no limiting in effect"
          description: "fail-open rate {{ $value | humanize }}/s with store errors present. The backend is unprotected; investigate the store immediately."
```
Step 3 — 429 spike, split by who is blocked
The whole point of the Alerting & SLOs decision table is that a free-tier spike is fine and a paid-tier spike is not. Encode that split so you never page on healthy shedding.
```yaml
  # Add under the top-level groups: key.
  - name: ratelimit_429
    rules:
      # Paid tiers blocked -> page. Legitimate paying users are being rejected.
      - alert: RateLimit429SpikePaid
        expr: |
          ratelimit:block_ratio_by_tier:5m{tier=~"pro|enterprise"} > 0.05
        for: 5m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Elevated 429s on paid tier {{ $labels.tier }}"
          description: "Block ratio {{ $value | humanizePercentage }} on {{ $labels.tier }}. Likely a too-tight limit or a bad config change."
      # Free/anonymous spike -> ticket only. Usually abuse being shed correctly.
      - alert: RateLimit429SpikeFree
        expr: |
          ratelimit:block_ratio_by_tier:5m{tier=~"anonymous|free"} > 0.5
        for: 15m
        labels: { severity: ticket, component: rate-limiter }
        annotations:
          summary: "Sustained heavy shedding on {{ $labels.tier }}"
          description: "Block ratio {{ $value | humanizePercentage }} on {{ $labels.tier }} — likely abuse. Confirm it is not a misclassified legitimate client."
```
Step 4 — Multi-window multi-burn-rate budget alerts
This is the rule set that protects the error budget without false positives. Each severity combines a long window (is the burn real and sustained?) with a short window (is it still happening now?). Both must exceed the burn-rate threshold, expressed as threshold = burn_rate × (1 − SLO) = burn_rate × 0.0005.
```yaml
  # Add under the top-level groups: key.
  - name: ratelimit_burnrate
    rules:
      # Fast burn: 14.4x. At this rate the 30-day budget is gone in ~2 days. Page.
      - alert: RateLimitBudgetBurnFast
        expr: |
          ratelimit:slo_error_ratio:1h > (14.4 * 0.0005)
          and
          ratelimit:slo_error_ratio:5m > (14.4 * 0.0005)
        for: 2m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter error budget burning fast (14.4x)"
          description: "Limiter SLO error ratio {{ $value | humanizePercentage }} over 1h and 5m. Budget exhausts in ~2 days at this rate."
      # Slow burn: 3x. Budget gone in ~10 days. Ticket, not a page.
      - alert: RateLimitBudgetBurnSlow
        expr: |
          ratelimit:slo_error_ratio:6h > (3 * 0.0005)
          and
          ratelimit:slo_error_ratio:1h > (3 * 0.0005)
        for: 15m
        labels: { severity: ticket, component: rate-limiter }
        annotations:
          summary: "Rate limiter error budget burning slowly (3x)"
          description: "Limiter SLO error ratio {{ $value | humanizePercentage }} over 6h and 1h. Investigate during business hours."
```
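The short window in each rule is what lets the alert resolve quickly. Once you fix the store, the short-window ratio (5 m for the fast burn, 1 h for the slow burn) drops below threshold well before the long window does: the page clears within minutes and the ticket within the hour, even though the long window is still elevated.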
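The numbers used in this step can be checked by hand. The following is a worked sketch using only the figures already stated in this guide (a 99.95% SLO over 30 days, burn rates of 14.4 and 3); the last two lines assume the burn is sustained for the whole long window.

```latex
\begin{align*}
\text{budget} &= (1 - 0.9995) \times 30 \times 24 \times 60\ \text{min}
               = 0.0005 \times 43200\ \text{min} = 21.6\ \text{min} \\
\text{fast threshold} &= 14.4 \times 0.0005 = 0.0072 \quad (0.72\%) \\
\text{slow threshold} &= 3 \times 0.0005 = 0.0015 \quad (0.15\%) \\
\text{fast exhaustion} &= 30\ \text{d} / 14.4 \approx 2.08\ \text{d} \\
\text{slow exhaustion} &= 30\ \text{d} / 3 = 10\ \text{d} \\
\text{budget spent in 1 h at } 14.4\times &= 14.4 \times 1\ \text{h} / 720\ \text{h} = 2\% \\
\text{budget spent in 6 h at } 3\times &= 3 \times 6\ \text{h} / 720\ \text{h} = 2.5\%
\end{align*}
```

Read this way, the fast-burn page fires after roughly 2% of the monthly budget is gone and the slow-burn ticket after roughly 2.5%.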
Step 5 — Store-error rate
A standalone store-error rule catches degradation before it becomes fail-open, giving you a head start.
```yaml
  # Add under the top-level groups: key.
  - name: ratelimit_store
    rules:
      - alert: RateLimitStoreErrorBurn
        expr: |
          sum(rate(ratelimit_store_errors_total[5m]))
          / sum(rate(ratelimit_requests_total[5m])) > 0.01
        for: 5m
        labels: { severity: page, component: rate-limiter }
        annotations:
          summary: "Rate limiter store error rate above 1%"
          description: "Store-error ratio {{ $value | humanizePercentage }}. The limiter is degrading and may fail open."
```
Step 6 — Routing and severity
Route on the severity label in Alertmanager: page to the pager, ticket to Slack/issue tracker. Group by component so a store incident does not fan out into five simultaneous pages.
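```yaml
route:
  # Alertmanager requires a default receiver on the root route.
  # Using the Slack receiver as that default is an assumption; pick your own.
  receiver: slack-rate-limiting
  group_by: ["component"]
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty
      group_wait: 30s
    - matchers: [severity="ticket"]
      receiver: slack-rate-limiting
      group_wait: 5m
```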
Verification & testing
Validate the rule syntax and logic, then prove it in staging.
```shell
# 1. Lint the rules.
promtool check rules ratelimit-alerts.yml
# 2. Unit-test the burn-rate logic with synthetic series.
promtool test rules ratelimit-alerts.test.yml
# 3. In staging, force a store outage (block Redis) and confirm THIS PromQL query goes non-zero:
#      sum(rate(ratelimit_fail_open_total[5m]))
# RateLimitFailOpen should fire within ~2m, and RateLimit429SpikeFree should NOT.
```
The test that matters most: trigger a free-tier flood and confirm no page fires, then trigger a store outage and confirm RateLimitFailOpen fires. An alerting setup that pages on healthy shedding is worse than none, because it trains responders to ignore the limiter.
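The same two scenarios can be approximated offline with promtool's unit-test format before anything reaches staging. What follows is a minimal sketch, not the guide's actual test file. It assumes all rule groups from the steps above live in ratelimit-alerts.yml, that allowed requests carry decision="allowed" (the rules only name "blocked"), and that the synthetic per-minute increments are reasonable stand-ins for real traffic.

```yaml
# ratelimit-alerts.test.yml (hypothetical contents)
rule_files:
  - ratelimit-alerts.yml

evaluation_interval: 1m

# Evaluate recording rules before the alerts that read them.
group_eval_order:
  - ratelimit_recording
  - ratelimit_429

tests:
  # Scenario 1: free-tier flood. 75% of free traffic is blocked,
  # pro sees 1% blocks, and the store is healthy. No page may fire.
  - interval: 1m
    input_series:
      - series: 'ratelimit_requests_total{tier="free", decision="blocked"}'
        values: '0+4500x30'
      - series: 'ratelimit_requests_total{tier="free", decision="allowed"}'
        values: '0+1500x30'
      - series: 'ratelimit_requests_total{tier="pro", decision="blocked"}'
        values: '0+60x30'
      - series: 'ratelimit_requests_total{tier="pro", decision="allowed"}'
        values: '0+5940x30'
    alert_rule_test:
      # An empty exp_alerts asserts that the alert is NOT firing.
      - eval_time: 30m
        alertname: RateLimit429SpikePaid
        exp_alerts: []
      - eval_time: 30m
        alertname: RateLimitFailOpen
        exp_alerts: []

  # Scenario 2: store outage. Fail-open and store-error counters both
  # climb at 1/s, so RateLimitFailOpen should be firing after its 2m hold.
  - interval: 1m
    input_series:
      - series: 'ratelimit_requests_total{tier="free", decision="allowed"}'
        values: '0+6000x10'
      - series: 'ratelimit_fail_open_total'
        values: '0+60x10'
      - series: 'ratelimit_store_errors_total'
        values: '0+60x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: RateLimit429SpikeFree
        exp_alerts: []
    promql_expr_test:
      # Checking the ALERTS series avoids copying annotation text verbatim.
      - expr: ALERTS{alertname="RateLimitFailOpen", alertstate="firing"}
        eval_time: 5m
        exp_samples:
          - labels: 'ALERTS{alertname="RateLimitFailOpen", alertstate="firing", severity="page", component="rate-limiter"}'
            value: 1
```

The positive check goes through the ALERTS series because alert_rule_test compares annotations exactly, which makes tests brittle whenever a summary is reworded. If you prefer alert_rule_test for the firing case, supply exp_labels and the full exp_annotations as they appear in the rule.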
Gotchas & edge cases
- Never page on raw 429 rate. Always split by key_class. A fleet-wide 429 threshold pages on every scraper and gets muted within a week.
- The fail-open rule needs the fail-open counter. If the limiter does not increment ratelimit_fail_open_total on the store-error path, this entire guide is blind. Verify that instrumentation first.
- Burn-rate thresholds scale with your SLO. The 0.0005 factor is 1 − 0.9995. Change the SLO, change every threshold, or the burn rates silently mean something else.
- for: must be shorter than the short window's signal. A for: 10m on a 5 m short window can delay a real page past the point of usefulness. Keep for: at 2 m for page-severity burn alerts.
- Recording rules avoid re-computation drift. If two alerts compute the block ratio inline with slightly different expressions, they will disagree. Compute once in a recording rule.
- Mind counter resets at deploy. rate() handles resets, but a rolling deploy can briefly inflate fail-open if the limiter starts before the store connection is ready. A 2 m for: absorbs this.
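To make the SLO-scaling gotcha concrete, here is a sketch of the fast-burn rule rewritten for a hypothetical 99.9% SLO. The group name and the 99.9% target are illustrative assumptions; the only change to the rule is the error-budget factor, which becomes 1 − 0.999 = 0.001, so the fast-burn threshold rises from 0.0072 to 0.0144.

```yaml
groups:
  - name: ratelimit_burnrate_999   # hypothetical group name
    rules:
      # Same 14.4x burn rate, but against a 0.1% budget instead of 0.05%.
      - alert: RateLimitBudgetBurnFast
        expr: |
          ratelimit:slo_error_ratio:1h > (14.4 * 0.001)
          and
          ratelimit:slo_error_ratio:5m > (14.4 * 0.001)
        for: 2m
        labels: { severity: page, component: rate-limiter }
```

Leaving the factor at 0.0005 under a 99.9% SLO would make this rule fire at an effective 7.2× burn rate while its name and annotations still claim 14.4×.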
Frequently Asked Questions
Why not just alert when the 429 rate crosses a threshold?
Because a high 429 rate is often the limiter working correctly — shedding abuse on the free tier. A flat threshold pages on every scraper and trains responders to ignore the alert. Split by key_class so paid-tier blocks page and free-tier shedding stays quiet, and reserve pages for fail-open and store errors.
What burn rates and windows should I use?
The fast-burn page uses the Google SRE Workbook default of 14.4× over 1h/5m. This guide pairs it with a 3× slow-burn ticket over 6h/1h; the Workbook's own slower tiers are 6× over 6h/30m and 1× over 3d/6h, so adopt those if you want the reference values. The threshold for each is the burn rate times (1 − SLO). Both the long and short window must exceed the threshold for the alert to fire, which suppresses blips and lets the alert resolve quickly.
How do I alert on fail-open when its symptom is the absence of 429s?
You cannot alert on an absence, so you alert on the positive ratelimit_fail_open_total counter that the limiter increments whenever it admits a request because the store was unreachable. Require a corroborating non-zero store-error rate so a single increment does not page, and set a short for: so it fires fast.
Should a 429 count against the limiter's SLO?
No — not when the client genuinely exceeded its quota. Returning 429 is the limiter's job. The SLO bad events are fail-open, fail-closed-while-under-quota, and over-latency decisions. That is why the burn-rate rules query fail-open and store errors, not the 429 rate.
Related
- Alerting & SLOs — the parent guide on healthy-vs-broken 429s and the error-budget model.
- Metrics & Instrumentation — the metrics every rule here queries.
- Prometheus Metrics for Rate Limiting — emit the series these alerts depend on.
- Grafana Rate Limit Dashboards — the panels you open when one of these alerts fires.