Dimensions Deep Dive · D8 Reliability · D9 Agent Experience

Rate Limiting for AI Agents: Why X-RateLimit-Remaining Is the Most Important Header

Rate limiting for AI agents is different from rate limiting for humans. Humans retry manually; agents need machine-readable guidance. Stripe’s 429 response includes all four required headers, which is why it scores 68. Without these headers, agents either crash or DDoS your API. Here is the four-header contract and why token buckets beat fixed windows in the agent economy.

AgentHermes Research
April 15, 2026 · 12 min read

Why Agent Rate Limits Are Different From Human Rate Limits

A human who hits a rate limit sees “too many requests” on a web page, shrugs, gets coffee, and tries in 15 minutes. The retry happens when a human decides to come back — always many minutes later, never in a tight loop.

An AI agent does not get coffee. An AI agent is in a while-loop. It hits your API, gets a 429, and by default its next line of code retries immediately. Within 60 seconds that agent can send thousands of requests to an endpoint that is already saturated. Congratulations: you just invented a self-inflicted DDoS vector.

The only difference between a legitimate agent and a denial-of-service attack is whether the 429 response carries machine-readable guidance. With X-RateLimit-Remaining and Retry-After, a well-written agent throttles itself before hitting the limit, sleeps the exact number of seconds you told it to, and comes back under budget. Without those headers, the agent either hammers or gives up — both are bad outcomes for the API owner.

4 headers required
23% combined D8+D9 weight
500 businesses scanned
68 Stripe ARS score

The Four Required Headers

Every response — not just 429s — should carry the first three. The fourth is specific to rejection. Together they form the contract between your API and every agent that will ever call it.

X-RateLimit-Limit

The maximum number of requests the agent is allowed in the current window. Static per API key or plan. Without this header, an agent has no way to size its concurrency — it has to probe blind.

X-RateLimit-Limit: 100

X-RateLimit-Remaining

How many requests are still allowed in this window. This is the single most important header in the agent economy. Agents watch this value drop and throttle themselves before they ever hit a 429.

X-RateLimit-Remaining: 23

X-RateLimit-Reset

Unix timestamp (or seconds until reset) when the current window rolls over. Lets agents plan: "I have 23 requests left and the window resets in 14 seconds, so I can burst if needed." Without this, agents guess.

X-RateLimit-Reset: 1744742400

Retry-After

Sent only on 429 responses. Tells the agent exactly how long to wait before the next attempt, usually as a number of seconds (HTTP also allows an HTTP-date value). Replaces the ambiguity of "try again later" with a precise machine-readable number.

Retry-After: 7
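Because X-RateLimit-Reset may arrive as a Unix timestamp or as seconds until reset, and Retry-After may arrive as seconds or as an HTTP date, an agent usually normalizes both to "seconds from now" first. The following is a minimal sketch of that step. The function names and the EPOCH_THRESHOLD heuristic are assumptions made for illustration, not part of any standard.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: a value this large is an absolute Unix timestamp (it corresponds
# to September 2001); anything smaller is treated as "seconds until reset".
EPOCH_THRESHOLD = 1_000_000_000


def seconds_until_reset(reset_value: str, now: datetime) -> float:
    """Normalize X-RateLimit-Reset to seconds left in the current window."""
    value = float(reset_value)
    if value >= EPOCH_THRESHOLD:
        return max(0.0, value - now.timestamp())
    return max(0.0, value)


def retry_after_seconds(value: str, now: datetime) -> float:
    """Normalize Retry-After (delay in seconds or HTTP date) to seconds."""
    value = value.strip()
    if value.isdigit():
        return float(value)
    # Not a plain integer, so treat it as an HTTP date.
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Passing `now` in explicitly, as a timezone-aware datetime, keeps both helpers deterministic and easy to test.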

All four headers are zero-friction for the client. Every mainstream HTTP library exposes response headers, so no SDK update is required. The agent side of the contract is a 20-line wrapper that watches X-RateLimit-Remaining and pauses when it hits a low-water mark.
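As a rough illustration of that agent-side wrapper, the sketch below decides how long to sleep after each response. The function name, the low-water mark of 5, and the 1-second fallback are hypothetical choices. It also assumes X-RateLimit-Reset is a Unix timestamp and Retry-After is a number of seconds.

```python
def pause_before_next_request(status, headers, now, low_water_mark=5):
    """Return how many seconds the agent should sleep before its next call."""
    h = {name.lower(): value for name, value in headers.items()}

    if status == 429:
        # The server told us exactly how long to wait. The 1-second fallback
        # for a missing header is an arbitrary illustrative default.
        return float(h.get("retry-after", 1))

    remaining = h.get("x-ratelimit-remaining")
    reset = h.get("x-ratelimit-reset")
    if remaining is None or reset is None:
        return 0.0  # no guidance available, nothing to pace against

    remaining = int(remaining)
    if remaining > low_water_mark:
        return 0.0  # plenty of budget left, keep going

    # Below the low-water mark: spread what is left across the rest of the window.
    window_left = max(0.0, float(reset) - now)
    return window_left / (remaining + 1)
```

An agent loop would call this after every response and pass the result to `time.sleep` before sending the next request.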

Token Bucket Beats Fixed Windows for Agent Traffic

There are two common rate-limit algorithms. Fixed windows (“100 requests per minute”) are simple to implement and terrible for agents. Token buckets are slightly harder to implement and drastically better for agent workloads.

A fixed-window limit punishes the natural shape of agent work. Agents batch: they fetch 50 items, enrich each one with a follow-up call, then idle. Under a fixed-window limit they blow the budget in the first 10 seconds of the minute and sit idle for 50. Under a token bucket they burst to the bucket capacity, refill continuously at the limit rate, and never waste capacity.

Behavior                | Fixed Window                  | Token Bucket
Bursting                | All-or-nothing at window edge | Allowed up to bucket capacity
Steady-state throughput | Identical to bucket           | Identical to fixed
Idle recovery           | Wasted capacity               | Refills the bucket
Agent ergonomics        | Thundering herd at reset      | Smooth self-throttling
Implementation cost     | Counter + timestamp           | Counter + last-refill time
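To make the comparison concrete, here is a small sketch of both algorithms with the clock passed in explicitly. The class names and parameters are illustrative assumptions, not taken from any particular API's implementation.

```python
class FixedWindow:
    """Counter + timestamp: at most `limit` requests per aligned window."""

    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.window_id = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window_sec)
        if window_id != self.window_id:  # new window, reset the counter
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class TokenBucket:
    """Counter + last-refill time: burst up to `capacity`, refill continuously."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last_refill = now

    def allow(self, now: float) -> bool:
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With both configured for 60 requests per minute (bucket capacity 60, refill 1 token per second), an agent that bursts 60 calls at t=0 is fully served by either. At t=30 the fixed window still rejects everything until the minute rolls over, while the bucket has refilled 30 tokens and can serve 30 more calls.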

Scoring implication: AgentHermes does not score the algorithm itself — we score the observable outputs (headers + structured 429s). But the top scorers (Stripe 68, GitHub 67, Slack 68, Resend 75) all use token buckets. It is a correlation worth noticing: the APIs that ship good agent ergonomics also pick the algorithm that matches agent traffic shapes.

Per-API-Key Limits Isolate Noisy Neighbors

Rate limiting at the IP level made sense when your clients were browsers. Agents break that assumption. Agents typically share an IP pool with thousands of other agents (Lambda, Cloudflare Workers, residential proxy networks). An IP-level limit means a single noisy agent on the same shared egress can starve every other agent through no fault of their own.

Per-API-key limits scope the damage. Each caller gets their own bucket. A runaway loop in one integration cannot affect any other. This is how Stripe, Resend, and GitHub all handle it: the rate limit is tied to the secret key presented in Authorization, not to the connection origin.
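A minimal sketch of per-key isolation follows, assuming an in-memory dictionary of bucket state keyed by the caller's API key. The class name and parameters are hypothetical, and a production system would typically keep this state in a shared store rather than in process memory.

```python
class PerKeyLimiter:
    """One token bucket per API key, so callers cannot drain each other."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._state = {}  # api_key -> (tokens, last_refill_time)

    def allow(self, api_key: str, now: float) -> bool:
        # A key seen for the first time starts with a full bucket.
        tokens, last = self._state.get(api_key, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self._state[api_key] = (tokens, now)
        return allowed
```

The `api_key` argument stands in for the secret presented in the Authorization header. A runaway loop on one key exhausts only that key's bucket, and every other key keeps its full budget.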

Per-key limits earn D7 Security credit

AgentHermes scans for evidence of per-key isolation in rate-limit docs. IP-level-only is flagged as a D7 Security weakness — it creates a denial-of-service amplifier across tenants.

Expose the limit in your docs

Publish the per-plan or per-key limit numbers. Free: 100 req/min. Pro: 1000 req/min. Enterprise: negotiated. Ambiguity forces agents to over-probe just to figure out the envelope.

Return 429 with JSON, not HTML

{ "error": "rate_limited", "code": "too_many_requests", "message": "...", "request_id": "req_..." }. HTML error pages are a D9 Agent Experience penalty — they waste LLM tokens to parse.

Document exponential backoff guidance

On your /docs/rate-limits page, include the recommended backoff formula: min(2^n * base, max_delay) with jitter. Remove the ambiguity of "try again later" — give agents the exact math.
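The backoff formula above can be sketched as follows. The source does not say which jitter strategy it means, so this example assumes "full jitter" (a uniform draw between zero and the capped delay); the function name and default values are also illustrative.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 60.0,
                  rng=random) -> float:
    """Delay before retry number `attempt` (0-based): min(2^n * base, max_delay) with jitter."""
    ceiling = min((2 ** attempt) * base, max_delay)
    # Full jitter (an assumed variant): pick uniformly between 0 and the ceiling
    # so that many agents retrying at once do not synchronize.
    return rng.uniform(0.0, ceiling)
```

When the server does send Retry-After, that value should take precedence; a formula like this is the fallback for responses that carry no explicit guidance.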

How AgentHermes Scores Your Rate-Limit Surface

The scanner issues a handful of requests to likely API endpoints and inspects response headers. It also reads your /docs/rate-limits page if one exists. Six signals feed the rate-limit component across D8 and D9.

1. Presence of X-RateLimit-Limit

Returned on every response. Either X-RateLimit-Limit or the RFC-draft RateLimit-Limit form is accepted. This is the single biggest machine-readable trust signal.

2. Presence of X-RateLimit-Remaining

The self-throttle signal. Agents use this to pace themselves. Without it they either over-send (DDoS) or under-send (wasted capacity).

3. Presence of X-RateLimit-Reset

Tells agents when the window rolls over. Unix timestamp preferred. Seconds-until-reset acceptable. Hours (without unit) is an anti-pattern we flag.

4. Retry-After on 429 responses

We deliberately probe one endpoint past the rate limit to trigger a 429. A missing Retry-After header here is a 2-point D8 penalty.

5. Structured JSON 429 body

{ error, code, message, request_id } at minimum. HTML error pages or plaintext lose D9 credit.

6. Public /docs/rate-limits page

A discoverable documentation page describing limits, algorithm, per-key scoping, and recommended backoff. Cross-counts toward D1 Discoverability and D6 Data Quality.
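Before running a scan, you can check the four header signals yourself. The sketch below is a self-check only, not the AgentHermes scanner's logic. The function name is hypothetical, and accepting the unprefixed RateLimit-* form for Remaining and Reset as well as for Limit is an assumption.

```python
def missing_rate_limit_headers(status: int, headers: dict) -> list:
    """List the rate-limit headers a response should carry but does not."""
    present = {name.lower() for name in headers}
    missing = []
    for suffix in ("limit", "remaining", "reset"):
        # Accept either the X- prefixed form or the unprefixed draft form.
        if f"x-ratelimit-{suffix}" not in present and f"ratelimit-{suffix}" not in present:
            missing.append(f"X-RateLimit-{suffix.capitalize()}")
    # Retry-After is only expected on rejections.
    if status == 429 and "retry-after" not in present:
        missing.append("Retry-After")
    return missing
```

An empty list for both a normal response and a deliberately triggered 429 means the first four signals are in place.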

The combined lift from shipping all six signals is roughly 4-5 points on the total Agent Readiness Score for an API-centric business. Middleware can emit the three presence headers in about 15 lines of code — no new infrastructure, no database changes, no schema migrations.
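As one possible shape for that middleware, here is a sketch written against the standard WSGI interface. The class name and the `get_budget` callback, which is assumed to return the caller's limit, remaining count, and reset timestamp, are hypothetical; wiring it to a real limiter is left out.

```python
class RateLimitHeaders:
    """WSGI middleware that appends the three X-RateLimit-* headers to every response."""

    def __init__(self, app, get_budget):
        self.app = app
        # Hypothetical callback: environ -> (limit, remaining, reset_epoch)
        self.get_budget = get_budget

    def __call__(self, environ, start_response):
        limit, remaining, reset_epoch = self.get_budget(environ)

        def start_with_headers(status, headers, exc_info=None):
            extra = [
                ("X-RateLimit-Limit", str(limit)),
                ("X-RateLimit-Remaining", str(remaining)),
                ("X-RateLimit-Reset", str(reset_epoch)),
            ]
            return start_response(status, list(headers) + extra, exc_info)

        return self.app(environ, start_with_headers)
```

Wrapping an existing application is then a single line, for example `app = RateLimitHeaders(app, get_budget)`.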

If you only ship one of the four headers first, ship X-RateLimit-Remaining. It is the header that separates self-throttling agents from accidental DDoS. Everything else is refinement on top of that foundation.

Frequently Asked Questions

Why is rate limiting for AI agents different from rate limiting for humans?

A human who hits a rate limit sees an error page, waits a few minutes, and tries again. An AI agent is in a loop. If it does not know when to retry, it retries immediately — often within the same second — and makes the rate-limit situation worse. The only difference between a legitimate agent and a DDoS is whether the rate-limit response carries machine-readable guidance. With X-RateLimit-Remaining and Retry-After, the agent self-throttles. Without them, it hammers.

What does Stripe do that earns them 68 on Agent Readiness?

Stripe emits the three X-RateLimit-* headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) on every response, not just 429s, and adds Retry-After on 429s. Their 429 body is a structured JSON envelope with an error code and a human-readable message. They use a token bucket algorithm with per-API-key limits, which means one noisy integration cannot starve the rest of their users. All of that is documented on a public /docs/rate-limits page. That is why Stripe scores high on D8 Reliability (13% weight) and D9 Agent Experience (10% weight).

What is a token bucket algorithm and why does it beat fixed windows for agents?

A token bucket gives each caller a bucket that holds N tokens. Every request consumes one token. The bucket refills at a constant rate (e.g., 10 tokens per second). This lets an agent burst briefly — say, process a batch of 50 items in a few seconds — and then catch its breath while the bucket refills. A fixed-window limit (100 requests per minute) punishes agents that batch work naturally and wastes capacity during quiet minutes. Stripe, GitHub, and most top scorers in our 500-business scan use token buckets.

Why does AgentHermes weight rate-limit headers under D8 Reliability (13%) and D9 Agent Experience (10%)?

Rate limiting sits at the intersection of two dimensions. D8 Reliability measures whether agents can depend on your API over time — without predictable rate limits, they cannot. D9 Agent Experience measures whether agents can handle your responses programmatically — without machine-readable headers, the 429 is a dead end. Together these two dimensions are 23% of the total Agent Readiness Score. Exposing four headers is a single-afternoon change that lifts both.

Should 429 responses have a structured JSON body?

Yes. The body should be valid JSON with at minimum: { "error": "rate_limited", "code": "too_many_requests", "message": "You have exceeded the per-key rate limit", "request_id": "req_..." }. An HTML error page or bare plaintext is a scoring penalty in D9 Agent Experience — agents cannot parse it without an LLM round-trip, which wastes budget. Stripe's 429 body is the canonical reference implementation.
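On the server side, producing that response might look like the sketch below. The function name is hypothetical and the return shape (status, headers, body string) is framework-neutral by assumption. The field names and message text are the ones quoted above.

```python
import json


def build_429(retry_after_sec: int, request_id: str):
    """Build a machine-readable 429: JSON body plus a Retry-After header."""
    body = {
        "error": "rate_limited",
        "code": "too_many_requests",
        "message": "You have exceeded the per-key rate limit",
        "request_id": request_id,
    }
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after_sec),
    }
    return 429, headers, json.dumps(body)
```

The caller supplies `request_id` from whatever request-tracing scheme the API already uses.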


Check your rate-limit surface in 60 seconds

AgentHermes probes your API, inspects response headers, and scores your rate-limit ergonomics on D8 and D9. Free, fast, no signup.

