
Caching Strategies for Agent-Ready APIs: When to Cache, What to Invalidate

AI agents are the most repetitive API clients you will ever have. A single agent completing a task might call your get_services endpoint 10 times in one workflow — once for discovery, once for comparison, once for confirmation, and again every time it reconsiders. Without proper caching, each call hits your server and costs the agent tokens to parse the same response. With proper caching, most of those calls return instantly from cache with zero server load. This is not optimization — it is architecture for the agent economy.

AgentHermes Research
April 15, 2026 · 11 min read

Why Agents Are Different: The Repetitive Caller Problem

Human API clients are efficient. A mobile app fetches data once and renders it. A web dashboard loads on page visit and caches in the browser. Humans are lazy callers — they call your API only when they need to.

Agents are the opposite. An agent planning a task calls your API to understand what is available. Then it calls again to verify before making a decision. Then it calls again to confirm before executing. Then it calls again to validate after executing. A single agent completing a single booking might call get_services four times and check_availability six times.

Now multiply that by 100 agents serving 100 users in your area. Without caching, your server handles thousands of redundant requests for the same unchanged data. With proper caching, 90%+ of those calls never reach your origin. That is the difference between a $50/month server and a $500/month server for the same business.

10x more API calls per task vs human clients
90%+ of calls cacheable with proper headers
304 Not Modified = agent saves tokens
+15 D8 Reliability score from caching

Four Caching Strategies Every Agent-Ready API Needs

There is no single caching strategy. Different endpoints need different approaches based on how often the data changes and how harmful staleness is. Here are the four strategies and when to use each one.

Cache-Control: max-age

Static data that changes infrequently — business info, service menus, team pages, FAQs.

Cache-Control: public, max-age=3600, s-maxage=86400

Agent caches for 1 hour. CDN caches for 24 hours. Agent never calls your server for the same business info twice in an hour.

Savings: Up to 95% of redundant calls for catalog data
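
As a minimal sketch of how a server might assemble this header, the snippet below uses a hypothetical helper named `cache_control`. The helper and its parameters are assumptions for illustration, not part of any framework.

```python
from typing import Optional

def cache_control(max_age: int, s_maxage: Optional[int] = None, public: bool = True) -> str:
    """Build a Cache-Control value for slowly changing data (hypothetical helper)."""
    if max_age < 0 or (s_maxage is not None and s_maxage < 0):
        raise ValueError("cache lifetimes must be non-negative")
    parts = ["public" if public else "private", f"max-age={max_age}"]
    # s-maxage only applies to shared caches such as CDNs,
    # so it is skipped for private responses.
    if s_maxage is not None and public:
        parts.append(f"s-maxage={s_maxage}")
    return ", ".join(parts)

# Business info: 1 hour in the agent's cache, 24 hours at the CDN.
headers = {"Cache-Control": cache_control(3600, s_maxage=86400)}
```

Keeping the two lifetimes separate lets the CDN absorb most traffic while agents still re-check within the hour.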

ETag + If-None-Match

Data that changes sometimes — pricing, product details, menus. Agent checks if data changed before re-downloading.

ETag: "abc123" → Agent sends If-None-Match: "abc123" → 304 Not Modified (zero body transfer)

Agent sends the ETag it received last time. If your data has not changed, you return 304 with no body. The agent keeps its cached copy and saves tokens on parsing.

Savings: a 304 response costs roughly 10 tokens to process versus 500+ tokens for a full response
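
The exchange above can be sketched on the server side as follows. This is a framework-free illustration under stated assumptions: the function names `serialize`, `make_etag`, and `respond` are hypothetical, and the ETag is a truncated SHA-256 of the serialized body.

```python
import hashlib
import json
from typing import Optional, Tuple

def serialize(payload: dict) -> bytes:
    """Serialize deterministically so equal data always yields equal bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")

def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from a SHA-256 hash of the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(payload: dict, if_none_match: Optional[str] = None) -> Tuple[int, dict, bytes]:
    """Return (status, headers, body) for a GET, honoring If-None-Match."""
    body = serialize(payload)
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        # The header may list several tags; a W/ prefix marks a weak validator.
        tags = [t.strip() for t in if_none_match.split(",")]
        tags = [t[2:] if t.startswith("W/") else t for t in tags]
        if "*" in tags or etag in tags:
            return 304, headers, b""  # unchanged: no body is sent
    return 200, headers, body

catalog = {"services": ["haircut", "color"]}
status, headers, body = respond(catalog)  # first call: full response
status_again, _, body_again = respond(catalog, if_none_match=headers["ETag"])  # repeat call
```

Hashing the serialized body means the ETag changes exactly when the content does, with no version counter to maintain.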

stale-while-revalidate

Data that should be fresh but where a brief stale window is acceptable, such as inventory counts shown while browsing.

Cache-Control: public, max-age=60, stale-while-revalidate=300

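Agent gets cached data immediately (fast response) while the CDN revalidates in the background. With these values the response is fresh for 60 seconds and may then be served stale for up to 300 more, so the agent sees data at most about 6 minutes old but never waits for the revalidation.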

Savings: Latency: agent gets instant response instead of waiting for your origin server
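
To make the two windows concrete, here is a small sketch of how a cache might classify a stored response by its age. The function name `cache_state` and its return labels are assumptions for illustration; the defaults mirror the header above.

```python
def cache_state(age_s: float, max_age: int = 60, swr: int = 300) -> str:
    """Classify a cached response by its age in seconds."""
    if age_s < 0:
        raise ValueError("age cannot be negative")
    if age_s < max_age:
        return "fresh"  # serve from cache, no origin contact
    if age_s < max_age + swr:
        return "stale-while-revalidate"  # serve from cache, refresh in background
    return "expired"  # must fetch from origin before answering

# With max-age=60 and stale-while-revalidate=300, a response can be
# served from cache until it is 360 seconds old.
states = [cache_state(age) for age in (30, 60, 359, 360)]
```

The worst-case staleness is the sum of both values, so pick the stale-while-revalidate window with that total in mind.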

Cache-Control: no-cache / no-store

Real-time data where staleness is unacceptable, such as live availability at booking time, inventory at checkout, active orders, and payment status.

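Cache-Control: no-store

Never cache this. no-store tells every cache not to keep a copy, so every request hits your origin server. (no-cache is weaker: a cache may store the response but must revalidate it with your origin before each reuse.) Use no-store for endpoints where showing stale data causes real harm, such as double-booking an appointment or selling out-of-stock items.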

Savings: Nothing (that is the point) — freshness matters more than performance here

Endpoint Caching Matrix: What to Cache on Every MCP Tool

Here is a practical reference for common MCP tool endpoints. Copy this matrix and adapt it to your specific business. The principle: cache what does not change per-minute, invalidate what does.

| Endpoint | Strategy | Reason | Freshness |
| --- | --- | --- | --- |
| get_business_info | max-age: 3600 | Address, phone, and hours change rarely | Hourly |
| get_services / get_products | max-age: 1800 + ETag | Catalog changes weekly at most | 30 min |
| get_pricing | max-age: 300 + ETag | Prices change but not per-minute | 5 min |
| check_availability | no-store | Slots book in real-time, staleness = double booking | Real-time |
| get_inventory | stale-while-revalidate: 60 | Brief staleness acceptable, speed matters | ~1 min |
| book_appointment | no-store | Write operation, never cache | Real-time |
| get_reviews | max-age: 3600 | Reviews do not change frequently | Hourly |
| get_aftercare | max-age: 86400 | Static content, rarely updated | Daily |

The pattern is clear: read-heavy endpoints with slowly changing data get aggressive caching. Write endpoints and real-time availability get no caching. Everything in between uses conditional requests or stale-while-revalidate. This is the same pattern that powers the CDN layer of agent-ready infrastructure — the logic is identical whether caching happens at the CDN edge or in the agent client.
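
One way to apply the matrix is a single lookup table consulted by every tool handler. The sketch below is illustrative only: `CACHE_POLICY`, `ETAG_TOOLS`, and `headers_for` are hypothetical names, and the `max-age=30` on get_inventory is an assumption, since the matrix specifies only the stale-while-revalidate window for that tool.

```python
from typing import Optional

# Cache-Control value per MCP tool, taken from the matrix above.
CACHE_POLICY = {
    "get_business_info": "public, max-age=3600",
    "get_services": "public, max-age=1800",
    "get_products": "public, max-age=1800",
    "get_pricing": "public, max-age=300",
    "check_availability": "no-store",
    # max-age=30 is an assumed value; only the 60 s stale window is given.
    "get_inventory": "public, max-age=30, stale-while-revalidate=60",
    "book_appointment": "no-store",
    "get_reviews": "public, max-age=3600",
    "get_aftercare": "public, max-age=86400",
}

# Tools whose matrix entry also calls for an ETag.
ETAG_TOOLS = {"get_services", "get_products", "get_pricing"}

def headers_for(tool: str, etag: Optional[str] = None) -> dict:
    """Return the caching headers for a tool response."""
    # Unknown tools fall back to no-store: a wrong "never cache" only costs
    # speed, while a wrong "cache for an hour" can serve stale bookings.
    headers = {"Cache-Control": CACHE_POLICY.get(tool, "no-store")}
    if tool in ETAG_TOOLS and etag is not None:
        headers["ETag"] = etag
    return headers
```

Centralizing the policy in one table also keeps headers consistent across endpoints, since no handler sets them ad hoc.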

How Caching Impacts Your Agent Readiness Score

AgentHermes scans for caching infrastructure across two dimensions: D8 Reliability (does the API perform consistently under agent load?) and D9 Agent Experience (is the API designed with agent callers in mind?). Proper cache headers are one of the strongest signals of both.

Cache-Control headers present

D8 Reliability score +8-12 points

AgentHermes scans for proper cache headers on API responses. Their presence indicates a mature, agent-considerate API.

ETag support on GET endpoints

D8 Reliability score +5-8 points

Conditional request support tells AgentHermes the API is designed for efficient repeated access — exactly what agents do.

Consistent cache behavior

D9 Agent Experience score +3-5 points

When cache headers are consistent across endpoints (not random or missing on some), it signals intentional API design.

304 Not Modified responses

D8 Reliability score +3-5 points

Actually returning 304 when data has not changed (not just sending ETags) proves the caching layer works end-to-end.

Total impact: Proper caching can add 19-30 points to your Agent Readiness Score across D8 and D9. For a business scoring 40 (Bronze), adding cache headers alone could push them to 60+ (Silver). This is one of the highest-leverage improvements in the scoring framework — and one of the easiest to implement. Most API frameworks support cache headers with a single middleware addition.

Caching Anti-Patterns That Hurt Agent Readiness

Bad caching is worse than no caching. Here are the patterns we see in scans that actively hurt agent readiness — and what to do instead.

Cache-Control: no-cache on everything

Treating all endpoints as real-time when 80% of your data changes daily at most. This forces every agent call to your origin, multiplying server load by 10x unnecessarily.

Fix: Audit each endpoint. Only truly real-time data needs no-cache or no-store.

Missing ETag on large responses

Sending the full 50KB product catalog every time when nothing changed. The agent re-parses identical JSON, burning tokens and adding latency.

Fix: Add ETag based on content hash. Return 304 when unchanged.

Inconsistent headers across endpoints

Cache-Control on /products but not on /services. ETags on /info but not on /pricing. Inconsistency signals unintentional caching rather than designed caching.

Fix: Add caching middleware at the framework level. Every GET endpoint gets a strategy.

Caching personalized responses

Caching responses that include user-specific data (account details, order history) with public cache headers. This leaks data between agent sessions.

Fix: Use Cache-Control: private for authenticated responses. Public only for public data.
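
The four anti-patterns above are mechanical enough to check in a script. The following is a rough sketch, not how AgentHermes scans: the `audit` function, its input shape, and the 10,000-byte threshold for a "large" response are all assumptions for illustration.

```python
LARGE_BODY_BYTES = 10_000  # assumed threshold for a "large" response

def audit(endpoints: dict) -> list:
    """Flag caching anti-patterns.

    endpoints maps a path to a dict with "headers" (response headers) and
    optional "authenticated" (bool) and "body_bytes" (int) entries.
    """
    findings = []
    uncached = 0
    for path, info in sorted(endpoints.items()):
        headers = {k.lower(): v for k, v in info["headers"].items()}
        cc = headers.get("cache-control", "").lower()
        directives = {d.strip().split("=")[0] for d in cc.split(",") if d.strip()}
        if not directives:
            findings.append(f"{path}: missing Cache-Control")
        if directives & {"no-cache", "no-store"}:
            uncached += 1
        if info.get("authenticated") and "public" in directives:
            findings.append(f"{path}: public caching on an authenticated response")
        is_large = info.get("body_bytes", 0) >= LARGE_BODY_BYTES
        if is_large and "etag" not in headers and "no-store" not in directives:
            findings.append(f"{path}: large response without ETag")
    if endpoints and uncached == len(endpoints):
        findings.append("all endpoints opt out of caching")
    return findings

sample = {
    "/products": {"headers": {"Cache-Control": "public, max-age=1800"}, "body_bytes": 50_000},
    "/services": {"headers": {}, "body_bytes": 2_000},
    "/orders": {"headers": {"Cache-Control": "public, max-age=60", "ETag": '"x"'},
                "authenticated": True, "body_bytes": 500},
}
report = audit(sample)
```

On the sample data, the report flags one finding per endpoint: the public authenticated response, the large catalog without an ETag, and the endpoint with no Cache-Control at all.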

The Cost Math: What Proper Caching Saves

Consider a local restaurant with an MCP server handling 1,000 agent interactions per day. Each interaction calls an average of 8 endpoints: business info, menu, hours, availability (x2), pricing, booking, and confirmation. That is 8,000 API calls per day.

Without caching: all 8,000 calls hit your origin server. At 200ms per call, that is about 27 minutes (1,600 seconds) of compute time per day. Your server needs to handle 5-6 concurrent requests during peak hours.

With proper caching: business info, menu, hours, and pricing are cached at the CDN. Only the two availability checks, booking, and confirmation reach your origin. That is 4 of the 8 calls per interaction, or about 4,000 calls per day, a 50% reduction. The cached calls respond in under 10ms from the CDN edge. Your server handles peak loads that are about half the size.

For the agent side, each response served from the agent's own cache or answered with 304 Not Modified saves the agent from parsing the same JSON again. A 304 is roughly 10 tokens to process versus 500+ tokens for a full response. If all 4,000 cacheable calls per day are answered that way, that is roughly 2 million tokens saved per day (4,000 × ~490), which is real money for agents billed per token.
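
The arithmetic in this section can be reproduced with a few lines, which makes it easy to substitute your own traffic numbers. This is a sketch under stated assumptions: `daily_cost` is a hypothetical helper, and the count of origin calls per interaction is the key input. The value 4 used below (two availability checks, booking, confirmation) is an assumption about the breakdown, so adjust it to match which of your calls are truly uncacheable.

```python
def daily_cost(interactions: int, calls_per_interaction: int,
               origin_calls_per_interaction: int, latency_s: float = 0.2,
               full_tokens: int = 500, cached_tokens: int = 10) -> dict:
    """Estimate daily origin load and agent token savings."""
    if calls_per_interaction <= 0 or interactions <= 0:
        raise ValueError("interactions and calls per interaction must be positive")
    if not 0 <= origin_calls_per_interaction <= calls_per_interaction:
        raise ValueError("origin calls must be between 0 and calls per interaction")
    total = interactions * calls_per_interaction
    origin = interactions * origin_calls_per_interaction
    cached = total - origin
    return {
        "total_calls": total,
        "origin_calls": origin,
        "origin_reduction": cached / total,
        "origin_compute_minutes": origin * latency_s / 60,
        # Assumes every cached call costs the agent ~10 tokens instead of ~500.
        "tokens_saved": cached * (full_tokens - cached_tokens),
    }

# No caching: every one of the 8 calls per interaction reaches the origin.
baseline = daily_cost(1000, 8, 8)
# Assumed split: 4 of the 8 calls per interaction still reach the origin.
with_cache = daily_cost(1000, 8, 4)
```

The token figure is an upper bound: a response served from a CDN with status 200 still carries a full body, so only client-side cache hits and 304 responses save parsing tokens.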

Frequently Asked Questions

Do AI agents actually respect Cache-Control headers?

The leading AI agent frameworks (LangChain, AutoGPT, CrewAI) are increasingly implementing HTTP caching. Even when agents themselves do not cache, the CDN layer and any proxy between the agent and your API will. Proper cache headers benefit the entire stack, not just the agent client. Additionally, agent orchestration platforms cache tool responses to avoid redundant calls within a single task — your headers tell them how long that cache is valid.
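
To illustrate what honoring your headers could look like on the agent side, here is a minimal sketch of a per-task tool response cache. It is not taken from any framework: the class `ToolResponseCache` and its methods are hypothetical, and it handles only max-age, no-store, and no-cache.

```python
import re
import time

class ToolResponseCache:
    """Tiny in-memory cache that honors max-age (hypothetical sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}   # key -> (body, expiry timestamp)

    def store(self, key, body, cache_control: str) -> None:
        cc = cache_control.lower()
        # Treat no-cache like no-store here: this sketch cannot revalidate.
        if "no-store" in cc or "no-cache" in cc:
            return
        match = re.search(r"\bmax-age=(\d+)", cc)
        if match is None:
            return  # no explicit lifetime, so do not guess one
        self._entries[key] = (body, self._clock() + int(match.group(1)))

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        body, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lifetime elapsed: drop and refetch
            return None
        return body

cache = ToolResponseCache()
cache.store("get_services", {"services": ["haircut"]}, "public, max-age=3600, s-maxage=86400")
cache.store("check_availability", {"slots": []}, "no-store")
```

After these calls, a repeat get_services lookup within the hour is answered locally, while check_availability always goes back to your server.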

Should I cache differently for agents versus human API clients?

No. Good caching strategy is good caching strategy regardless of the client. The difference is that agents are more aggressive callers — they may hit the same endpoint 10 times in a single task flow where a human would call it once. This makes proper caching more impactful for agent traffic but the headers and strategies are identical. Do not try to detect agent user-agents and serve different cache headers.

What about Vary headers for agent-specific caching?

Use Vary: Accept to cache different representations (JSON vs HTML) separately. Use Vary: Authorization when responses differ per API key. Do not use Vary: User-Agent — this destroys cache hit rates because every agent framework sends a different user-agent string. The goal is to maximize cache hits, and User-Agent variation works against that.
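
A short sketch shows why. A cache builds its lookup key from the request plus whichever request headers Vary names, so every distinct value of a varied header creates a separate cache entry. The function `cache_key` and the example user-agent strings and URL are made up for illustration.

```python
def cache_key(method: str, url: str, request_headers: dict, vary: str) -> tuple:
    """Build a cache key from the request plus the headers named in Vary."""
    normalized = {k.lower(): v for k, v in request_headers.items()}
    names = [n.strip().lower() for n in vary.split(",") if n.strip()]
    varied = tuple((name, normalized.get(name, "")) for name in names)
    return (method.upper(), url, varied)

url = "https://api.example.com/services"  # placeholder URL
agent_a = {"Accept": "application/json", "User-Agent": "agent-framework-a/1.0"}
agent_b = {"Accept": "application/json", "User-Agent": "agent-framework-b/2.3"}

# Vary: Accept -> both agents map to the same entry and share one cached copy.
shared = cache_key("GET", url, agent_a, "Accept") == cache_key("GET", url, agent_b, "Accept")

# Vary: User-Agent -> each framework gets its own entry, so hits are rare.
split = cache_key("GET", url, agent_a, "User-Agent") != cache_key("GET", url, agent_b, "User-Agent")
```

Both `shared` and `split` are true here: the same two requests share a cache entry under Vary: Accept and fragment into two under Vary: User-Agent.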

How does caching interact with rate limiting?

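They are complementary. Caching reduces the number of requests that reach your rate limiter. A request answered from the agent's own cache or from a CDN edge never reaches your API gateway, so it does not count against rate limits. A conditional request that your origin answers with 304 is still a request and typically does count, though it is much cheaper to serve. Proper caching therefore effectively increases your rate limit capacity without changing the limit: agents can do more work with fewer real requests.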

My data changes every few seconds. Can I still use caching?

Yes, with stale-while-revalidate. Even a 10-second max-age with a 60-second stale-while-revalidate window means agents get instant responses for the common case while your origin only gets hit once per 10 seconds per unique resource. For truly real-time data (live auction prices, active order tracking), use no-store and consider WebSocket or SSE instead of polling.


Check your API caching score

See how your API scores on D8 Reliability and D9 Agent Experience. AgentHermes scans for Cache-Control headers, ETag support, and response consistency — free.

