Skip to content
V VazDEng
llm

Prompt caching: the one-line change that cuts 90% of LLM cost in production

cache_control ephemeral, a 5-minute TTL, and how one line dropped 75% of the cost of my nightly pipeline.

By Thais Vaz 11 Jun · 2026 4 min read PT · EN
Prompt caching: the one-line change that cuts 90% of LLM cost in production

18 thousand tokens. That was the cost of every run of my news pipeline with 6 parallel sub-agents. After one line of code, it became 4,500. Same model. Same prompt. Same output. I just turned on the cache.

The feature has been in the Anthropic API for over a year. Most teams running LLMs in production still haven’t turned it on. I myself ran for months paying full price before actually reading my invoice. It’s the highest return per minute of work I know of today.

Why LLM cost in production is prefix

Every API call sends 4 things: system prompt, few-shots, context, and the question. In a real pipeline, the first 3 add up to 80 to 95 percent of the tokens, and they repeat on every call. The question changes. The rest is prefix.

Without cache, you pay for the entire prefix every time. In a pipeline running dozens or hundreds of times an hour, that becomes the bill. In a pipeline with parallel fan-out (several sub-agents sharing the same system prompt), it becomes the bill times the number of sub-agents.

With cache, you pay for the prefix once (cache write), then only the delta of each new call (cache read). A cache read costs about 10% of the normal input price.

Anatomy of a call: system prompt, few-shots and context are 80-95% of the tokens and repeat; only the question changes

How Anthropic’s cache works

You mark a block of the prompt with cache_control: ephemeral. Simplified example:

"system": [
  {
    "type": "text",
    "text": "<long, stable system prompt here>",
    "cache_control": {"type": "ephemeral"}
  }
]

Default TTL is 5 minutes. Next call inside that window: the cached prefix is read at 10% of the normal price. Anthropic also offers a 1-hour TTL as a paid option, useful for more spaced-out workflows.

The API returns 2 metrics you need to monitor:

No model change, no prompt rewrite. Just flag what’s cacheable.

A real benchmark from my daily news pipeline

The number in the opening comes from a pipeline I built and maintain: my daily news skill, running every day at 8am. It fires 6 parallel sub-agents: data engineering, AI, investing, crypto, local politics, international politics. Each one carries a fixed system prompt of roughly 3 thousand tokens with tone rules, output format, prioritized sources, and synthesis style.

Without cache, the bill I was paying is direct math:

With cache:

In a more aggressive production pipeline (running dozens of times an hour with larger prefixes), the cut reaches 90%.

Real benchmark: 18 thousand tokens per run without cache vs 4,500 with cache, a 75% cut

Where it shines, where it doesn’t

Shines:

Doesn’t shine:

Where the cache shines: fixed prefix, fan-out, loops, large documents. Where it doesn’t: one-shot, unstable prompt, cadence beyond the TTL

Caveats that kill the gain if you don’t know them:

  1. A cache write is slower than a normal call. You pay once in latency, you win on every call after. In a nightly pipeline that’s irrelevant. In an interactive chat, it matters.
  2. Don’t cache PII or sensitive data without auditing first. Anthropic’s cache is per-account, but the principle stands.
  3. The 5-minute TTL is a short window. If your job re-runs the pipeline every 10 minutes, the cache never hits. For those cases, use the 1-hour TTL.
  4. You only see the gain if you monitor the 2 metrics. A timestamp at the top of the system prompt is enough for the prefix to never cache, and without watching cache_read you think you turned it on and you didn’t.

It’s not micro-optimization. It’s architecture.

Whoever is paying 100% of the price of every call because “there was no time to configure it” is accumulating debt with Anthropic every month. In a production pipeline with serious volume, that becomes thousands of dollars a year. For one line of code.

The rule I now follow in everything I build: structure the prompt in layers. Stable first (cacheable), volatile last. Mark the stable part with cache_control: ephemeral. Monitor cache_creation and cache_read. Pay once, read many.

It’s the ABC. And there are still teams calling this “advanced optimization”.