
feat(compiler): add cache_control breakpoints for Anthropic prompt caching#38

Open
pitimon wants to merge 1 commit into VectifyAI:main from pitimon:feat/37-cache-control-prompt-caching

Conversation


@pitimon pitimon commented May 4, 2026

Summary

  • Adds two cache_control: {"type": "ephemeral"} breakpoints to the compiler so Anthropic prompt caching can hit on every reuse of the base context (message shapes sketched after this list).
  • Breakpoint 1 — end of doc_msg: caches (system + doc) across summary, concepts plan, and every concept page call (1 + 1 + N + M calls per document).
  • Breakpoint 2 — end of assistant summary: caches (system + doc + summary) across the plan call and every concept generation call.
  • Side fix: _llm_call_async now forwards **kwargs for parity with _llm_call.
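For orientation, here is a minimal sketch of the message shapes the two breakpoints produce. The placeholder text and variable names are illustrative only; compiler.py builds these messages with its own helpers, which this description does not show.

```python
# Hedged sketch of the breakpoint placement, not the compiler's actual code.
SYSTEM_PROMPT = "..."  # placeholder
DOC_TEXT = "..."       # placeholder
SUMMARY_TEXT = "..."   # placeholder

# Breakpoint 1: end of doc_msg. The (system + doc) prefix up to this marker
# is written to Anthropic's prompt cache on the first call and read back on
# the plan call and every concept-page call.
doc_msg = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": DOC_TEXT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
}

# Breakpoint 2: end of the assistant summary. Extends the cached prefix to
# (system + doc + summary) for the plan call and the N + M concept calls.
summary_msg = {
    "role": "assistant",
    "content": [
        {
            "type": "text",
            "text": SUMMARY_TEXT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
}

messages = [{"role": "system", "content": SYSTEM_PROMPT}, doc_msg, summary_msg]
```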

Why

The compiler architecture is explicitly designed around base-context reuse (per CLAUDE.md: "Designed around prompt-cache reuse: a single base context A reused across summary → concept-plan → concept-page calls"), but no cache markers were emitted, so every call rebilled the full document content. For Anthropic Sonnet 4.5, via OpenRouter or direct, this change typically reduces input cost by ~90% on the cached prefix and lowers TTFT after the first call.

Compatibility

  • Anthropic / OpenRouter→Anthropic: cache_control honored.
  • OpenAI: list-of-blocks content is a valid OpenAI-compatible shape; cache_control field is ignored harmlessly.
  • Other providers via LiteLLM: LiteLLM normalizes / strips unknown fields.

_fmt_messages was extended to handle both string and list-of-blocks content shapes for debug output.
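A hedged sketch of what that dual handling can look like; the actual _fmt_messages in compiler.py may differ in naming, truncation, and output format:

```python
def _fmt_messages(messages: list[dict]) -> str:
    """Render messages for debug logs, handling both content shapes."""
    lines = []
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # list-of-blocks shape: join block texts, flagging cache markers
            parts = []
            for block in content:
                text = block.get("text", "")
                if "cache_control" in block:
                    text += " [cache_control]"
                parts.append(text)
            content = " ".join(parts)
        lines.append(f"{msg['role']}: {content[:120]}")
    return "\n".join(lines)
```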

Test plan

  • Existing 232 tests pass unchanged (mocks accept *args, **kwargs).
  • New TestCacheControl class — 2 tests (assertion style sketched after this list):
    • test_short_doc_marks_doc_and_summary captures messages from sync + async LiteLLM calls and asserts cache_control markers are present on doc_msg (all 3 sync calls, all async calls) and on the assistant summary (plan + concept calls).
    • test_long_doc_marks_doc_message asserts the breakpoint on doc_msg for the long-doc path.
  • Full suite: 234 passed (1.05s).
  • Manual smoke test against a real Anthropic API key: observe cached_tokens in prompt_tokens_details on calls 2..N (left for the reviewer).
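The assertion style is roughly the following; this is a hedged sketch under an assumed capture shape, not the PR's actual test code, which patches the compiler's sync and async LiteLLM calls through its own fixtures.

```python
def _has_cache_marker(message: dict) -> bool:
    """True if any content block on the message carries a cache_control field."""
    content = message.get("content")
    if not isinstance(content, list):
        return False
    return any("cache_control" in block for block in content)


def assert_doc_and_summary_marked(captured_messages: list[list[dict]]) -> None:
    # captured_messages: one `messages` list per LLM call, as recorded by the
    # test's mocks (a hypothetical shape; the PR's fixtures may differ).
    for call in captured_messages:
        user_msgs = [m for m in call if m["role"] == "user"]
        assert any(_has_cache_marker(m) for m in user_msgs), "doc_msg not marked"
    # plan + concept calls additionally include the marked assistant summary
    for call in captured_messages[1:]:
        assistant_msgs = [m for m in call if m["role"] == "assistant"]
        assert any(_has_cache_marker(m) for m in assistant_msgs), "summary not marked"
```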

Out of scope

  • OpenRouter Response Caching (X-OpenRouter-Cache: true header) — different mechanism, evaluated separately.
  • Splitting compiler.py (~847 lines; the file already exceeded the 800-line guideline, and this change adds ~35 lines) — recommend a follow-up that extracts compiler/messages.py.

Refs #37

feat(compiler): add cache_control breakpoints for Anthropic prompt caching

Compiler reuses base context A (system + doc) across N+M+2 LLM calls per
document. Without cache_control markers, every call rebills the full
document content as input tokens.

Adds two breakpoints:
- end of doc_msg: caches (system + doc) for summary, plan, every concept
- end of assistant summary: caches (system + doc + summary) for plan and
  every concept generation call

For non-Anthropic providers, the list-of-blocks payload is a valid
OpenAI-compatible shape; LiteLLM normalizes cache_control away.

Side fix: _llm_call_async now forwards **kwargs for parity with _llm_call
(memory observation #82886).

Refs VectifyAI#37
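For context, the **kwargs parity fix mentioned above is roughly this shape; a hedged simplification, not the actual diff, since the real wrappers in compiler.py carry more logic:

```python
import litellm


def _llm_call(model: str, messages: list[dict], **kwargs):
    return litellm.completion(model=model, messages=messages, **kwargs)


async def _llm_call_async(model: str, messages: list[dict], **kwargs):
    # Previously the async wrapper dropped **kwargs; forwarding them keeps
    # both wrappers accepting the same call-time options.
    return await litellm.acompletion(model=model, messages=messages, **kwargs)
```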

pitimon commented May 4, 2026

Smoke test against live OpenRouter → Anthropic Sonnet 4.5

Ingested a fresh OCR'd 21-page Thai procedure document into a short-doc-pipeline KB. Token usage from openkb -v add (verbose mode prints per-call usage):

| Step | Input tokens | Cached tokens | Hit rate |
| --- | ---: | ---: | --- |
| summary (call 1) | 6,990 | 0 | miss (cache being written) |
| concepts-plan | 9,502 | 6,987 | 73.5% — breakpoint 1 (system + doc) |
| concept: preventive-maintenance | 8,745 | 8,542 | 97.7% — breakpoint 2 (+ summary) |
| concept: network-management | 8,741 | 8,542 | 97.7% |
| concept: network-security-zones | 8,748 | 8,542 | 97.7% |
| update: change-management | 12,603 | 8,542 | 67.8% (existing concept body extends prompt) |
| update: configuration-management | 13,736 | 8,542 | 62.2% |

Both breakpoints work as designed:

  1. End of doc_msg — every call from concepts-plan onward shows cached >= 6,987, confirming the (system + doc) prefix is being reused.
  2. End of assistant summary — every concept-generation call shows cached = 8,542 (≈ 1,555 tokens beyond breakpoint 1), confirming the (system + doc + summary) prefix is reused across the N+M concept-page calls.

Total tokens cached on this single ingest: ~50,200 across 6 reuse calls. For a typical document with 5+ concept calls, the cached prefix is paid once and reused at the discounted rate for all subsequent calls — matching the ~90% savings expected for Anthropic prompt caching.

Routed via openrouter/anthropic/claude-sonnet-4.5; LiteLLM passes the cache_control markers through to Anthropic transparently.
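For a reviewer repeating the smoke test, the per-call numbers above can be read off a LiteLLM response roughly like this. A hedged sketch: field names follow the OpenAI-compatible usage shape, and their availability depends on the provider route and LiteLLM version.

```python
import litellm

# Minimal stand-in for the compiler-built messages with a cache_control block.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "document text here", "cache_control": {"type": "ephemeral"}},
    ],
}]

response = litellm.completion(
    model="openrouter/anthropic/claude-sonnet-4.5",
    messages=messages,
)

usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) if details else 0
print(f"input={usage.prompt_tokens} cached={cached}")
```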
