feat(compiler): add cache_control breakpoints for Anthropic prompt caching#38
Open
pitimon wants to merge 1 commit intoVectifyAI:mainfrom
Open
feat(compiler): add cache_control breakpoints for Anthropic prompt caching#38pitimon wants to merge 1 commit intoVectifyAI:mainfrom
pitimon wants to merge 1 commit intoVectifyAI:mainfrom
Conversation
…ching Compiler reuses base context A (system + doc) across N+M+2 LLM calls per document. Without cache_control markers, every call rebills the full document content as input tokens. Adds two breakpoints: - end of doc_msg: caches (system + doc) for summary, plan, every concept - end of assistant summary: caches (system + doc + summary) for plan and every concept generation call For non-Anthropic providers, the list-of-blocks payload is a valid OpenAI-compatible shape; LiteLLM normalizes cache_control away. Side fix: _llm_call_async now forwards **kwargs for parity with _llm_call (memory observation #82886). Refs VectifyAI#37
Author
Smoke test against live OpenRouter → Anthropic Sonnet 4.5Ingested a fresh OCR'd 21-page Thai procedure document into a short-doc-pipeline KB. Token usage from
Both breakpoints work as designed:
Total tokens cached on this single ingest: ~50,200 across 6 reuse calls. For a typical document with 5+ concept calls, the cached prefix is paid once and reused at the discounted rate for all subsequent calls — matching the Routed via |
This was referenced May 4, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
cache_control: {"type": "ephemeral"}breakpoints to the compiler so Anthropic prompt caching can hit on every reuse of the base context.doc_msg: caches(system + doc)across summary, concepts plan, and every concept page call (1 + 1 + N + M calls per document).(system + doc + summary)across the plan call and every concept generation call._llm_call_asyncnow forwards**kwargsfor parity with_llm_call.Why
Compiler architecture is explicitly designed around base-context reuse (per
CLAUDE.md: "Designed around prompt-cache reuse: a single base context A reused across summary → concept-plan → concept-page calls") but no cache markers were emitted, so every call rebilled the full document content. For Anthropic Sonnet 4.5 via OpenRouter or direct, this typically reduces input cost by ~90% on the cached prefix and lowers TTFT after the first call.Compatibility
cache_controlhonored.cache_controlfield is ignored harmlessly._fmt_messageswas extended to handle both string and list-of-blocks content shapes for debug output.Test plan
*args, **kwargs).TestCacheControlclass — 2 tests:test_short_doc_marks_doc_and_summarycaptures messages from sync + async LiteLLM calls and assertscache_controlmarkers are present ondoc_msg(all 3 sync calls, all async calls) and on the assistant summary (plan + concept calls).test_long_doc_marks_doc_messageasserts the breakpoint ondoc_msgfor the long-doc path.cached_tokensinprompt_tokens_detailson calls 2..N (left for reviewer).Out of scope
X-OpenRouter-Cache: trueheader) — different mechanism, evaluated separately.compiler.py(~847 lines, pre-existing >800 condition deepened by ~35 lines) — recommend follow-up extractingcompiler/messages.py.Refs #37