
feat(compiler): add cache_control breakpoints for Anthropic prompt caching#38

Open
pitimon wants to merge 1 commit into VectifyAI:main from pitimon:feat/37-cache-control-prompt-caching

Conversation


@pitimon pitimon commented May 4, 2026

Summary

  • Adds two cache_control: {"type": "ephemeral"} breakpoints to the compiler so Anthropic prompt caching can hit on every reuse of the base context (message shapes sketched after this list).
  • Breakpoint 1 — end of doc_msg: caches (system + doc) across summary, concepts plan, and every concept page call (1 + 1 + N + M calls per document).
  • Breakpoint 2 — end of assistant summary: caches (system + doc + summary) across the plan call and every concept generation call.
  • Side fix: _llm_call_async now forwards **kwargs for parity with _llm_call.
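For orientation, here is a minimal sketch of the message shapes the two breakpoints produce. The placeholder text and variable names are illustrative only; compiler.py builds these messages with its own helpers, which this description does not show.

```python
# Hedged sketch of the breakpoint placement, not the compiler's actual code.
SYSTEM_PROMPT = "..."  # placeholder
DOC_TEXT = "..."       # placeholder
SUMMARY_TEXT = "..."   # placeholder

# Breakpoint 1: end of doc_msg. The (system + doc) prefix up to this marker
# is written to Anthropic's prompt cache on the first call and read back on
# the plan call and every concept-page call.
doc_msg = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": DOC_TEXT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
}

# Breakpoint 2: end of the assistant summary. Extends the cached prefix to
# (system + doc + summary) for the plan call and the N + M concept calls.
summary_msg = {
    "role": "assistant",
    "content": [
        {
            "type": "text",
            "text": SUMMARY_TEXT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
}

messages = [{"role": "system", "content": SYSTEM_PROMPT}, doc_msg, summary_msg]
```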

Why

The compiler architecture is explicitly designed around base-context reuse (per CLAUDE.md: "Designed around prompt-cache reuse: a single base context A reused across summary → concept-plan → concept-page calls"), but no cache markers were emitted, so every call rebilled the full document content. For Anthropic Sonnet 4.5, via OpenRouter or direct, this change typically reduces input cost by ~90% on the cached prefix and lowers TTFT after the first call.

Compatibility

  • Anthropic / OpenRouter→Anthropic: cache_control honored.
  • OpenAI: list-of-blocks content is a valid OpenAI-compatible shape; cache_control field is ignored harmlessly.
  • Other providers via LiteLLM: LiteLLM normalizes / strips unknown fields.

_fmt_messages was extended to handle both string and list-of-blocks content shapes for debug output.
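A hedged sketch of what that dual handling can look like; the actual _fmt_messages in compiler.py may differ in naming, truncation, and output format:

```python
def _fmt_messages(messages: list[dict]) -> str:
    """Render messages for debug logs, handling both content shapes."""
    lines = []
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # list-of-blocks shape: join block texts, flagging cache markers
            parts = []
            for block in content:
                text = block.get("text", "")
                if "cache_control" in block:
                    text += " [cache_control]"
                parts.append(text)
            content = " ".join(parts)
        lines.append(f"{msg['role']}: {content[:120]}")
    return "\n".join(lines)
```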

Test plan

  • Existing 232 tests pass unchanged (mocks accept *args, **kwargs).
  • New TestCacheControl class — 2 tests (assertion style sketched after this list):
    • test_short_doc_marks_doc_and_summary captures messages from sync + async LiteLLM calls and asserts cache_control markers are present on doc_msg (all 3 sync calls, all async calls) and on the assistant summary (plan + concept calls).
    • test_long_doc_marks_doc_message asserts the breakpoint on doc_msg for the long-doc path.
  • Full suite: 234 passed (1.05s).
  • Manual smoke test against a real Anthropic API key: observe cached_tokens in prompt_tokens_details on calls 2..N (left for the reviewer).
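The assertion style is roughly the following; this is a hedged sketch under an assumed capture shape, not the PR's actual test code, which patches the compiler's sync and async LiteLLM calls through its own fixtures.

```python
def _has_cache_marker(message: dict) -> bool:
    """True if any content block on the message carries a cache_control field."""
    content = message.get("content")
    if not isinstance(content, list):
        return False
    return any("cache_control" in block for block in content)


def assert_doc_and_summary_marked(captured_messages: list[list[dict]]) -> None:
    # captured_messages: one `messages` list per LLM call, as recorded by the
    # test's mocks (a hypothetical shape; the PR's fixtures may differ).
    for call in captured_messages:
        user_msgs = [m for m in call if m["role"] == "user"]
        assert any(_has_cache_marker(m) for m in user_msgs), "doc_msg not marked"
    # plan + concept calls additionally include the marked assistant summary
    for call in captured_messages[1:]:
        assistant_msgs = [m for m in call if m["role"] == "assistant"]
        assert any(_has_cache_marker(m) for m in assistant_msgs), "summary not marked"
```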

Out of scope

  • OpenRouter Response Caching (X-OpenRouter-Cache: true header) — different mechanism, evaluated separately.
  • Splitting compiler.py (~847 lines; the file already exceeded the 800-line guideline, and this change adds ~35 lines) — recommend a follow-up that extracts compiler/messages.py.

Refs #37

feat(compiler): add cache_control breakpoints for Anthropic prompt caching

Compiler reuses base context A (system + doc) across N+M+2 LLM calls per
document. Without cache_control markers, every call rebills the full
document content as input tokens.

Adds two breakpoints:
- end of doc_msg: caches (system + doc) for summary, plan, every concept
- end of assistant summary: caches (system + doc + summary) for plan and
  every concept generation call

For non-Anthropic providers, the list-of-blocks payload is a valid
OpenAI-compatible shape; LiteLLM normalizes cache_control away.

Side fix: _llm_call_async now forwards **kwargs for parity with _llm_call
(memory observation #82886).

Refs VectifyAI#37
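For context, the **kwargs parity fix mentioned above is roughly this shape; a hedged simplification, not the actual diff, since the real wrappers in compiler.py carry more logic:

```python
import litellm


def _llm_call(model: str, messages: list[dict], **kwargs):
    return litellm.completion(model=model, messages=messages, **kwargs)


async def _llm_call_async(model: str, messages: list[dict], **kwargs):
    # Previously the async wrapper dropped **kwargs; forwarding them keeps
    # both wrappers accepting the same call-time options.
    return await litellm.acompletion(model=model, messages=messages, **kwargs)
```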

pitimon commented May 4, 2026

Smoke test against live OpenRouter → Anthropic Sonnet 4.5

Ingested a fresh OCR'd 21-page Thai procedure document into a short-doc-pipeline KB. Token usage from openkb -v add (verbose mode prints per-call usage):

| Step | Input tokens | Cached tokens | Hit rate |
| --- | ---: | ---: | --- |
| summary (call 1) | 6,990 | 0 | miss (cache being written) |
| concepts-plan | 9,502 | 6,987 | 73.5% — breakpoint 1 (system + doc) |
| concept: preventive-maintenance | 8,745 | 8,542 | 97.7% — breakpoint 2 (+ summary) |
| concept: network-management | 8,741 | 8,542 | 97.7% |
| concept: network-security-zones | 8,748 | 8,542 | 97.7% |
| update: change-management | 12,603 | 8,542 | 67.8% (existing concept body extends prompt) |
| update: configuration-management | 13,736 | 8,542 | 62.2% |

Both breakpoints work as designed:

  1. End of doc_msg — every call from concepts-plan onward shows cached >= 6,987, confirming the (system + doc) prefix is being reused.
  2. End of assistant summary — every concept-generation call shows cached = 8,542 (≈ 1,555 tokens beyond breakpoint 1), confirming the (system + doc + summary) prefix is reused across the N+M concept-page calls.

Total tokens cached on this single ingest: ~50,200 across 6 reuse calls. For a typical document with 5+ concept calls, the cached prefix is paid once and reused at the discounted rate for all subsequent calls — matching the ~90% savings expected for Anthropic prompt caching.

Routed via openrouter/anthropic/claude-sonnet-4.5; LiteLLM passes the cache_control markers through to Anthropic transparently.
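For a reviewer repeating the smoke test, the per-call numbers above can be read off a LiteLLM response roughly like this. A hedged sketch: field names follow the OpenAI-compatible usage shape, and their availability depends on the provider route and LiteLLM version.

```python
import litellm

# Minimal stand-in for the compiler-built messages with a cache_control block.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "document text here", "cache_control": {"type": "ephemeral"}},
    ],
}]

response = litellm.completion(
    model="openrouter/anthropic/claude-sonnet-4.5",
    messages=messages,
)

usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) if details else 0
print(f"input={usage.prompt_tokens} cached={cached}")
```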
