
feat(compiler): opt-in OpenRouter Response Caching for compiler LLM calls #40

Open

pitimon wants to merge 2 commits into VectifyAI:main from pitimon:feat/39-response-cache-opt-in

Conversation


@pitimon pitimon commented May 4, 2026

Summary

  • New per-KB config flags response_cache: bool = false and response_cache_ttl: int | None = None.
  • When enabled and the active model starts with openrouter/, the compiler forwards extra_headers containing X-OpenRouter-Cache: true (and optionally X-OpenRouter-Cache-TTL: <seconds>) on every LiteLLM call; see the sketch after this list.
  • OpenRouter serves cached responses to identical payloads in 80–300 ms with zero token billing (per its docs), a direct win on the compile-retry path (re-running a failed compile) and in dev iteration.
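A minimal sketch of the gating logic described above. The helper name `_response_cache_headers` and the config object shape are illustrative assumptions, not necessarily the PR's actual internals:

```python
# Illustrative sketch only: helper and config names are assumptions,
# not necessarily the identifiers this PR uses.
def _response_cache_headers(model: str, cfg) -> dict[str, str]:
    """Build OpenRouter response-cache headers, or {} when inapplicable."""
    if not getattr(cfg, "response_cache", False) or not model.startswith("openrouter/"):
        # Direct Anthropic/OpenAI/etc. calls stay byte-identical to today.
        return {}
    headers = {"X-OpenRouter-Cache": "true"}
    if cfg.response_cache_ttl is not None:
        # int() first, so YAML quoting quirks ("600" vs 600) never
        # leak into the header value.
        headers["X-OpenRouter-Cache-TTL"] = str(int(cfg.response_cache_ttl))
    return headers

# The result would then be forwarded on each LiteLLM call, e.g.:
#   litellm.completion(model=model, messages=messages,
#                      extra_headers=_response_cache_headers(model, cfg))
```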

Why

openkb add only registers a doc's hash after compilation succeeds. When compilation fails partway, the retry runs every LLM call again (summary → plan → N+M concept pages) with identical prompts. Without Response Caching, every retry rebills the full token cost. The same applies to repeated openkb lint runs and developer iteration loops.

Behaviour

  • Default OFF. Response caching stores responses on OpenRouter — incompatible with strict zero-data-retention postures (e.g. KBs holding regulated/classified content). Users opt in deliberately.
  • Headers are emitted only when model.startswith("openrouter/"). Direct Anthropic/OpenAI/etc. requests remain byte-identical to today.
  • TTL is cast to int() before stringifying, so YAML quoting quirks ("600" vs 600) don't reach the header value.
  • Complementary to feat(compiler): add cache_control breakpoints for Anthropic prompt caching #38: prompt caching reduces input cost on the cached prefix per call; response caching skips the model entirely on identical-payload re-runs. They compose.

Scope

  • compile_short_doc, compile_long_doc, _compile_concepts — the only direct LiteLLM callers in the project.
  • Out of scope: query, chat, linter — those use the OpenAI Agents SDK; routing custom headers through the SDK requires a separate, larger change.

Config example

```yaml
# .openkb/config.yaml
model: openrouter/anthropic/claude-sonnet-4.5
response_cache: true
response_cache_ttl: 600   # optional, 1..86400 seconds, OpenRouter default 300
```
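For illustration, the two flags could live on the per-KB config model roughly like this. This is a sketch assuming a pydantic-style settings class; the PR's actual class and validation may differ:

```python
from pydantic import BaseModel, Field

class KBConfig(BaseModel):
    # Hypothetical per-KB config shape; the field names match this PR's
    # flags, but the class itself is illustrative.
    model: str = "openrouter/anthropic/claude-sonnet-4.5"
    response_cache: bool = False  # default OFF: opt-in only
    response_cache_ttl: int | None = Field(
        default=None, ge=1, le=86400,  # seconds; OpenRouter's default is 300
    )
```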

Test plan

  • TestResponseCacheHeaders — 7 unit tests covering disabled, missing key, non-OpenRouter model, OpenRouter+enabled, TTL emit/omit, _build_llm_kwargs packaging.
  • TestResponseCacheIntegration — 2 end-to-end tests: flag-on forwards extra_headers on every sync LLM call; flag-off (default) emits no extra_headers (regression guard).
  • Full suite: 244 passed (10 new).
  • Manual smoke against a real OpenRouter key — flip the flag, run the same openkb add twice, observe X-OpenRouter-Cache-Status: HIT on the second run (left for reviewer).
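For flavour, a condensed version of the sort of parametrised check TestResponseCacheHeaders performs. This is illustrative and assumes the hypothetical _response_cache_headers helper sketched earlier, not the PR's actual internals:

```python
import pytest
from types import SimpleNamespace

@pytest.mark.parametrize(
    ("model", "enabled", "ttl", "expected"),
    [
        # disabled flag -> no headers, even on an OpenRouter model
        ("openrouter/anthropic/claude-sonnet-4.5", False, None, {}),
        # non-OpenRouter model -> no headers, even when enabled
        ("anthropic/claude-sonnet-4.5", True, 600, {}),
        # enabled, no TTL -> cache header only
        ("openrouter/anthropic/claude-sonnet-4.5", True, None,
         {"X-OpenRouter-Cache": "true"}),
        # enabled, quoted-YAML TTL -> both headers, TTL normalised
        ("openrouter/anthropic/claude-sonnet-4.5", True, "600",
         {"X-OpenRouter-Cache": "true", "X-OpenRouter-Cache-TTL": "600"}),
    ],
)
def test_response_cache_headers(model, enabled, ttl, expected):
    cfg = SimpleNamespace(response_cache=enabled, response_cache_ttl=ttl)
    assert _response_cache_headers(model, cfg) == expected
```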

Depends on

#38 — this PR uses the **kwargs symmetry fix on _llm_call_async. Either merge order works; if #38 lands first, this branch needs only a simple rebase.

Refs #39

itarun.p added 2 commits May 4, 2026 10:07
feat(compiler): add cache_control breakpoints for Anthropic prompt caching

Compiler reuses base context A (system + doc) across N+M+2 LLM calls per
document. Without cache_control markers, every call rebills the full
document content as input tokens.

Adds two breakpoints:
- end of doc_msg: caches (system + doc) for summary, plan, every concept
- end of assistant summary: caches (system + doc + summary) for plan and
  every concept generation call

For non-Anthropic providers, the list-of-blocks payload is a valid
OpenAI-compatible shape; LiteLLM normalizes cache_control away.

Side fix: _llm_call_async now forwards **kwargs for parity with _llm_call
(memory observation #82886).

Refs VectifyAI#37
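The breakpoint placement this commit describes looks roughly like the following, in Anthropic's prompt-caching message format as accepted by LiteLLM. Variable names are illustrative stand-ins, not the compiler's real identifiers:

```python
# Sketch of the two cache_control breakpoints; the three strings stand in
# for the compiler's real inputs.
system_prompt = "..."  # compiler system prompt
doc_text = "..."       # full document content
summary_text = "..."   # assistant summary produced by the first call

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": doc_text,
            # Breakpoint 1: caches (system + doc) for the summary, the
            # plan, and every concept call.
            "cache_control": {"type": "ephemeral"},
        }],
    },
    {
        "role": "assistant",
        "content": [{
            "type": "text",
            "text": summary_text,
            # Breakpoint 2: caches (system + doc + summary) for the plan
            # and every concept-generation call.
            "cache_control": {"type": "ephemeral"},
        }],
    },
]
```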
feat(compiler): opt-in OpenRouter Response Caching for compiler LLM calls

When response_cache is enabled in the per-KB config and the active model
is routed via openrouter/, compile_short_doc and compile_long_doc forward
extra_headers={"X-OpenRouter-Cache": "true", optional X-OpenRouter-Cache-TTL}
on every LiteLLM call. OpenRouter then returns a cached response in
80-300ms with zero token billing on identical follow-up requests, which
benefits the compile-retry path and repeated lint runs.

Default OFF — opt-in only. Response caching stores responses on
OpenRouter, which conflicts with strict zero-data-retention postures.

Skips header emission when the model is not openrouter/-routed, so direct
Anthropic/OpenAI/etc. calls remain byte-identical to before.

Scope is intentionally limited to compiler.py (the only direct LiteLLM
caller). query/chat/linter route through the OpenAI Agents SDK; threading
custom headers there is a separate change.

Refs VectifyAI#39
Depends on VectifyAI#38

pitimon commented May 4, 2026

Smoke test — two-layer verification

L1 — openkb wires the headers through

Enabled in ~/kb-isms-test/.openkb/config.yaml:

```yaml
response_cache: true
response_cache_ttl: 3600
```

Ran openkb -v add <doc.pdf> (PageIndex/long-doc path, model openrouter/anthropic/claude-sonnet-4.5). Every LLM call's debug log shows the headers attached:

```
openkb.agent.compiler DEBUG: LLM kwargs [overview]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [concepts-plan]: {'max_tokens': 1024, 'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [concept: information-security-risk-assessment]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [concept: risk-management-methodology]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [concept: threat-vulnerability-assessment]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [update: information-asset-classification]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [update: isms-implementation]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
```

Every call from compile_long_doc and _compile_concepts carries the cache headers. When the flag is left at its default (response_cache: false), no extra_headers appears in any call — verified by the regression test in this PR (TestResponseCacheIntegration::test_flag_off_no_extra_headers).

L2 — OpenRouter actually serves a cached response

Bypassed openkb to isolate the OpenRouter side. Two identical curl POSTs to /api/v1/chat/completions against anthropic/claude-haiku-4.5 with the same headers and payload, ~1 second apart:

|                           | Call 1 (cold)       | Call 2 (warm)            |
| ------------------------- | ------------------- | ------------------------ |
| X-OpenRouter-Cache-Status | MISS                | HIT                      |
| X-OpenRouter-Cache-Age    |                     | 0                        |
| prompt_tokens             | 22                  | 0                        |
| completion_tokens         | 24                  | 0                        |
| total_tokens              | 46                  | 0                        |
| cost                      | $0.000142           | $0                       |
| Response body             | 1 stable completion | byte-identical to call 1 |

Both calls returned the same message.content. Call 2 paid zero tokens and returned in well under one second, exactly matching the OpenRouter response-caching docs.
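For reference, the same two-call check can be scripted. Below is a sketch using Python requests instead of curl, with the endpoint and auth per OpenRouter's public API and the cache headers as reported above:

```python
# Sketch of the L2 verification using requests instead of curl.
# OPENROUTER_API_KEY is assumed to be set in the environment.
import os
import requests

payload = {
    "model": "anthropic/claude-haiku-4.5",
    "messages": [{"role": "user", "content": "ping"}],
}
headers = {
    "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
    "X-OpenRouter-Cache": "true",
    "X-OpenRouter-Cache-TTL": "3600",
}
for call in ("cold", "warm"):
    r = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        json=payload, headers=headers, timeout=60,
    )
    # Expect MISS then HIT, and zero total_tokens on the warm call.
    print(call,
          r.headers.get("X-OpenRouter-Cache-Status"),
          r.json()["usage"]["total_tokens"])
```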

Combined effect

Both layers compose. The retry path of openkb add (failed compile → re-run with identical prompts) and re-run lint will now hit OpenRouter's cache on every step instead of paying full token cost. Privacy default is preserved — opt-in only, and only OpenRouter-routed models receive the headers.

