
feat(compiler): opt-in OpenRouter Response Caching for compiler LLM calls #40

Open

pitimon wants to merge 2 commits into VectifyAI:main from pitimon:feat/39-response-cache-opt-in

Conversation


@pitimon pitimon commented May 4, 2026

Summary

  • New per-KB config flags response_cache: bool = false and response_cache_ttl: int | None = None.
  • When enabled and the active model starts with openrouter/, the compiler forwards extra_headers containing X-OpenRouter-Cache: true (and optionally X-OpenRouter-Cache-TTL: <seconds>) on every LiteLLM call; see the sketch after this list.
  • OpenRouter serves cached responses to identical payloads in 80–300 ms with zero token billing (per its docs), a direct win on the compile-retry path (re-running a failed compile) and in dev iteration.
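A minimal sketch of the gating logic described above. The helper name `_response_cache_headers` and the config object shape are illustrative assumptions, not necessarily the PR's actual internals:

```python
# Illustrative sketch only: helper and config names are assumptions,
# not necessarily the identifiers this PR uses.
def _response_cache_headers(model: str, cfg) -> dict[str, str]:
    """Build OpenRouter response-cache headers, or {} when inapplicable."""
    if not getattr(cfg, "response_cache", False) or not model.startswith("openrouter/"):
        # Direct Anthropic/OpenAI/etc. calls stay byte-identical to today.
        return {}
    headers = {"X-OpenRouter-Cache": "true"}
    if cfg.response_cache_ttl is not None:
        # int() first, so YAML quoting quirks ("600" vs 600) never
        # leak into the header value.
        headers["X-OpenRouter-Cache-TTL"] = str(int(cfg.response_cache_ttl))
    return headers

# The result would then be forwarded on each LiteLLM call, e.g.:
#   litellm.completion(model=model, messages=messages,
#                      extra_headers=_response_cache_headers(model, cfg))
```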

Why

openkb add only registers a doc's hash after compilation succeeds. When compilation fails partway, the retry runs every LLM call again (summary → plan → N+M concept pages) with identical prompts. Without Response Caching, every retry rebills the full token cost. The same applies to repeated openkb lint runs and developer iteration loops.

Behaviour

  • Default OFF. Response caching stores responses on OpenRouter — incompatible with strict zero-data-retention postures (e.g. KBs holding regulated/classified content). Users opt in deliberately.
  • Headers are emitted only when model.startswith("openrouter/"). Direct Anthropic/OpenAI/etc. requests remain byte-identical to today.
  • TTL is cast to int() before stringifying, so YAML quoting quirks ("600" vs 600) don't reach the header value.
  • Complementary to feat(compiler): add cache_control breakpoints for Anthropic prompt caching #38: prompt caching reduces input cost on the cached prefix per call; response caching skips the model entirely on identical-payload re-runs. They compose.

Scope

  • compile_short_doc, compile_long_doc, _compile_concepts — the only direct LiteLLM callers in the project.
  • Out of scope: query, chat, linter — those use the OpenAI Agents SDK; routing custom headers through the SDK requires a separate, larger change.

Config example

```yaml
# .openkb/config.yaml
model: openrouter/anthropic/claude-sonnet-4.5
response_cache: true
response_cache_ttl: 600   # optional, 1..86400 seconds, OpenRouter default 300
```
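For illustration, the two flags could live on the per-KB config model roughly like this. This is a sketch assuming a pydantic-style settings class; the PR's actual class and validation may differ:

```python
from pydantic import BaseModel, Field

class KBConfig(BaseModel):
    # Hypothetical per-KB config shape; the field names match this PR's
    # flags, but the class itself is illustrative.
    model: str = "openrouter/anthropic/claude-sonnet-4.5"
    response_cache: bool = False  # default OFF: opt-in only
    response_cache_ttl: int | None = Field(
        default=None, ge=1, le=86400,  # seconds; OpenRouter's default is 300
    )
```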

Test plan

  • TestResponseCacheHeaders — 7 unit tests covering disabled, missing key, non-OpenRouter model, OpenRouter+enabled, TTL emit/omit, _build_llm_kwargs packaging.
  • TestResponseCacheIntegration — 2 end-to-end tests: flag-on forwards extra_headers on every sync LLM call; flag-off (default) emits no extra_headers (regression guard).
  • Full suite: 244 passed (10 new).
  • Manual smoke against a real OpenRouter key — flip the flag, run the same openkb add twice, observe X-OpenRouter-Cache-Status: HIT on the second run (left for reviewer).
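For flavour, a condensed version of the sort of parametrised check TestResponseCacheHeaders performs. This is illustrative and assumes the hypothetical _response_cache_headers helper sketched earlier, not the PR's actual internals:

```python
import pytest
from types import SimpleNamespace

@pytest.mark.parametrize(
    ("model", "enabled", "ttl", "expected"),
    [
        # disabled flag -> no headers, even on an OpenRouter model
        ("openrouter/anthropic/claude-sonnet-4.5", False, None, {}),
        # non-OpenRouter model -> no headers, even when enabled
        ("anthropic/claude-sonnet-4.5", True, 600, {}),
        # enabled, no TTL -> cache header only
        ("openrouter/anthropic/claude-sonnet-4.5", True, None,
         {"X-OpenRouter-Cache": "true"}),
        # enabled, quoted-YAML TTL -> both headers, TTL normalised
        ("openrouter/anthropic/claude-sonnet-4.5", True, "600",
         {"X-OpenRouter-Cache": "true", "X-OpenRouter-Cache-TTL": "600"}),
    ],
)
def test_response_cache_headers(model, enabled, ttl, expected):
    cfg = SimpleNamespace(response_cache=enabled, response_cache_ttl=ttl)
    assert _response_cache_headers(model, cfg) == expected
```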

Depends on

#38 — this PR uses the **kwargs symmetry fix on _llm_call_async. Either merge order works; if #38 lands first, this branch needs only a simple rebase.

Refs #39

itarun.p added 2 commits May 4, 2026 10:07
feat(compiler): add cache_control breakpoints for Anthropic prompt caching

Compiler reuses base context A (system + doc) across N+M+2 LLM calls per
document. Without cache_control markers, every call rebills the full
document content as input tokens.

Adds two breakpoints:
- end of doc_msg: caches (system + doc) for summary, plan, every concept
- end of assistant summary: caches (system + doc + summary) for plan and
  every concept generation call

For non-Anthropic providers, the list-of-blocks payload is a valid
OpenAI-compatible shape; LiteLLM normalizes cache_control away.

Side fix: _llm_call_async now forwards **kwargs for parity with _llm_call
(memory observation #82886).

Refs VectifyAI#37
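The breakpoint placement this commit describes looks roughly like the following, in Anthropic's prompt-caching message format as accepted by LiteLLM. Variable names are illustrative stand-ins, not the compiler's real identifiers:

```python
# Sketch of the two cache_control breakpoints; the three strings stand in
# for the compiler's real inputs.
system_prompt = "..."  # compiler system prompt
doc_text = "..."       # full document content
summary_text = "..."   # assistant summary produced by the first call

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": doc_text,
            # Breakpoint 1: caches (system + doc) for the summary, the
            # plan, and every concept call.
            "cache_control": {"type": "ephemeral"},
        }],
    },
    {
        "role": "assistant",
        "content": [{
            "type": "text",
            "text": summary_text,
            # Breakpoint 2: caches (system + doc + summary) for the plan
            # and every concept-generation call.
            "cache_control": {"type": "ephemeral"},
        }],
    },
]
```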
feat(compiler): opt-in OpenRouter Response Caching for compiler LLM calls

When response_cache is enabled in the per-KB config and the active model
is routed via openrouter/, compile_short_doc and compile_long_doc forward
extra_headers={"X-OpenRouter-Cache": "true", optional X-OpenRouter-Cache-TTL}
on every LiteLLM call. OpenRouter then returns a cached response in
80-300ms with zero token billing on identical follow-up requests, which
benefits the compile-retry path and repeated lint runs.

Default OFF — opt-in only. Response caching stores responses on
OpenRouter, which conflicts with strict zero-data-retention postures.

Skips header emission when the model is not openrouter/-routed, so direct
Anthropic/OpenAI/etc. calls remain byte-identical to before.

Scope is intentionally limited to compiler.py (the only direct LiteLLM
caller). query/chat/linter route through the OpenAI Agents SDK; threading
custom headers there is a separate change.

Refs VectifyAI#39
Depends on VectifyAI#38

pitimon commented May 4, 2026

Smoke test — two-layer verification

L1 — openkb wires the headers through

Enabled in ~/kb-isms-test/.openkb/config.yaml:

```yaml
response_cache: true
response_cache_ttl: 3600
```

Ran openkb -v add <doc.pdf> (PageIndex/long-doc path, model openrouter/anthropic/claude-sonnet-4.5). Every LLM call's debug log shows the headers attached:

```
openkb.agent.compiler DEBUG: LLM kwargs [overview]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [concepts-plan]: {'max_tokens': 1024, 'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [concept: information-security-risk-assessment]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [concept: risk-management-methodology]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [concept: threat-vulnerability-assessment]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [update: information-asset-classification]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
openkb.agent.compiler DEBUG: LLM kwargs [update: isms-implementation]: {'extra_headers': {'X-OpenRouter-Cache': 'true', 'X-OpenRouter-Cache-TTL': '3600'}}
```

Every call from compile_long_doc and _compile_concepts carries the cache headers. When the flag is left at its default (response_cache: false), no extra_headers appears in any call — verified by the regression test in this PR (TestResponseCacheIntegration::test_flag_off_no_extra_headers).

L2 — OpenRouter actually serves a cached response

Bypassed openkb to isolate the OpenRouter side. Two identical curl POSTs to /api/v1/chat/completions against anthropic/claude-haiku-4.5 with the same headers and payload, ~1 second apart:

|                           | Call 1 (cold)       | Call 2 (warm)            |
| ------------------------- | ------------------- | ------------------------ |
| X-OpenRouter-Cache-Status | MISS                | HIT                      |
| X-OpenRouter-Cache-Age    |                     | 0                        |
| prompt_tokens             | 22                  | 0                        |
| completion_tokens         | 24                  | 0                        |
| total_tokens              | 46                  | 0                        |
| cost                      | $0.000142           | $0                       |
| Response body             | 1 stable completion | byte-identical to call 1 |

Both calls returned the same message.content. Call 2 paid zero tokens and returned in well under one second, exactly matching the OpenRouter response-caching docs.
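For reference, the same two-call check can be scripted. Below is a sketch using Python requests instead of curl, with the endpoint and auth per OpenRouter's public API and the cache headers as reported above:

```python
# Sketch of the L2 verification using requests instead of curl.
# OPENROUTER_API_KEY is assumed to be set in the environment.
import os
import requests

payload = {
    "model": "anthropic/claude-haiku-4.5",
    "messages": [{"role": "user", "content": "ping"}],
}
headers = {
    "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
    "X-OpenRouter-Cache": "true",
    "X-OpenRouter-Cache-TTL": "3600",
}
for call in ("cold", "warm"):
    r = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        json=payload, headers=headers, timeout=60,
    )
    # Expect MISS then HIT, and zero total_tokens on the warm call.
    print(call,
          r.headers.get("X-OpenRouter-Cache-Status"),
          r.json()["usage"]["total_tokens"])
```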

Combined effect

Both layers compose. The retry path of openkb add (failed compile → re-run with identical prompts) and re-run lint will now hit OpenRouter's cache on every step instead of paying full token cost. Privacy default is preserved — opt-in only, and only OpenRouter-routed models receive the headers.

