feat(compiler): opt-in OpenRouter Response Caching for compiler LLM calls #40
pitimon wants to merge 2 commits into VectifyAI:main from …
…ching

Compiler reuses base context A (system + doc) across N+M+2 LLM calls per
document. Without cache_control markers, every call rebills the full
document content as input tokens. Adds two breakpoints:

- end of doc_msg: caches (system + doc) for the summary, the plan, and
  every concept call
- end of the assistant summary: caches (system + doc + summary) for the
  plan and every concept-generation call

For non-Anthropic providers, the list-of-blocks payload is a valid
OpenAI-compatible shape; LiteLLM normalizes cache_control away.

Side fix: _llm_call_async now forwards **kwargs for parity with
_llm_call (memory observation #82886).

Refs VectifyAI#37
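A minimal sketch of what the two breakpoints look like on the wire, assuming LiteLLM's OpenAI-compatible list-of-blocks message shape; the prompt variables and model string are illustrative, not the PR's actual code:

```python
import litellm

# Illustrative stand-ins for the compiler's real prompt pieces.
system_prompt = "You are a documentation compiler."
doc_text = "...full document content..."
summary_text = "...assistant-produced summary of the document..."

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": doc_text,
                # Breakpoint 1 (end of doc_msg): caches (system + doc)
                # for the summary, the plan, and every concept call.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": summary_text,
                # Breakpoint 2 (end of assistant summary): caches
                # (system + doc + summary) for the plan and every
                # concept-generation call.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "Write the concept plan."},
]

# On Anthropic models the markers enable prompt caching; on other
# providers LiteLLM normalizes cache_control away, as noted above.
response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20241022",
    messages=messages,
)
```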
feat(compiler): opt-in OpenRouter Response Caching for compiler LLM calls
When response_cache is enabled in the per-KB config and the active model
is routed via openrouter/, compile_short_doc and compile_long_doc forward
extra_headers={"X-OpenRouter-Cache": "true"} (plus an optional
X-OpenRouter-Cache-TTL) on every LiteLLM call. OpenRouter then returns a
cached response in 80-300ms with zero token billing on identical
follow-up requests, which benefits the compile-retry path and repeated
lint runs.
Default OFF — opt-in only. Response caching stores responses on
OpenRouter, which conflicts with strict zero-data-retention postures.
Skips header emission when the model is not openrouter/-routed, so direct
Anthropic/OpenAI/etc. calls remain byte-identical to before.
Scope is intentionally limited to compiler.py (the only direct LiteLLM
caller). query/chat/linter route through the OpenAI Agents SDK; threading
custom headers there is a separate change.
Refs VectifyAI#39
Depends on VectifyAI#38
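In code terms, the gating above might be packaged like the following sketch. `_build_llm_kwargs` is the helper name the test plan exercises, but this body is a guess at its shape under the rules stated in this PR, not the actual diff:

```python
OPENROUTER_PREFIX = "openrouter/"

def _build_llm_kwargs(
    model: str,
    response_cache: bool,
    response_cache_ttl: int | None,
) -> dict:
    """Package optional OpenRouter cache headers for a LiteLLM call."""
    kwargs: dict = {}
    # Opt-in only, and only for openrouter/-routed models; direct
    # Anthropic/OpenAI/etc. requests stay byte-identical to before.
    if response_cache and model.startswith(OPENROUTER_PREFIX):
        headers = {"X-OpenRouter-Cache": "true"}
        if response_cache_ttl is not None:
            # int() before stringifying, so YAML quoting quirks
            # ("600" vs 600) never reach the header value.
            headers["X-OpenRouter-Cache-TTL"] = str(int(response_cache_ttl))
        kwargs["extra_headers"] = headers
    return kwargs

# Illustrative call site inside a compile step:
#   resp = litellm.completion(
#       model=model,
#       messages=messages,
#       **_build_llm_kwargs(model, cfg.response_cache, cfg.response_cache_ttl),
#   )
```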
Smoke test — two-layer verification

Layer 1 — openkb wires the headers through. Enabled `response_cache: true` and `response_cache_ttl: 3600` in the per-KB config, then ran a compile; every call from the compiler carried the expected headers.

Layer 2 — OpenRouter actually serves a cached response. Bypassed openkb to isolate the OpenRouter side: two identical curl POSTs to the OpenRouter chat-completions endpoint. Both calls returned the same response; a Python equivalent of this check is sketched below.

Combined effect: both layers compose. The retry path of …
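The layer-2 check re-created in Python rather than curl; a sketch assuming OpenRouter's standard chat-completions endpoint, with the request headers this PR forwards and the `X-OpenRouter-Cache-Status` response header the test plan watches for:

```python
import os

import httpx

payload = {
    # Any openrouter-served model works; this one is illustrative.
    "model": "anthropic/claude-3.5-sonnet",
    "messages": [{"role": "user", "content": "ping"}],
}
headers = {
    "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
    "X-OpenRouter-Cache": "true",
    "X-OpenRouter-Cache-TTL": "600",
}

for attempt in (1, 2):
    r = httpx.post(
        "https://openrouter.ai/api/v1/chat/completions",
        json=payload,
        headers=headers,
        timeout=60.0,
    )
    # Expect a MISS (or no status header) on the first call and a HIT
    # with a sub-second response on the second, identical call.
    print(attempt, r.headers.get("X-OpenRouter-Cache-Status"), r.elapsed)
```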
Summary
- Two new per-KB config keys: `response_cache: bool = false` and `response_cache_ttl: int | None = None`.
- When caching is enabled and the active model is routed via `openrouter/`, the compiler forwards `extra_headers` containing `X-OpenRouter-Cache: true` (and optionally `X-OpenRouter-Cache-TTL: <seconds>`) on every LiteLLM call.

Why

`openkb add` only registers a doc's hash after compilation succeeds. When compilation fails partway, the retry runs every LLM call again (summary → plan → N+M concept pages) with identical prompts. Without Response Caching, every retry rebills the full token cost. The same applies to repeated `openkb lint` runs and developer iteration loops.

Behaviour

- Headers are emitted only when `model.startswith("openrouter/")`. Direct Anthropic/OpenAI/etc. requests remain byte-identical to today.
- The TTL value is passed through `int()` before stringifying, so YAML quoting quirks (`"600"` vs `600`) don't reach the header value.

Scope

- In scope: `compile_short_doc`, `compile_long_doc`, `_compile_concepts` — the only direct LiteLLM callers in the project.
- Out of scope: `query`, `chat`, `linter` — those use the OpenAI Agents SDK; routing custom headers through the SDK requires a separate, larger change.

Config example
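The keys as they would appear in a per-KB config; the file layout is illustrative, with values matching the smoke test above:

```yaml
response_cache: true       # default: false (opt-in)
response_cache_ttl: 3600   # optional, seconds; when omitted, no TTL header is sent
```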
Test plan
- `TestResponseCacheHeaders` — 7 unit tests covering disabled, missing key, non-OpenRouter model, OpenRouter + enabled, TTL emit/omit, and `_build_llm_kwargs` packaging.
- `TestResponseCacheIntegration` — 2 end-to-end tests: flag-on forwards `extra_headers` on every sync LLM call; flag-off (default) emits no `extra_headers` (regression guard).
- Manual: run `openkb add` twice and observe `X-OpenRouter-Cache-Status: HIT` on the second run (left for the reviewer).
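Condensed against the `_build_llm_kwargs` sketch from the commit message above, the unit-test matrix might read like this (illustrative asserts, not the PR's literal tests):

```python
def test_build_llm_kwargs_matrix():
    # Disabled flag: no extra kwargs at all.
    assert _build_llm_kwargs("openrouter/anthropic/claude-3.5-sonnet", False, None) == {}
    # Enabled, but not an openrouter/ model: still byte-identical, no headers.
    assert _build_llm_kwargs("anthropic/claude-3-5-sonnet-20241022", True, 600) == {}
    # Enabled + OpenRouter model, no TTL: cache header only.
    kw = _build_llm_kwargs("openrouter/anthropic/claude-3.5-sonnet", True, None)
    assert kw["extra_headers"] == {"X-OpenRouter-Cache": "true"}
    # TTL present: coerced through int() and stringified.
    kw = _build_llm_kwargs("openrouter/anthropic/claude-3.5-sonnet", True, 3600)
    assert kw["extra_headers"]["X-OpenRouter-Cache-TTL"] == "3600"
```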
Depends on

#38 — uses the `**kwargs` symmetry fix on `_llm_call_async`. Either merge order works; a simple rebase is all that's needed if #38 lands first.

Refs #39