A kernel routing and verification system for language models.
A caching layer for verified reasoning patterns.
When a language model solves a task successfully, KernelWeave:
- Stores the reasoning pattern as a typed kernel (sketched after this list)
- Verifies future outputs against postconditions before caching
- Routes similar prompts to cached kernels
- Accumulates verified competence over time
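To make "typed kernel" concrete, here is a minimal sketch of what a stored kernel record could look like. The field names and the smoothing rule are illustrative assumptions, not the actual on-disk schema.

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    """Hypothetical kernel record; fields are illustrative, not the real schema."""
    kernel_id: str                 # stable identifier for routing and feedback
    task_signature: str            # short description of the task family
    prompt_embedding: list[float]  # embedding used for semantic routing
    system_prompt: str             # reasoning pattern injected at generation time
    postconditions: list[str]      # constraints a cached output must satisfy
    successes: int = 0             # feedback accumulation
    failures: int = 0

    def confidence(self) -> float:
        """Laplace-smoothed success rate, usable as a trust score."""
        return (self.successes + 1) / (self.successes + self.failures + 2)
```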
- Semantic routing — Embedding-based similarity + calibration scoring (see the routing sketch after this list)
- Postcondition verification — Checks outputs against kernel constraints
- Feedback accumulation — Records success/failure for each kernel
- Auto-promotion — High-confidence repeated successes become candidate kernels
- Model-agnostic — Works with OpenAI, Anthropic, or any OpenAI-compatible backend
- Runnable training bundle — A pure-Python, Kaggle-safe calibration/tracing path that keeps the kernel architecture executable without HF/CUDA wheel drama
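As one way to picture "embedding-based similarity + calibration scoring", the sketch below scores each kernel by cosine similarity weighted by its calibrated confidence, reusing the hypothetical `Kernel` above. The combination rule and the 0.75 threshold are assumptions for illustration, not the shipped routing logic.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route(prompt_embedding: list[float], kernels: list[Kernel], min_score: float = 0.75):
    """Return the best-matching kernel, or None to fall back to plain generation."""
    scored = [(cosine(prompt_embedding, k.prompt_embedding) * k.confidence(), k)
              for k in kernels]
    if not scored:
        return None
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score >= min_score else None
```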
Kernels are extracted from prompts + response text, not from observed reasoning traces. The model's actual chain-of-thought, tool calls, and intermediate states aren't captured; capturing them would require structured generation or a separate observation layer.
If you have a "structured comparison" kernel and an "evidence extraction" kernel, there's no mechanism to combine them. Composition over a kernel algebra is an open research problem.
The kernel informs the model via system prompt, but doesn't constrain the output space during generation. The model can still output anything — verification happens after the fact. A frontier system would use kernels as structured decoding constraints during token generation.
"Postcondition verification as a routing signal."
The novel part isn't the routing (that's retrieval) or the kernels (that's program synthesis). It's using verification against formal postconditions to decide whether to trust cached reasoning for future prompts.
Testable claim: For repeated task families, routing + verification beats vanilla RAG on output quality and cost.
- Benchmark — Run on ToolBench, AgentBench, or a custom benchmark with repeatable tasks
- Baselines — Compare against vanilla RAG, BM25 retrieval, no verification
- Metrics — Routing precision/recall (sketched below), output quality, cost per query
- Ablations — Does verification actually improve routing decisions?
Without numbers, it's a prototype. With numbers, it's a 4-page workshop paper with a clear contribution.
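For the routing metrics in the list above, precision and recall fall out of logged routing decisions plus post-hoc verification. A sketch, assuming each log entry records whether a kernel was used, whether a suitable kernel existed, and whether the output verified (illustrative definitions; a real benchmark would pin these down per task):

```python
def routing_precision_recall(decisions):
    """decisions: iterable of (routed, kernel_available, verified) booleans.

    precision — of prompts routed to a kernel, the fraction whose output
                passed postcondition verification
    recall    — of prompts where a suitable kernel existed, the fraction
                actually routed to one
    """
    decisions = list(decisions)
    routed = [d for d in decisions if d[0]]
    available = [d for d in decisions if d[1]]
    precision = sum(1 for d in routed if d[2]) / len(routed) if routed else 0.0
    recall = sum(1 for d in available if d[0]) / len(available) if available else 0.0
    return precision, recall
```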
```bash
# Initialize kernel store
python -m kernelweave.cli init ./store
python -m kernelweave.cli add-sample ./store

# Run with model backend
python -m kernelweave.cli model run qwen0_5 "compare two artifacts" \
  --kernel-store ./store \
  --auto-compile

# Verify routing
python -m kernelweave.cli plan ./store "summarize differences between files"
```

The runtime pipeline:

```
prompt → embed → kernel match → execute kernel OR generate
        ↓
   verify output
        ↓
   record feedback
        ↓
   auto-promote if high confidence
```
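The pipeline above maps onto a loop like this. It is a schematic of the control flow, not `kernelweave/runtime.py` itself; `model.embed`, `model.generate`, `store.kernels`, `store.promote`, and the substring-based `check` are hypothetical stand-ins, and `route()` is the sketch from earlier.

```python
def check(output: str, postcondition: str) -> bool:
    # Placeholder: a real verifier would parse and evaluate the constraint,
    # not substring-match it.
    return postcondition in output

def handle(prompt: str, store, model, promote_threshold: float = 0.9) -> str:
    embedding = model.embed(prompt)                # prompt → embed
    kernel = route(embedding, store.kernels())     # kernel match
    if kernel is None:
        return model.generate(prompt)              # no match: plain generation
    output = model.generate(prompt, system=kernel.system_prompt)  # execute kernel
    ok = all(check(output, pc) for pc in kernel.postconditions)   # verify output
    kernel.successes += int(ok)                    # record feedback
    kernel.failures += int(not ok)
    if ok and kernel.confidence() >= promote_threshold:
        store.promote(kernel)                      # auto-promote if high confidence
    return output
```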
- kernelweave/runtime.py — Routing + verification
- kernelweave/kernel.py — Kernel store + feedback accumulation
- kernelweave/calibration.py — Logistic regression confidence model
- kernelweave/llm/model.py — Model wrapper with kernel-aware prompts
- kernelweave/training/ — Pure-Python synthetic training / calibration path
- High repetition — Customer support, data pipelines, routine analysis
- Low repetition — Creative writing, one-off research, exploratory chat
The overhead of compilation, matching, and verification only pays off when the same task family appears many times.
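A back-of-envelope model of that break-even point (every number below is made up for illustration): if compiling a kernel costs a fixed amount and each reuse saves a little more than it costs to verify, the kernel pays for itself after roughly the compile cost divided by the net saving per hit.

```python
# All numbers are illustrative assumptions, not measurements.
compile_cost   = 0.020  # one-time cost to extract + store a kernel ($)
verify_cost    = 0.001  # per-query verification overhead ($)
saving_per_hit = 0.005  # generation cost saved when a cached kernel is reused ($)

# Each reuse nets (saving_per_hit - verify_cost), so break-even is:
break_even = compile_cost / (saving_per_hit - verify_cost)
print(f"pays off after ~{break_even:.0f} reuses of the task family")  # -> ~5
```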
Working prototype. Not frontier. Not a trained model. A useful piece of infrastructure for LLM-based systems with repeated tasks.
For the full pre-restructure debugging trail, see docs/RESTRUCTURING_HISTORY.md.