# Continuity Benchmarks

Reproducible benchmarks for structured-knowledge retrieval in long-horizon AI coding agents.

The repo asks a single question: does retrieval keyed on what the agent is about to do improve the correctness of its actions more than retrieval keyed on the user's prompt?

Three runners measure this against fictional codebases with hand-authored architectural decisions, plus the public LongMemEval dataset as an external cross-check. Everything is bring-your-own-API-keys and reproducible from npm / Python scripts.
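Concretely, the two conditions differ only in what string is handed to the retriever. A schematic sketch in TypeScript (names and the top-5 cutoff are illustrative, not the runners' actual code):

```ts
// Stand-in for any top-k retriever (e.g. the repo's shared BM25 one).
type Retriever = (query: string, k: number) => string[];

// Condition A: retrieval keyed on the user's prompt.
const promptKeyed = (retrieve: Retriever, userPrompt: string) =>
  retrieve(userPrompt, 5);

// Condition B: retrieval keyed on what the agent is about to do,
// i.e. the tool call or edit it has just planned.
const actionKeyed = (retrieve: Retriever, plannedAction: string) =>
  retrieve(plannedAction, 5);
```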
## Quick start

```bash
# 1. Clone + install
git clone https://github.com/Alienfader/continuity-benchmarks.git
cd continuity-benchmarks
npm install

# 2. Set API keys (pick whichever benchmarks you want to run)
cp .env.example .env
# edit .env:
#   ANTHROPIC_API_KEY=sk-ant-...   # required for Sonnet judge / agent
#   OPENAI_API_KEY=sk-proj-...     # required for GPT-4o agent / judge
#   GOOGLE_API_KEY=AIzaSy-...      # required for Gemini Flash / inter-judge

# 3. Run a smoke test (no API calls, uses mock client)
npm run test:smoke

# 4. Run the smallest real benchmark (~$0.10, ~13 min, Gemini Flash)
# See the LongMemEval section below for setup.
```

## Runners

| Runner | Question it answers | Output | Cost |
|---|---|---|---|
| `runners/action-alignment.ts` | Does the agent obey architectural constraints when proposing an action? | LLM-judge 1–10 across 30 prompts × 3 conditions per cell | ~$1 / cell |
| `runners/recall-over-time.ts` | Does it retain decisions across 7 noisy sessions? | Cosine similarity vs ground truth × 4 conditions per cell | ~$1 / cell |
| `runners/longmemeval.ts` | Does it work on the public LongMemEval-S benchmark? | Per-question accuracy (Gemini autoeval, optional GPT-4o re-judge) | ~$0.10 / 50 questions |
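The recall runner's "cosine similarity vs ground truth" column refers to the standard embedding-space cosine. A minimal sketch (the embedding model and vector shape are assumptions; the runner's actual scoring code is authoritative):

```ts
// Cosine similarity between two equal-length embedding vectors:
// dot(a, b) / (|a| * |b|), in [-1, 1], where values near 1 mean the
// agent's recalled decision is close to the ground-truth decision.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```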
## Full v2 cross-corpus matrix

```bash
# Single-cell smoke test (mock model, no API calls)
npm run test:smoke-v2

# Full v2 cross-corpus matrix: 2 fixtures × 2 models × 3 runs × 2 runners = 24 cells
# Wall: ~14h. Cost: ~$25-30. Set ANTHROPIC_API_KEY + OPENAI_API_KEY.
npm run bench:matrix-v2

# After the matrix completes, run the analysis script
npm run analyze:v2

# Optional: inter-judge replication with Gemini-2.5-flash (~2h, ~$2)
npm run rejudge:cross-corpus-v2
```

The matrix runners have overwrite guards: re-running against a populated reports/id-rag-parity-v2/ is a no-op by default. Set FORCE=1 to override (e.g. `FORCE=1 npm run bench:matrix-v2`).
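A sketch of the guard behaviour described above (hypothetical; the actual check lives in the matrix runners):

```ts
// Overwrite guard: skip a run when the output directory is already
// populated, unless FORCE=1 is set in the environment.
import { existsSync, readdirSync } from "node:fs";

function shouldRun(outDir: string): boolean {
  const populated = existsSync(outDir) && readdirSync(outDir).length > 0;
  return !populated || process.env.FORCE === "1";
}
```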
## LongMemEval

```bash
# 1. Download the raw dataset (280 MB) from HuggingFace
curl -sSLo fixtures/longmemeval/longmemeval_s_cleaned.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json

# 2. Generate a deterministic 50-question stratified subsample
python3 scripts/longmemeval-subsample.py

# 3. Run the benchmark (~13 min wall, free tier on Gemini Flash)
npm run bench:longmemeval -- \
  --sample fixtures/longmemeval/sample-50.json \
  --output reports/longmemeval/run-1

# 4. Optional: re-judge with GPT-4o for leaderboard-comparable numbers (~4 min, ~$1)
npm run rejudge:longmemeval -- --run-dir reports/longmemeval/run-1
```

See fixtures/longmemeval/README.md for the data setup details + caveats versus the official LongMemEval leaderboard.
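Step 2's subsample is deterministic and stratified by question type. A rough sketch of the idea, in TypeScript for consistency with the rest of this README (the real logic lives in scripts/longmemeval-subsample.py; the field names here are assumptions):

```ts
interface Question { question_id: string; question_type: string }

// Deterministic stratified subsample: group by question type, sort each
// stratum by id so the result is stable across runs, then take a
// proportional slice from every stratum. Proportional rounding may drift
// by an item or two, so the final slice caps the total at n.
function stratifiedSample(questions: Question[], n: number): Question[] {
  const strata = new Map<string, Question[]>();
  for (const q of questions) {
    const bucket = strata.get(q.question_type) ?? [];
    bucket.push(q);
    strata.set(q.question_type, bucket);
  }
  const sample: Question[] = [];
  for (const bucket of strata.values()) {
    bucket.sort((a, b) => a.question_id.localeCompare(b.question_id));
    const share = Math.round((bucket.length / questions.length) * n);
    sample.push(...bucket.slice(0, share));
  }
  return sample.slice(0, n);
}
```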
## Bring your own retrieval system

The systems/ directory accepts adapters that plug into the same fixtures and scoring scripts. Drop an adapter at systems/<name>/index.ts exporting a RetrievalSystem (one init(decisions) → Retriever function), then run the commands below (a sketch of the contract follows them):

```bash
npm run bench:custom -- --runner=recall --system=my-adapter --fixture=data-pipeline --model=gpt-4o-mini
npm run bench:compare   # produces side-by-side comparison report
```

See systems/README.md for the adapter contract and an example BM25 implementation.
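As a rough illustration of that shape (the Decision fields and the k parameter are assumptions; systems/README.md is authoritative), a naive keyword-overlap adapter might look like:

```ts
// systems/my-adapter/index.ts: hypothetical sketch, not the repo's types.
interface Decision {
  id: string;
  text: string;
}

type Retriever = (query: string, k: number) => Decision[];

interface RetrievalSystem {
  init(decisions: Decision[]): Retriever;
}

const tokenize = (s: string): string[] =>
  s.toLowerCase().split(/\W+/).filter(Boolean);

// Naive keyword-overlap retrieval: score each decision by how many
// query tokens appear in its text, then return the top-k.
const system: RetrievalSystem = {
  init(decisions) {
    const indexed = decisions.map((d) => ({
      d,
      tokens: new Set(tokenize(d.text)),
    }));
    return (query, k) => {
      const qTokens = tokenize(query);
      return indexed
        .map(({ d, tokens }) => ({
          d,
          score: qTokens.filter((t) => tokens.has(t)).length,
        }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k)
        .map(({ d }) => d);
    };
  },
};

export default system;
```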
## Results

From the v2 cross-corpus matrix (n=12 cells, paired Wilcoxon):
- Retrieval >> no retrieval. Action alignment Cohen's d = 8.94. Recall d = 11.38.
- Per-question targeted retrieval >> blanket retrieval. Recall d = 5.83 — the lift comes from what you retrieve, not when.
- Injection timing does not matter (M2 ablation). Holding retrieval data constant and varying only timing: d = -0.68, CI crosses zero.
- Cross-judge robustness. Sonnet ↔ Gemini-2.5-flash on n=1,080 paired scores: ρ = 0.722, κ = 0.558 (moderate agreement on the Landis–Koch scale). Direction preserved across judges.
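For reference, these statistics follow the standard definitions (assumed here; the analysis script is authoritative): paired Cohen's d is the mean per-cell score difference over its standard deviation, and Cohen's κ corrects observed agreement $p_o$ for the agreement $p_e$ expected by chance:

$$d = \frac{\bar{x}_{\text{diff}}}{s_{\text{diff}}}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e}$$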
From the LongMemEval-S subsample (n=50, GPT-4o re-judged):
- Baseline (no context): 14% accuracy. Continuity (BM25 top-5): 66% accuracy. +52 pp lift.
- Inter-judge (Gemini Flash vs GPT-4o) agreement 90%, κ = 0.774.
Full numbers + methodology: reports/id-rag-parity-v2/EXPERIMENTAL_GAPS_ANALYSIS_V2.md.
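The "BM25 top-5" condition above retrieves the five highest-scoring items from stored context under classic Okapi BM25. A self-contained sketch of top-k BM25 scoring (illustrative only; the repo's shared/ retriever is authoritative, and k1 = 1.2, b = 0.75 are the conventional defaults, not confirmed repo settings):

```ts
// Okapi BM25: rank documents by the sum over query terms of
// IDF(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * |d| / avgdl)).
function bm25TopK(
  docs: string[],
  query: string,
  k = 5,
  k1 = 1.2,
  b = 0.75,
): number[] {
  const tok = (s: string) => s.toLowerCase().split(/\W+/).filter(Boolean);
  const docTokens = docs.map(tok);
  const avgdl = docTokens.reduce((sum, d) => sum + d.length, 0) / docs.length;

  // Document frequency per term, for the IDF component.
  const df = new Map<string, number>();
  for (const d of docTokens) {
    for (const t of new Set(d)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  const idf = (t: string) => {
    const n = df.get(t) ?? 0;
    return Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
  };

  const scores = docTokens.map((d) => {
    const tf = new Map<string, number>();
    for (const t of d) tf.set(t, (tf.get(t) ?? 0) + 1);
    let score = 0;
    for (const t of new Set(tok(query))) {
      const f = tf.get(t) ?? 0;
      score += (idf(t) * f * (k1 + 1)) / (f + k1 * (1 - b + (b * d.length) / avgdl));
    }
    return score;
  });

  // Indices of the top-k scoring documents.
  return scores
    .map((score, i) => ({ score, i }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.i);
}
```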
## Repo layout

```
fixtures/                 Hand-authored fictional projects + LongMemEval data
  paydash-api/            v1 baseline (Express + Postgres + Redis)
  data-pipeline/          v2 (Kafka + Snowflake + Dagster)
  mobile-app/             v2 (React Native + iOS/Android)
  ml-platform/            v2 (training infra) — fixtures only, no benchmarks yet
  infra-platform/         v2 (cloud infra) — fixtures only, no benchmarks yet
  longmemeval/            External: xiaowu0162/longmemeval-cleaned subsamples
  prompts/quizzes/        20-question recall quizzes per fixture
runners/                  TypeScript + Python benchmark runners
shared/                   BM25 retriever, LLM clients, fixture loader
scripts/                  Bash drivers + Python subsampling
systems/                  Bring-your-own retrieval adapters
reports/                  Output directory (gitignored — regenerate locally)
docs/methodology.md       Protocol details for each runner
docs/reproducibility.md   End-to-end re-run instructions
```
## Citation

```bibtex
@misc{continuity-benchmarks-2026,
  author = {Continuity contributors},
  title  = {Continuity Benchmarks: Execution-Intent Memory Evaluation for AI Coding Agents},
  year   = {2026},
  url    = {https://github.com/Alienfader/continuity-benchmarks}
}
```

## License

MIT. Fixtures, runners, and scripts are all freely usable for academic and commercial work. LongMemEval fixture data is downloaded at runtime from xiaowu0162/longmemeval-cleaned (MIT licensed, separate from this repo).