A token-native knowledge base designed for LLM scale.
ContextFit keeps everything—storage, indexing, search, relationships, traversal, and commonality detection—inside discrete token-ID space until the very last step, when you decode only the final retrieved token chunks for the LLM's output.
- ~2× smaller storage than raw text (no repeated tokenization)
- Blazing-fast integer-only operations (no float embeddings)
- Hierarchical "geo-map-style" traversal for multi-hop reasoning
- Neural-network-like chunk relationships via token overlap graphs
- Automatic commonality discovery without vector spaces
- Direct LLM injection — feed `input_ids` directly, no conversion
- Structure-aware ingestion — Markdown sections, text paragraphs, and TMD ledger rows become retrievable units
┌─────────────────────────────────────────────────────────────────┐
│ ContextFit │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Storage │ │ Index │ │ Graph │ │
│ │ │ │ │ │ │ │
│ │ Token Arrays│ │ Inverted │ │ Chunk Relationships │ │
│ │ Chunk Store │ │ Suffix/FM │ │ Community Detection │ │
│ │ Compression │ │ BM25 Tokens │ │ Commonality Mining │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
│ │
│ ┌─────────────────────────────┐ ┌─────────────────────────┐ │
│ │ Hierarchy │ │ Retrieval │ │
│ │ │ │ │ │
│ │ Level 0: Raw Chunks │ │ Query Tokenization │ │
│ │ Level 1+: Summary Clusters │ │ Graph Traversal │ │
│ │ Geo-Map Navigation │ │ Direct input_ids Output │ │
│ └─────────────────────────────┘ └─────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ Semantic IDs (SIDs) ││
│ │ ││
│ │ Hierarchical token sequences → generative retrieval ││
│ │ Similar chunks share prefixes → trie-like navigation ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
- Token arrays (uint16/uint32 IDs)
- Memory-mapped files for large corpora
- Delta encoding + Zstd compression
- Chunk metadata headers
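The storage ideas above can be sketched in a few lines. This is illustrative only, not ContextFit's on-disk code: `zlib` stands in for Zstd so the sketch stays stdlib-only, and delta encoding is shown on a sorted chunk-offset table, where consecutive values are close and the gaps compress well.

```python
import zlib
from array import array

def pack_chunk(token_ids):
    """Pack a chunk as a fixed-width uint16 token array and compress it.

    zlib stands in for Zstd here; swap in the zstandard package
    for the real codec.
    """
    raw = array("H", token_ids).tobytes()   # uint16 token IDs
    return zlib.compress(raw, level=9)

def unpack_chunk(blob):
    out = array("H")
    out.frombytes(zlib.decompress(blob))
    return list(out)

def delta_encode(sorted_offsets):
    """Delta-encode a sorted offset table: small gaps compress well."""
    return [sorted_offsets[0]] + [
        b - a for a, b in zip(sorted_offsets, sorted_offsets[1:])
    ]

tokens = [1012, 2045, 317, 2045, 1012, 88]
assert unpack_chunk(pack_chunk(tokens)) == tokens
assert delta_encode([0, 2048, 4096, 6200]) == [0, 2048, 2048, 2104]
```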
- Inverted Index: tokenID → [(chunkID, positions)] using Roaring bitmaps
- Suffix Array / FM-Index: Instant exact n-gram search
- BM25 on Tokens: TF-IDF scoring with token IDs as terms
- Binary postings pack: one compact `postings.bin` instead of JSON-per-token files
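A minimal sketch of the postings and scoring ideas, with plain Python dicts standing in for Roaring bitmaps (names and layout are illustrative, not ContextFit's API). Because terms are token IDs, the statistics stay integer-valued until the final score.

```python
import math
from collections import defaultdict

def build_postings(chunks):
    """token_id -> {chunk_id: [positions]}.

    Plain dicts stand in for Roaring bitmaps in this sketch.
    """
    postings = defaultdict(lambda: defaultdict(list))
    for chunk_id, token_ids in chunks.items():
        for pos, tok in enumerate(token_ids):
            postings[tok][chunk_id].append(pos)
    return postings

def bm25_score(query_ids, chunk_id, chunks, postings, k1=1.5, b=0.75):
    """Okapi BM25 with token IDs as terms: no float embeddings anywhere."""
    n = len(chunks)
    avgdl = sum(len(t) for t in chunks.values()) / n
    dl = len(chunks[chunk_id])
    score = 0.0
    for tok in set(query_ids):
        hits = postings.get(tok, {})
        df = len(hits)                      # document frequency
        tf = len(hits.get(chunk_id, []))    # term frequency in this chunk
        if tf == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

chunks = {0: [5, 9, 5, 7], 1: [9, 9, 3]}
postings = build_postings(chunks)
assert postings[9] == {0: [1], 1: [0, 1]}
assert bm25_score([5], 0, chunks, postings) > bm25_score([5], 1, chunks, postings)
```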
- Nodes = chunks (or Semantic IDs)
- Edges = token n-gram overlap, Jaccard similarity, co-occurrence
- MinHash + LSH for fast similarity without floats
- Community detection for commonality discovery
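The MinHash step can be illustrated as follows; `hashlib.blake2b` with a per-permutation salt stands in for a proper universal hash family, and the 3-gram shingling is an assumed choice:

```python
import hashlib

def minhash(token_ids, num_perm=16, n=3):
    """Integer-only MinHash signature over a chunk's token n-grams."""
    grams = {tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)}
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(repr((seed, g)).encode(),
                                digest_size=8).digest(), "big")
            for g in grams))
    return sig

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash([1, 2, 3, 4, 5, 6])
b = minhash([1, 2, 3, 4, 9, 9])
assert jaccard_estimate(a, a) == 1.0
assert 0.0 <= jaccard_estimate(a, b) <= 1.0
```

A real LSH layer would additionally split each signature into bands and bucket chunks that collide on any band, so near-duplicates surface without pairwise comparison.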
- Level 0: Raw token chunks (256–1024 tokens each)
- Level 1+: Clustered summaries as token sequences
- GraphRAG-style community summaries
- Integer pointers for zoom navigation
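One way to picture zoom navigation, assuming a simple dict-based layout (the actual pointer format is not specified here): summary nodes hold integer pointers to their children, and zooming follows those pointers down to Level 0 chunks.

```python
# Assumed layout, for illustration only: {level: {node_id: node}}.
hierarchy = {
    1: {100: {"summary_tokens": [7, 7, 7], "children": [0, 1]}},
    0: {0: {"tokens": [5, 9]}, 1: {"tokens": [9, 3]}},
}

def zoom(level, node_id):
    """Follow integer pointers from a summary node down to raw chunk IDs."""
    if level == 0:
        return [node_id]
    ids = []
    for child in hierarchy[level][node_id]["children"]:
        ids.extend(zoom(level - 1, child))
    return ids

assert zoom(1, 100) == [0, 1]
```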
- Tokenize query → search indexes → traverse graph → collect token IDs
- Feed directly as `input_ids` to any LLM
- No detokenization until final generation
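The end of the pipeline stays in token space. A sketch of the final assembly step, with hypothetical names (`sep_id` is an assumed separator token): retrieved chunk token IDs are concatenated ahead of the query's and handed to the model as `input_ids`, with no detokenize/retokenize round trip.

```python
def assemble_input_ids(query_ids, retrieved_chunks, sep_id=2):
    """Concatenate retrieved chunk token IDs ahead of the query.

    The result can go straight into a model call such as
    model(input_ids=ids); nothing is decoded to text until
    the model generates its answer.
    """
    ids = []
    for chunk in retrieved_chunks:
        ids.extend(chunk)
        ids.append(sep_id)
    ids.extend(query_ids)
    return ids

assert assemble_input_ids([7, 8], [[1, 2], [3]]) == [1, 2, 2, 3, 2, 7, 8]
```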
- TMD ledger (`.tmd`) files chunk by rows while preserving schema/front matter context; TMD ledger is a new ContextFit-proposed Tabular Markdown file format for row-addressable, human-readable ledgers
- `.md` files chunk by heading/section boundaries and attach `heading_path` metadata
- `.txt` files chunk by paragraphs/separators with paragraph-level overlap
- `.json`/`.jsonl` files chunk by object/event records while preserving path, line, and index metadata
- `.csv`/`.tsv` files chunk by source rows while preserving headers as fields
- `.eml` files chunk email messages with sender, recipient, subject, and date context preserved
- `.ics` files chunk calendar events with summary, time, location, recurrence, and attendee metadata
- Code files (`.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.c`, `.cpp`, `.sh`, `.sql`, `.css`, `.html`, and more) chunk by generic symbol/import boundaries with language, symbol, and line-range metadata
- Unknown formats fall back to conservative token windows
- Assign each chunk a short hierarchical SID token sequence
- Similar chunks share prefixes via MinHash-band residual buckets
- Resolve generated/predicted SID prefixes through a trie with prefix backoff
- Retrieval mode: `--method sid` or hybrid SID + BM25
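Prefix backoff can be sketched with a flat dict keyed by SID prefixes (a simplification of a real trie): when a predicted SID was never assigned, the resolver retries progressively shorter prefixes.

```python
def resolve_sid(trie, sid, max_backoff=2):
    """Resolve a predicted SID; back off to shorter prefixes when
    the full sequence was never assigned."""
    for cut in range(len(sid), max(0, len(sid) - max_backoff) - 1, -1):
        hit = trie.get(tuple(sid[:cut]))
        if hit:
            return hit   # chunk IDs stored under this prefix
    return []

trie = {(4, 1): [10, 11], (4, 1, 7): [10]}
assert resolve_sid(trie, [4, 1, 7]) == [10]
assert resolve_sid(trie, [4, 1, 9]) == [10, 11]   # backed off to (4, 1)
```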
- Predicts SID prefixes from query tokens without detokenizing
- Combines BM25 candidate chunks, MinHash similarity, and LSH neighbors
- Candidate chunks vote for hierarchical SID prefixes
- Returns generated SID predictions plus resolved chunk IDs
- Trains a sparse token→SID associative model from stored chunks
- Uses beam search over valid SID prefixes
- No neural dependency yet; still token-native and deterministic
- CLI: `contextfit ingest ./docs --train-sid-generator`
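The voting step above can be sketched as a prefix tally (illustrative only; the real hybrid combines BM25, MinHash, and LSH evidence before voting):

```python
from collections import Counter

def vote_sid_prefixes(candidate_sids, depth=2):
    """Each candidate chunk votes with its SID; the most common
    prefix at the chosen depth wins."""
    tally = Counter(tuple(sid[:depth]) for sid in candidate_sids)
    return tally.most_common(1)[0][0]

assert vote_sid_prefixes([[4, 1, 7], [4, 1, 2], [4, 3, 0]]) == (4, 1)
```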
```bash
# Install ContextFit
pip install contextfit

# Ingest a knowledge base
contextfit ingest ./documents --tokenizer tiktoken

# Query
contextfit query "What is ContextFit?"

# Query through Semantic IDs
contextfit query "async retrieval" --method sid

# Agent-friendly machine-readable output
contextfit query "What is ContextFit?" --method hybrid --json
contextfit search "Acme renewal audit" --json --extractive auto --compact
contextfit stats --json

# Run a deterministic sample benchmark
python examples/benchmark_sample_corpus.py --docs-per-topic 100 --json

# Run needle-in-a-haystack benchmark
python examples/benchmark_needle_haystack.py --needles 20 --distractors 200 --top-k 5 --json

# Ingest and train the learned SID generator
contextfit ingest ./documents --train-sid-generator
```

For contributor installs from source, clone the repo and run `pip install -e .` from the project root.
For installing on a MacBook/OpenClaw node, see docs/MACBOOK_CLI_DEPLOY.md.
For OpenClaw integration, including the contextfit_search tool and contextfit context engine plugin, see docs/OPENCLAW_INTEGRATION.md.
For Claude Desktop, ContextFit includes a local MCP stdio server. See docs/CLAUDE_DESKTOP_MCP.md or run:
```bash
contextfit --kb ~/contextfit_kb mcp
```

`--json` is intended for OpenClaw/agent use. Query JSON includes `input_ids`, retrieved chunk metadata, SID predictions, semantic IDs, and decoded previews. `contextfit search --json --extractive auto` also returns deterministic query-focused evidence (`tmd_row`, `bullet`, or `span`) with source line and row IDs when available. Add `--compact` to emit a TMD-like `compact_context` string that preserves citations while avoiding JSON-heavy metadata in an LLM prompt.
```
contextfit_kb/
  chunks/
    chunks.bin        # zstd-compressed token-array records
    index.json        # chunk_id → byte offset/length
  inverted/
    meta.json         # corpus/index metadata
    postings.bin      # compact binary token → roaring bitmap + positions pack
  sid/
    semantic_ids.json
    learned_sid_generator.json
```
The inverted index now saves as a single binary postings pack by default. Legacy JSON-per-token indexes still load for compatibility.
🚧 Early Development — Architecture phase
- TERAG: Token-Efficient GraphRAG (3–11% token reduction)
- Semantic IDs / Generative Retrieval
- GraphRAG community detection
- Letta's token-space learning
MIT