Model-independent verification for AI-coupled work.
Touchstone names the practice of measuring AI outputs without depending on AI to judge AI. It is one of two open reference artifacts published by Clarethium:
- Touchstone validates work against quality standards.
- Lodestone orients practice.
The Touchstone Standard specifies eleven measurement layers for output profiling: structural composition, claim density, source matching, grounding decomposition, and others. Ten of the eleven use deterministic regex, structural analysis, and arithmetic; one uses an optional LLM API for baseline generation only. The methodology rests on the principle that an auditor cannot be made of the same material as the audited.
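To make "deterministic regex and arithmetic" concrete, here is a toy sketch of a Layer-4-style unsourced-number rate. This is an illustration of the idea only, not the library's implementation; the actual number-normalization rules in the Standard will differ:

```python
import re

# Matches integers, comma-grouped numbers, and decimals: 12, 5,000, 94.2
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def unsourced_rate(text: str, source: str) -> float:
    """Fraction of numbers in `text` that never appear in `source`."""
    text_nums = NUMBER.findall(text)
    if not text_nums:
        return 0.0
    source_nums = set(NUMBER.findall(source))
    missing = [n for n in text_nums if n not in source_nums]
    return len(missing) / len(text_nums)
```

Every step is regex extraction plus arithmetic: no model is consulted, so the measurement cannot inherit the biases of the system being measured.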
This is a reference specification plus reference implementation. The Standard is the canonical text. The clarethium-touchstone library is the reference Python implementation.
This repository contains:
- Touchstone Standard - the canonical specification (CC-BY 4.0) at STANDARDS/touchstone-1.0.md
- clarethium-touchstone - the Python reference implementation (Apache 2.0)
The Standard defines the methodology. The library implements it. Other implementations conforming to the Standard are welcome.
Pre-launch on PyPI; the reference implementation is feature-complete on main. All eleven measurement layers from Standard 1.0 are implemented and tested (375 tests, all CI green: lint, type check, tests on Python 3.10/3.11/3.12, build distribution). Two validation benchmarks ship with the source; anyone who clones the repo can reproduce the results.
PyPI organization application pending approval. Until then, install from source:
```shell
git clone https://github.com/Clarethium/touchstone.git
cd touchstone
pip install -e .
```

Basic usage:

```python
from clarethium_touchstone import measure

text = "Revenue grew 12% to $143M with 25% margins reported."
source = "Revenue grew 12% to $143M with 25% margins."
result = measure(text, source=source)

# Layer 4: number provenance
result["source_matching"]["unsourced_rate"]  # 0.0 - every number in source

# Layer 11: per-sentence Grounded / Framed / Projected decomposition
result["grounding_decomposition"]["proportions"]  # {"G": 1.0, "F": 0.0, "P": 0.0}
result["grounding_decomposition"]["has_projection"]  # False
```

The composite quality profile (Layer 10) requires at least 10 numbers in the text for the source-fidelity contribution to qualify. For the substance-vs-presentation gap signal, supply a longer document:
```python
text = (
    "Revenue grew 12% to $143M with 25% margins reported. "
    "Costs declined 8% across 5,000 employees over 18 months. "
    "Headcount reached 2,500 with $45,000 average compensation paid. "
    "Customer acquisition cost dropped to $1,200 from baseline. "
    "Retention improved 7.5% to 94.2% across all major segments."
)
result = measure(text, source=text)

result["quality_profile"]["substance_index"]  # 1.0 (self-source, all numbers grounded)
result["quality_profile"]["gap"]  # negative - substance exceeds presentation
result["quality_profile"]["components_available"]  # ["source_fidelity", "assertiveness", ...]
```

Layer 11's scope_assessment field tells you which signal to trust on a given source. The derivation checker saturates as the source's unique-number count grows: on number-dense sources (≥10 unique numbers) the primary unsourced-numbers P-signal effectively saturates, and you should cross-reference Layer 4 source matching for numerical fabrication. The classifier is also exposed standalone:
```python
from clarethium_touchstone import assess_derivation_regime

assessment = assess_derivation_regime(source_num_count=14)
assessment["derivation_regime"]  # "saturated"
assessment["cross_reference_layer_4_for_numbers"]  # True
assessment["note_user_facing"]  # UX-safe explanation
```

Boundaries are empirically validated against EXP-095 Monte Carlo data: < 5 = diagnostic, [5, 10) = transition, ≥ 10 = saturated.
For Layer 1a (heading defaultness), supply your own LLM client as a callable (vendor-neutral):

```python
def baseline_generator(prompt: str) -> str | None:
    # Your LLM call here. Return generated text or None on failure.
    return your_llm_client.generate(prompt, temperature=1.0)

result = measure(
    text,
    source=source,
    topic="quarterly earnings analysis",
    baseline_generator=baseline_generator,
)
result["structural_profile"]["heading_defaultness"]
# {"jaccard_overlap": 0.33, "is_default": False, "n_baseline_documents": 3}
```

The reference implementation covers every layer in Standard Section 5:
| Layer | Function | Requires |
|---|---|---|
| 1a heading defaultness | structural_profile | topic + baseline_generator |
| 1b mechanism ratio | structural_profile | text |
| 1c assertion ratio | structural_profile | text |
| 2 claim density | claim_density | text |
| 3 temporal instability | temporal_instability | text + comparisons |
| 4 source matching | source_matching | text + source |
| 5 entity provenance | entity_provenance | text + source |
| 6 vocabulary proximity | vocabulary_proximity | text + source |
| 7 presentation features | presentation_features | text |
| 8 epistemic calibration | epistemic_calibration | text + source |
| 9 information novelty | information_novelty | text |
| 10 quality profile composite | quality_profile | text (substance from L3/L4/L5/L8) |
| 11 grounding decomposition | grounding_decomposition | text + source |
The top-level measure() orchestrator runs every layer whose preconditions are met. Layers whose preconditions are not met return None for their key in the MeasureResult dict.
Standard Section 6 (Specification Compliance) is not part of v0.1. The align() and profile() APIs are reserved for Standard 1.1. Touchstone v0.1 ships measurement only.
Two reproducible benchmarks ship in `benchmarks/`. Anyone with a clone of the repo can run `python -m benchmarks.exp_081_discrimination.run` and `python -m benchmarks.exp_095_grounding.run` and reproduce the published numbers exactly.
The methodology's core claim - that the composed quality_profile gap signal discriminates faithful AI outputs from embellished ones - is testable. The original EXP-081 paper measured this on a 12-document corpus and reported Cohen's d = -5.43 (CI [-9.077, -4.681]).
Touchstone reproduces this with d = -5.238 on the same corpus:
| Metric | Faithful (N=6) | Embellished (N=6) |
|---|---|---|
| Mean gap (Touchstone) | -0.4377 | +0.1585 |
| Mean gap (published) | -0.443 | +0.169 |

| Comparison vs published | Result |
|---|---|
| Cohen's d | -5.238 (published: -5.43) |
| Per-doc gap-direction agreement | 100% (12/12) |
| MAE (unsourced_rate / gap / substance / presentation) | 0.014 / 0.010 / 0.010 / 0.000 |
This is end-to-end empirical evidence that Layers 4, 5, 7, and 8, composed via Layer 10 (quality_profile), reproduce the published validation result with no LLM calls.
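For readers checking the effect-size arithmetic: the statistic reported above is a Cohen's d, whose standard pooled-standard-deviation form is sketched below (the benchmark's exact estimator and its CI method may differ):

```python
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with pooled sample standard deviation:
    (mean(a) - mean(b)) / s_pooled."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5
```

A magnitude above 0.8 is conventionally "large"; values near 5 mean the faithful and embellished gap distributions barely overlap at all.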
Layer 11 (grounding_decomposition) classifies each sentence as Grounded / Framed / Projected. Validated against 13 hand-classified outputs across 3 source documents and 3 model families:
- P-direction agreement: 100% on existence (P>0 vs P=0). Touchstone never disagrees with the manual classification on whether projected content exists in an output. Per-output P magnitude differs from the manual range on 4/13 outputs; see the per-output table in benchmarks/exp_095_grounding/README.md for the magnitude breakdown.
- MAE vs documented detector v0.3.1: 0.02-0.04 across G/F/P categories in aggregate
Per-output drift between Touchstone and the published detector_v031 figures is documented in benchmarks/exp_095_grounding/README.md, including a known case where the v1.4.1 derivation fix does not fully reproduce in the current implementation.
Both benchmarks pin a dated JSON snapshot via a byte-match pytest assertion, so CI catches silent regressions from any future change affecting per-doc predictions.
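The snapshot-pinning pattern can be sketched as follows. This illustrates the byte-match idea only; the function name and serialization settings here are assumptions, not the repository's actual test code:

```python
import json
from pathlib import Path

def check_snapshot(results: dict, snapshot_path: Path) -> bool:
    """Serialize results deterministically and byte-compare against the
    pinned snapshot file. Any per-doc prediction drift flips this to False."""
    current = json.dumps(results, sort_keys=True, indent=2).encode("utf-8")
    return current == snapshot_path.read_bytes()
```

Byte-level comparison is deliberately strict: it catches formatting and rounding drift as well as genuine prediction changes, forcing any difference to be reviewed.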
Intended uses:
- AI integrity research and benchmarking
- Internal AI quality verification at organizations
- Substrate enforcement for AI-coupled work platforms
- Independent third-party verification of AI vendor claims
- Educational use in AI methodology courses
LLM-as-judge approaches use AI to evaluate AI output. Touchstone uses regex, structural analysis, source matching, and arithmetic. The substrate does not depend on the model being measured. This matters when the auditor cannot be made of the same material as the audited.
Licensing:
- Standard: CC-BY 4.0 (content)
- Library: Apache 2.0
Touchstone composes with the other Clarethium open reference artifacts:
- Lodestone: methodology canon. The first-person practice that pairs with Touchstone's third-person measurement.
- cma: executable compound-practice loop. Companion to Lodestone, surfacing relevant prior captures at the moment of action.
- Sealstone: verification methodology for AI-assisted publish-class work. A specialization in the Lodestone tradition for the publish boundary; integrates Touchstone-class measurement at Tier 0 of its three-tier verification ladder.
Touchstone is also the substrate underneath Frame Check, Clarethium's applied frame-validation tool.
- Clarethium: methodology umbrella, mothership.
- Documentation: https://touchstone.clarethium.com
See CONTRIBUTING.md for the contribution process. Standard changes follow the Suggestion workflow modeled on PEP-1 and BIP-1.
When citing the Standard:
Touchstone Standard 1.0 (2026), Clarethium.
https://github.com/Clarethium/touchstone/blob/main/STANDARDS/touchstone-1.0.md
When citing the library: BibTeX entry will be provided with the first published release.