feat(generators): add --deterministic flag with hybrid RDFC-1.0 + rdflib serialization #1
jdsika wants to merge 8 commits into
🔍 Adversarial Review: PR #1

Summary
A well-engineered feature with strong documentation and benchmark data. The three-phase pipeline (RDFC-1.0 → WL → rdflib) is architecturally sound but introduces significant complexity. I found 2 bugs (dead code shipped as functional features), 1 algorithmic concern in collision handling, and several design/test gaps worth addressing before merge.

🐛 Bugs & Issues

1. Dead code: The
    # Added to Generator dataclass but never checked anywhere:
    normalize_prefixes: bool = False

2. Dead code: The function is defined in

3. WL collision counter assignment depends on c14n ordering, not structure
In
    for bid in sorted(bnode_ids):  # sorted by c14n ID, NOT by structure
        digest = hashlib.sha256(sig[bid].encode("utf-8")).hexdigest()[:12]
        count = seen_hashes.get(digest, 0)
        seen_hashes[digest] = count + 1
        label = f"b{digest}" if count == 0 else f"b{digest}_{count}"
Adding an unrelated triple can change RDFC-1.0 numbering, which changes which colliding node gets the base label vs
    for bid in sorted(bnode_ids, key=lambda b: (sig[b], b)):
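The difference between the two sort orders in point 3 can be illustrated with a minimal, self-contained sketch. It is not the PR's code: `assign_labels`, `labels_by_signature`, and the deliberately weak `weak` digest are hypothetical names introduced here so that a collision between two different signatures is easy to provoke (with the real 12-character SHA-256 prefix such a collision is rare).

```python
import hashlib
from typing import Callable, Dict


def sha12(signature: str) -> str:
    """First 12 hex characters of the SHA-256 digest, as in the reviewed snippet."""
    return hashlib.sha256(signature.encode("utf-8")).hexdigest()[:12]


def assign_labels(
    sig: Dict[str, str],
    by_signature: bool,
    digest_fn: Callable[[str], str] = sha12,
) -> Dict[str, str]:
    """Map each blank-node ID to a label; repeated digests get a _N suffix."""
    # by_signature=False mimics sorting by c14n ID, True mimics the suggested fix.
    key = (lambda b: (sig[b], b)) if by_signature else None
    seen_hashes: Dict[str, int] = {}
    labels: Dict[str, str] = {}
    for bid in sorted(sig, key=key):
        digest = digest_fn(sig[bid])
        count = seen_hashes.get(digest, 0)
        seen_hashes[digest] = count + 1
        labels[bid] = f"b{digest}" if count == 0 else f"b{digest}_{count}"
    return labels


def labels_by_signature(sig, by_signature, digest_fn=sha12):
    """Re-key the result by signature so that two runs can be compared."""
    labels = assign_labels(sig, by_signature, digest_fn)
    return {sig[bid]: label for bid, label in labels.items()}


def weak(signature: str) -> str:
    """Toy digest (hypothetical): keeps one character, so "p:a" and "p:x" collide."""
    return signature[:1]


# The same two structures, numbered differently by canonicalization in two runs.
run1 = {"c14n0": "p:x", "c14n1": "p:a"}
run2 = {"c14n0": "p:a", "c14n1": "p:x"}

print(labels_by_signature(run1, False, weak))  # ID order, first run
print(labels_by_signature(run2, False, weak))  # ID order, suffixes swap
print(labels_by_signature(run1, True, weak))   # signature order, first run
print(labels_by_signature(run2, True, weak))   # signature order, unchanged
```

When two nodes share an identical signature, the `(sig[b], b)` key still falls back to the c14n ID, but those nodes are structurally indistinguishable at that point, so which one receives the suffix does not change the serialized structure.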
Layers Weisfeiler-Lehman structural hashing on top of upstream RDFC-1.0 canonicalization (linkml#3407) to produce diff-stable blank node identifiers. RDFC-1.0 remains always-on as the default serialization; the --deterministic flag adds WL hashing for version-controlled artifacts.

Three-phase pipeline (deterministic_turtle):
1. RDFC-1.0 via pyoxigraph: canonical triple ordering
2. WL structural hashing: content-based blank node IDs
3. rdflib re-serialization: idiomatic Turtle syntax

Additional --deterministic behaviours:
- deterministic_json() for JSON-LD context output
- Sorted owl:oneOf, sh:in, sh:ignoredProperties members
- Sorted any_of/exactly_one_of expression members in OWL

Fixes from review (#1):
- Remove trailing newline from context generator return values (avoids double-newline when CLI prints output)
- Sort WL collision counter by signature, not c14n ID (prevents unrelated triples from swapping _0/_1 suffixes)
- Remove dead code: well_known_prefix_map(), normalize_prefixes (belong in separate --normalize-prefixes PR)

Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
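As a rough illustration of the JSON and member-sorting behaviours listed in this commit, the sketch below shows one way to get byte-stable JSON and sorted enumeration members using only the standard library. The names `stable_json` and `sorted_members` are hypothetical and are not claimed to match the PR's `deterministic_json()` implementation.

```python
import json
from typing import Any, List


def stable_json(obj: Any) -> str:
    """Serialize with sorted object keys and no trailing newline."""
    # sort_keys orders object keys only; list order is left untouched.
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False)


def sorted_members(values: List[str]) -> List[str]:
    """Return enumeration members (e.g. for owl:oneOf or sh:in) in sorted order."""
    return sorted(values)


# Two logically identical contexts built in a different insertion order.
ctx_a = {"@context": {"name": "schema:name", "age": "schema:age"}}
ctx_b = {"@context": {"age": "schema:age", "name": "schema:name"}}

print(stable_json(ctx_a) == stable_json(ctx_b))
print(sorted_members(["red", "green", "blue"]))
```

Returning the string without a trailing newline matches the first review fix above: the CLI's own print call then supplies the single final newline.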
Add a --deterministic / --no-deterministic CLI flag (default off) to OWL, SHACL, JSON-LD Context, and JSON-LD generators that produces diff-stable output using Weisfeiler-Lehman structural hashing on top of the RDFC-1.0 canonicalization from upstream (linkml#3407).

Three-phase hybrid pipeline (when --deterministic is set):
1. RDFC-1.0 canonicalization (upstream) produces sequential _:c14nN IDs
2. Weisfeiler-Lehman structural hashing replaces sequential IDs with content-based _:b<sha256> hashes that remain stable when unrelated triples are added/removed
3. rdflib re-serialization recovers idiomatic Turtle (inline blank nodes, collection syntax, filtered prefixes, preserved xsd:string)

Without --deterministic, upstream's always-on RDFC-1.0 canonicalization is used directly (via canonicalize_rdf_graph).

Additional features gated behind --deterministic:
- Expression sorting (any_of/all_of/none_of/exactly_one_of) in owlgen
- Collection sorting (sh:in, sh:ignoredProperties) in shaclgen
- Permissible value sorting in owlgen and shaclgen
- JSON-LD deterministic key ordering (deterministic_json)
- JSON-LD context structured ordering (jsonldcontextgen)

Rebased on top of upstream linkml#3407 (pyoxigraph RDFC-1.0).

Refs: linkml#1847, linkml#3407
Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
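Phase 2 of the pipeline can be sketched in a few lines. The following is a simplified Weisfeiler-Lehman refinement over plain string triples, assuming blank nodes are written with a `_:` prefix; `wl_labels`, the fixed round count, and the 12-character label length are illustrative choices, not the PR's actual implementation.

```python
import hashlib
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def is_bnode(term: str) -> bool:
    return term.startswith("_:")


def wl_labels(triples: Iterable[Triple], rounds: int = 4) -> Dict[str, str]:
    """Derive a content-based label for every blank node in the graph."""
    triples = list(triples)
    bnodes = {t for s, _, o in triples for t in (s, o) if is_bnode(t)}
    color = {b: "" for b in bnodes}  # all blank nodes start indistinguishable
    for _ in range(rounds):
        new_color = {}
        for b in bnodes:
            parts = []
            for s, p, o in triples:
                # Neighbours contribute their IRI/literal, or their current
                # colour if they are blank nodes; their IDs are never used.
                if s == b:
                    parts.append(("out", p, ("B", color[o]) if is_bnode(o) else ("T", o)))
                if o == b:
                    parts.append(("in", p, ("B", color[s]) if is_bnode(s) else ("T", s)))
            signature = repr(sorted(parts))
            new_color[b] = hashlib.sha256(signature.encode("utf-8")).hexdigest()
        color = new_color
    return {b: "b" + color[b][:12] for b in bnodes}


# One restriction, as numbered by a first canonicalization run.
g1 = [
    ("ex:A", "rdfs:subClassOf", "_:c14n0"),
    ("_:c14n0", "owl:onProperty", "ex:p"),
    ("_:c14n0", "owl:someValuesFrom", "ex:B"),
]
# An unrelated restriction is added and takes over the ID _:c14n0.
g2 = [
    ("ex:C", "rdfs:subClassOf", "_:c14n0"),
    ("_:c14n0", "owl:onProperty", "ex:q"),
    ("ex:A", "rdfs:subClassOf", "_:c14n1"),
    ("_:c14n1", "owl:onProperty", "ex:p"),
    ("_:c14n1", "owl:someValuesFrom", "ex:B"),
]

labels1 = wl_labels(g1)
labels2 = wl_labels(g2)
print(labels1["_:c14n0"] == labels2["_:c14n1"])
```

Because the signature is built only from predicates, named nodes, literals, and neighbour colours, the original restriction keeps its label even though its sequential ID changed between the two graphs.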
rdflib's Turtle serializer always emits a trailing double newline. Normalize to single newline in deterministic_turtle() and the rdflib fallback path in canonicalize_rdf_graph() for consistent file endings.

Note: CLI print() still adds a newline after serialize()'s trailing newline. Callers capturing stdout should strip trailing blank lines (e.g. via sed).

Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
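The normalization described in this commit, and the extra newline that print() adds on top of it, can be shown with a small sketch. `normalize_trailing_newline` is a hypothetical helper name, and the Turtle string is hand-written rather than produced by rdflib.

```python
import io


def normalize_trailing_newline(text: str) -> str:
    """Collapse any run of trailing newlines into exactly one."""
    return text.rstrip("\n") + "\n"


turtle = "ex:A a owl:Class .\n\n"  # stand-in for serializer output
normalized = normalize_trailing_newline(turtle)

# print() appends its own newline, so stdout ends with a blank line again.
printed = io.StringIO()
print(normalized, file=printed)

# Writing the string directly keeps the single trailing newline.
written = io.StringIO()
written.write(normalized)

print(repr(printed.getvalue()))
print(repr(written.getvalue()))
```

A caller that controls the output stream can avoid the stripping step mentioned in the note by writing the string directly, or by calling print with `end=""`.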
Summary
Add a --deterministic flag to OWL, SHACL, and JSON-LD generators that produces byte-identical output across invocations, eliminating spurious diffs in version-controlled artifacts.

This is a review-ready fork of the approach discussed in upstream linkml/linkml#3295, rebuilt to address maintainer feedback.
Problem
Generated OWL and SHACL artifacts contain blank nodes whose identifiers change between runs due to Python dict ordering and rdflib serialization non-determinism. This makes version-controlled artifacts show massive diffs even when the underlying schema change is trivial.
Solution
Three-Phase Hybrid Pipeline (deterministic_turtle())
1. RDFC-1.0 canonicalization produces sequential _:c14nN identifiers.
2. Weisfeiler-Lehman structural hashing replaces the sequential _:c14nN identifiers with content-based hashes. These depend only on predicate IRIs, literal values, and named-node IRIs, not on blank-node numbering, so adding or removing a triple only affects directly involved blank nodes.
3. rdflib re-serialization loads the relabelled triples into an rdflib Graph and serializes with rdflib's native Turtle writer. This recovers idiomatic Turtle features that pyoxigraph cannot emit:
   - inline blank nodes ([ … ]) for singly-referenced blank nodes (Turtle §2.7)
   - collection syntax (( … )) for rdf:List chains (Turtle §2.8)

All triples from the source graph are preserved: the hybrid step only changes syntactic form, never semantic content. Plain string literals have their xsd:string datatype stripped per RDF 1.1 §2.5.1 (simple literals are syntactic sugar for xsd:string).

Additional Features
Collection sorting (gated behind --deterministic): owl:oneOf, sh:in, sh:ignoredProperties items are sorted when the flag is set
deterministic_json():

Benchmark Results
Tested on the Gaia-X Trust Framework ontology (~68K OWL / ~165K SHACL triples) and schema.org (~18K triples):
Semantic Equivalence
rdflib.compare.isomorphic(): True, True, True

Byte-Level Stability
Diff Quality (Signal-to-Noise Ratio)
Controlled mutations on a LinkML schema:
Output Size (Gaia-X Trust Framework)
The SHACL 18× size reduction comes from replacing 157,552 named _:bHASH blank nodes with inline [ … ] syntax and 77,358 explicit rdf:first/rdf:rest triples with ( … ) collection shorthand, matching the upstream Gaia-X registry convention.

Performance
Dependency
pyoxigraph >= 0.4.0 is imported lazily only when --deterministic is used. It is not a core dependency, avoiding conflict with morph-kgc's pin on pyoxigraph < 0.4.0. Tests skip gracefully when pyoxigraph >= 0.4.0 is unavailable.

Relationship to upstream linkml#3295
The original PR was closed after maintainer feedback requesting an established canonicalization standard. This PR:
Testing
test_deterministic_output.py: 27 tests (stability, sorting, prefix format, enum ordering, kitchen_sink)
test_deterministic_benchmark.py: 10 local + 4 network tests (schema.org equivalence, mutation diff quality, signal-to-noise assertions)

Benchmark Test Assertions
The benchmark enforces quantitative properties:
References