
Add preservation-mirror fields to DataReleaseManifest#317

Merged
MaxGhenis merged 2 commits into main from add-preservation-mirror-fields
May 7, 2026

@MaxGhenis
Contributor

Summary

  • New `PreservationMirror` model (`kind`, `url`, optional `doi` / `sha256` / `deposited_at`).
  • New `preservation_mirrors` list on each `DataReleaseArtifact` — per-artifact preservation deposits.
  • New `preservation_dois` list on `DataReleaseManifest` — release-level DOIs (Zenodo mints one per deposit covering all files).
  • All fields have defaults; verified with a backwards-compatibility test that loads a legacy manifest JSON blob.
  • 9 new tests; full non-integration suite green (444 passed).
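The shape of the change listed above can be sketched as follows. This is a minimal illustration, not the repository's code: it uses standard-library dataclasses in place of whatever model base class the project uses, and `path`, `version`, the list-shaped `artifacts`, and the `manifest_from_dict` helper are hypothetical stand-ins for fields and loaders that already exist. Only `PreservationMirror`, `preservation_mirrors`, `preservation_dois`, and the mirror field names come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class PreservationMirror:
    kind: str  # e.g. "zenodo" or "archival_gcs"
    url: str
    doi: Optional[str] = None
    sha256: Optional[str] = None
    deposited_at: Optional[str] = None  # assumed here to be an ISO 8601 string


@dataclass
class DataReleaseArtifact:
    path: str  # hypothetical pre-existing field
    # New field: defaults to empty so older manifests still load.
    preservation_mirrors: list[PreservationMirror] = field(default_factory=list)


@dataclass
class DataReleaseManifest:
    version: str  # hypothetical pre-existing field
    artifacts: list[DataReleaseArtifact] = field(default_factory=list)
    # New field: release-level DOIs, empty by default.
    preservation_dois: list[str] = field(default_factory=list)


def manifest_from_dict(blob: dict[str, Any]) -> DataReleaseManifest:
    """Hypothetical loader: missing preservation keys fall back to defaults."""
    artifacts = [
        DataReleaseArtifact(
            path=item["path"],
            preservation_mirrors=[
                PreservationMirror(**mirror)
                for mirror in item.get("preservation_mirrors", [])
            ],
        )
        for item in blob.get("artifacts", [])
    ]
    return DataReleaseManifest(
        version=blob["version"],
        artifacts=artifacts,
        preservation_dois=list(blob.get("preservation_dois", [])),
    )


# A legacy blob with no preservation keys still loads.
legacy = manifest_from_dict({"version": "1.0.0", "artifacts": [{"path": "data.h5"}]})

# A newer blob carrying mirror metadata (all values are placeholders).
mirrored = manifest_from_dict(
    {
        "version": "1.1.0",
        "artifacts": [
            {
                "path": "data.h5",
                "preservation_mirrors": [
                    {
                        "kind": "zenodo",
                        "url": "https://zenodo.example/files/data.h5",
                        "doi": "10.5281/zenodo.0000000",
                    }
                ],
            }
        ],
        "preservation_dois": ["10.5281/zenodo.0000000"],
    }
)
```

Because every new field has a default, the legacy blob produces empty lists rather than a validation error, which is the property the backwards-compatibility test is described as checking.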

Why

This is the data contract for the Zenodo-mirror workstream in PolicyEngine/policyengine-us-data#810. The us-data Modal build will deposit each certified h5 to Zenodo and populate these fields when emitting the `DataReleaseManifest` to HuggingFace. The TRACE TRO emission helpers will then read these fields to record durable fallback locations in every TRO.

Motivated by the 2026-04-21 meeting with Lars Vilhuber (AEA Data Editor): HuggingFace doesn't publish a preservation commitment, so a TRO citation URL that resolves only through HF can 404 decades from now. Zenodo (CERN / OpenAIRE-operated, DOI-minting) is the reference preservation-grade host Lars pointed at.
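As a rough illustration of how a consumer such as the TRO helpers might turn these fields into fallback locations, the sketch below walks a manifest that has already been parsed into plain dictionaries. The function name `fallback_locations`, the list-shaped `artifacts` with a `path` key, and the DOI-first ordering are assumptions made for the example; the actual TRACE helpers are not shown in this PR.

```python
from __future__ import annotations

from typing import Any


def fallback_locations(manifest: dict[str, Any]) -> dict[str, list[str]]:
    """Map each artifact path to its mirror locations, DOI link first."""
    locations: dict[str, list[str]] = {}
    for artifact in manifest.get("artifacts", []):
        urls: list[str] = []
        for mirror in artifact.get("preservation_mirrors", []):
            if mirror.get("doi"):
                # doi.org resolves a DOI to its current landing page.
                urls.append(f"https://doi.org/{mirror['doi']}")
            urls.append(mirror["url"])
        locations[artifact["path"]] = urls
    return locations


# Placeholder values throughout; "b.h5" has no mirrors recorded.
manifest = {
    "artifacts": [
        {
            "path": "a.h5",
            "preservation_mirrors": [
                {
                    "kind": "zenodo",
                    "url": "https://zenodo.example/files/a.h5",
                    "doi": "10.5281/zenodo.0000000",
                }
            ],
        },
        {"path": "b.h5"},
    ]
}
locations = fallback_locations(manifest)
```

An artifact without mirrors maps to an empty list, so a caller can distinguish "no durable copy recorded" from a missing artifact.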

Test plan

  • `uv run pytest tests/test_preservation_mirror.py -v` → 9/9 pass
  • `uv run pytest tests/ --ignore=tests/integration` → 444 pass, no regressions
  • `uv run ruff format` clean

Related

🤖 Generated with Claude Code

Extends the data-release manifest model to carry optional
preservation-grade mirror metadata:

- New PreservationMirror model with kind ('zenodo', 'archival_gcs',
  etc.), url, and optional doi / sha256 / deposited_at fields.
- New preservation_mirrors list on each DataReleaseArtifact, for
  per-artifact mirrors (Zenodo file deposits, GCS archival copies).
- New preservation_dois list on DataReleaseManifest for release-level
  DOIs (Zenodo mints one per deposit covering all files).

All new fields have defaults and the existing manifest JSON schema
continues to validate unchanged — verified with a backwards-
compatibility test that loads a legacy manifest JSON blob.

This is the data contract for the Zenodo-mirror workstream scoped in
PolicyEngine/policyengine-us-data#810: the us-data Modal build will
deposit each certified h5 to Zenodo and populate these fields when
emitting the DataReleaseManifest to HuggingFace. The TRACE TRO
emission helpers will then read preservation_mirrors / preservation_dois
to record durable fallback locations in every TRO they build.

Motivation (2026-04-21 meeting with Lars Vilhuber / AEA Data Editor):
HuggingFace doesn't publish a preservation commitment, so a TRO
citation URL that resolves only through HF can 404 decades from now.
Zenodo (CERN / OpenAIRE-operated, DOI-minting) is the reference
preservation-grade host Lars pointed at.

9 new tests; full non-integration suite green (444 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit 35e958a into main May 7, 2026
11 checks passed
@MaxGhenis MaxGhenis deleted the add-preservation-mirror-fields branch May 7, 2026 11:41
MaxGhenis added a commit to PolicyEngine/policyengine-us-data that referenced this pull request May 7, 2026
Groundwork for the Zenodo upload workstream (issue #810): durable
mirror of each certified microdata release to a preservation-grade
host (CERN/OpenAIRE-operated, DOI-minting) so TRO citation URLs stay
verifiable decades from now even if HuggingFace changes its hosting.

- New policyengine_us_data/utils/zenodo_client.py: typed wrapper
  around the Zenodo REST API. One public function,
  create_and_publish_deposit(), handles the four-step Zenodo flow
  (create deposit, upload files, set metadata, publish) and returns
  the version + concept DOIs plus per-file download URLs and
  checksums. Env-var gated: ZENODO_ACCESS_TOKEN must be set or the
  function raises ZenodoNotConfigured, which callers should treat
  as 'preservation mirroring disabled for this release' rather than
  a failure.
- Extends build_release_manifest() with two new kwargs:
  preservation_mirrors_by_artifact (per-artifact Zenodo or other
  mirror metadata) and preservation_dois (release-level Zenodo DOIs).
  Populates the fields introduced in PolicyEngine/policyengine.py#317
  on the emitted manifest JSON.
- 11 zenodo-client tests (happy path, missing token, missing file,
  API error wrapping, metadata payload serialization, env-var
  handling). 3 release-manifest tests (no fields when not provided,
  per-artifact mirror preserved, empty list treated as absent).
- Full unit suite green (853 passed, 3 pre-existing skips).

Modal-build wiring is deferred to a follow-up PR that requires a real
Zenodo access token and a sandbox test round-trip. This commit is
the contract + client + tests, with no production behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
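The env-var gate and the "empty list treated as absent" behaviour described in the commit message above can be sketched roughly as follows. The bodies, argument shapes, and the `release` wrapper are assumptions for illustration only; the real `create_and_publish_deposit()` and `build_release_manifest()` signatures are not shown on this page, and the stand-in client below models only the token check, not the Zenodo API calls.

```python
from __future__ import annotations

import os
from typing import Any, Optional


class ZenodoNotConfigured(RuntimeError):
    """Raised when ZENODO_ACCESS_TOKEN is not set."""


def create_and_publish_deposit(paths: list[str]) -> dict[str, Any]:
    # Stand-in for the real client: only the token gate is modelled.
    if not os.environ.get("ZENODO_ACCESS_TOKEN"):
        raise ZenodoNotConfigured("ZENODO_ACCESS_TOKEN is not set")
    raise NotImplementedError("network calls are omitted from this sketch")


def build_release_manifest(
    artifacts: list[dict[str, Any]],
    preservation_mirrors_by_artifact: Optional[dict[str, list[dict[str, Any]]]] = None,
    preservation_dois: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Hypothetical shape: emit the new keys only when there is data for them."""
    mirrors_by_path = preservation_mirrors_by_artifact or {}
    entries = []
    for artifact in artifacts:
        entry = dict(artifact)
        mirrors = mirrors_by_path.get(artifact["path"])
        if mirrors:  # None and [] are both treated as absent
            entry["preservation_mirrors"] = mirrors
        entries.append(entry)
    manifest: dict[str, Any] = {"artifacts": entries}
    if preservation_dois:
        manifest["preservation_dois"] = list(preservation_dois)
    return manifest


def release(artifacts: list[dict[str, Any]]) -> dict[str, Any]:
    """Treat a missing token as 'mirroring disabled', not as a failure."""
    try:
        deposit = create_and_publish_deposit([a["path"] for a in artifacts])
    except ZenodoNotConfigured:
        return build_release_manifest(artifacts)
    # "mirrors_by_path" and "dois" are hypothetical keys on the deposit result.
    return build_release_manifest(
        artifacts,
        preservation_mirrors_by_artifact=deposit["mirrors_by_path"],
        preservation_dois=deposit["dois"],
    )
```

With no token in the environment, `release()` returns a manifest without any preservation keys, which matches the stated goal of no production behaviour change until the Modal wiring lands.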