Skip to content

[DE-7784] Migrate sdk image deduplication to async mode#459

Merged
edwinpav merged 6 commits intomasterfrom
edwinpav/async-dedup
May 6, 2026
Merged

[DE-7784] Migrate sdk image deduplication to async mode#459
edwinpav merged 6 commits intomasterfrom
edwinpav/async-dedup

Conversation

@edwinpav
Copy link
Copy Markdown
Contributor

@edwinpav edwinpav commented May 6, 2026

Summary

  • Makes SDK image deduplication async-only: Dataset.deduplicate() and Dataset.deduplicate_by_ids() now return DeduplicationJob, with job.result() used to retrieve DeduplicationResult.
  • Removes SDK sync dedup support and updates dedup docs, tests, and changelog entries for the 0.18.1 release.
  • Refreshes related package metadata/docs dependencies needed by the 0.18.1 SDK update.

Greptile Summary

  • Dataset.deduplicate() and Dataset.deduplicate_by_ids() now POST to deduplicate_async and return a DeduplicationJob; callers retrieve results via job.result(), which blocks via sleep_until_complete then validates all required fields with _require_fields before constructing DeduplicationResult.
  • Sync deduplication path and its dataset_image_sync fixture/tests are removed; integration tests are migrated to the async pattern and two new unit tests cover routing and job-type population via FakeDeduplicationClient.
  • Dev doc dependencies updated: Sphinx >=4,<5>=7,<9, sphinx-autoapi ^1.8>=3,<4, docutils 0.170.21, furo refreshed to 2025.12.19.

Confidence Score: 5/5

PR is safe to merge; the async migration follows established AsyncJob patterns and properly addresses the previous silent-fallback concern.

No P0 or P1 issues found. The new DeduplicationJob correctly uses _require_fields to validate all result and stats fields before access, resolving the prior review comment. The implementation mirrors the existing EmbeddingsExportJob pattern exactly. Tests cover routing, job-type, and end-to-end async result retrieval.

No files require special attention.

Important Files Changed

Filename Overview
nucleus/deduplication.py New DeduplicationJob class extending AsyncJob, with _require_fields guard replacing the previous silent stats.get("threshold", 0) fallback; all required result/stats fields are validated before access.
nucleus/dataset.py Both deduplicate() and deduplicate_by_ids() now POST to deduplicate_async and return DeduplicationJob.from_json; clean removal of inline DeduplicationResult construction.
tests/test_deduplication.py Sync dataset fixture and sync tests removed; all integration tests migrated to use _deduplication_result() helper; two new unit tests verify async route and job type via FakeDeduplicationClient.
nucleus/init.py DeduplicationJob added to all and import; straightforward public API surface update.
pyproject.toml Version bumped to 0.18.1; Sphinx constraint widened to >=7,<9 and sphinx-autoapi to >=3,<4 to support newer doc toolchain.

Comments Outside Diff (1)

  1. nucleus/deduplication.py, line 271-280 (link)

    P2 Silent threshold fallback may produce misleading stats

    stats.get("threshold", 0) silently returns 0 if the server's job-status message omits the threshold field, while stats["original_count"] and stats["deduplicated_count"] will raise a raw KeyError if missing. The previous sync path always echoed back the caller-supplied threshold, so result.stats.threshold == 64 (as asserted in integration tests) now depends entirely on the server including the field. If it ever stops doing so the returned stats will silently report the wrong threshold (e.g. test_deduplicate_threshold_max would see 0 instead of 64) rather than failing fast with a clear error.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: nucleus/deduplication.py
    Line: 271-280
    
    Comment:
    **Silent threshold fallback may produce misleading stats**
    
    `stats.get("threshold", 0)` silently returns `0` if the server's job-status message omits the `threshold` field, while `stats["original_count"]` and `stats["deduplicated_count"]` will raise a raw `KeyError` if missing. The previous sync path always echoed back the caller-supplied threshold, so `result.stats.threshold == 64` (as asserted in integration tests) now depends entirely on the server including the field. If it ever stops doing so the returned stats will silently report the wrong threshold (e.g. `test_deduplicate_threshold_max` would see `0` instead of `64`) rather than failing fast with a clear error.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

Reviews (2): Last reviewed commit: "Address greptile" | Re-trigger Greptile

@edwinpav edwinpav self-assigned this May 6, 2026
@socket-security
Copy link
Copy Markdown

socket-security Bot commented May 6, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updateddocutils@​0.17.1 ⏵ 0.21.294 +110010010070
Updatedsphinx@​4.5.0 ⏵ 8.1.386 +12100100100100 +31
Updatedsphinx@​4.5.0 ⏵ 8.2.386 +12100100100100 +31
Addedroman-numerals-py@​4.1.010010090100100
Addedaccessible-pygments@​0.0.5100100100100100
Addedroman-numerals@​4.1.0100100100100100

View full report

@edwinpav edwinpav marked this pull request as ready for review May 6, 2026 18:11
@edwinpav edwinpav changed the title [DE-7784] Make SDK image deduplication async [DE-7784] Migrate sdk image deduplication to async mode May 6, 2026
@edwinpav edwinpav requested a review from a team May 6, 2026 18:38
Copy link
Copy Markdown

@luke-e-schaefer luke-e-schaefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming nobody using the existing sync endpoints via the client yet right

@edwinpav
Copy link
Copy Markdown
Contributor Author

edwinpav commented May 6, 2026

assuming nobody using the existing sync endpoints via the client yet right

Not that I know of externally. Internally, I'll let ML know about the change. For external, I'll tell Sam to point customers to the updated doc but they won't even have this version of the SDK until ECS updates their venv to be python3.10+.

In other words, until a user upgrades specifically to this SDK version, they won't reach issues since the sync REST endpoints are still in monorepo

@edwinpav edwinpav merged commit 3321deb into master May 6, 2026
9 checks passed
@edwinpav edwinpav deleted the edwinpav/async-dedup branch May 6, 2026 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants