[DE-7784] Migrate sdk image deduplication to async mode by edwinpav · Pull Request #459 · scaleapi/nucleus-python-client

edwinpav · 2026-05-06T17:59:46Z

Summary

Makes SDK image deduplication async-only: Dataset.deduplicate() and Dataset.deduplicate_by_ids() now return DeduplicationJob, with job.result() used to retrieve DeduplicationResult.
Removes SDK sync dedup support and updates dedup docs, tests, and changelog entries for the 0.18.1 release.
Refreshes related package metadata/docs dependencies needed by the 0.18.1 SDK update.

Greptile Summary

Dataset.deduplicate() and Dataset.deduplicate_by_ids() now POST to deduplicate_async and return a DeduplicationJob; callers retrieve results via job.result(), which blocks via sleep_until_complete then validates all required fields with _require_fields before constructing DeduplicationResult.
Sync deduplication path and its dataset_image_sync fixture/tests are removed; integration tests are migrated to the async pattern and two new unit tests cover routing and job-type population via FakeDeduplicationClient.
Dev doc dependencies updated: Sphinx >=4,<5 → >=7,<9, sphinx-autoapi ^1.8 → >=3,<4, docutils 0.17 → 0.21, furo refreshed to 2025.12.19.

Confidence Score: 5/5

PR is safe to merge; the async migration follows established AsyncJob patterns and properly addresses the previous silent-fallback concern.

No P0 or P1 issues found. The new DeduplicationJob correctly uses _require_fields to validate all result and stats fields before access, resolving the prior review comment. The implementation mirrors the existing EmbeddingsExportJob pattern exactly. Tests cover routing, job-type, and end-to-end async result retrieval.

No files require special attention.

Important Files Changed

Filename	Overview
nucleus/deduplication.py	New DeduplicationJob class extending AsyncJob, with _require_fields guard replacing the previous silent stats.get("threshold", 0) fallback; all required result/stats fields are validated before access.
nucleus/dataset.py	Both deduplicate() and deduplicate_by_ids() now POST to deduplicate_async and return DeduplicationJob.from_json; clean removal of inline DeduplicationResult construction.
tests/test_deduplication.py	Sync dataset fixture and sync tests removed; all integration tests migrated to use _deduplication_result() helper; two new unit tests verify async route and job type via FakeDeduplicationClient.
nucleus/init.py	DeduplicationJob added to all and import; straightforward public API surface update.
pyproject.toml	Version bumped to 0.18.1; Sphinx constraint widened to >=7,<9 and sphinx-autoapi to >=3,<4 to support newer doc toolchain.

Comments Outside Diff (1)

nucleus/deduplication.py, line 271-280 (link)

Silent threshold fallback may produce misleading stats

stats.get("threshold", 0) silently returns 0 if the server's job-status message omits the threshold field, while stats["original_count"] and stats["deduplicated_count"] will raise a raw KeyError if missing. The previous sync path always echoed back the caller-supplied threshold, so result.stats.threshold == 64 (as asserted in integration tests) now depends entirely on the server including the field. If it ever stops doing so the returned stats will silently report the wrong threshold (e.g. test_deduplicate_threshold_max would see 0 instead of 64) rather than failing fast with a clear error.

Prompt To Fix With AI

This is a comment left during a code review.
Path: nucleus/deduplication.py
Line: 271-280

Comment:
**Silent threshold fallback may produce misleading stats**

`stats.get("threshold", 0)` silently returns `0` if the server's job-status message omits the `threshold` field, while `stats["original_count"]` and `stats["deduplicated_count"]` will raise a raw `KeyError` if missing. The previous sync path always echoed back the caller-supplied threshold, so `result.stats.threshold == 64` (as asserted in integration tests) now depends entirely on the server including the field. If it ever stops doing so the returned stats will silently report the wrong threshold (e.g. `test_deduplicate_threshold_max` would see `0` instead of `64`) rather than failing fast with a clear error.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (2): Last reviewed commit: "Address greptile" | Re-trigger Greptile}

socket-security · 2026-05-06T18:01:11Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Package	Supply Chain Security	License
docutils@0.17.1 ⏵ 0.21.2	⁺¹
sphinx@4.5.0 ⏵ 8.1.3	⁺¹²	⁺³¹
sphinx@4.5.0 ⏵ 8.2.3	⁺¹²	⁺³¹
roman-numerals-py@4.1.0
accessible-pygments@0.0.5
roman-numerals@4.1.0

View full report

luke-e-schaefer

assuming nobody using the existing sync endpoints via the client yet right

edwinpav · 2026-05-06T18:43:32Z

assuming nobody using the existing sync endpoints via the client yet right

Not that I know of externally. Internally, I'll let ML know about the change. For external, I'll tell Sam to point customers to the updated doc but they won't even have this version of the SDK until ECS updates their venv to be python3.10+.

In other words, until a user upgrades specifically to this SDK version, they won't reach issues since the sync REST endpoints are still in monorepo

edwinpav added 5 commits May 6, 2026 13:58

Add support for async dedup

4e33e81

Update sphinx package versions

552797f

Update sdk version to 0.18.1

612e3f3

Remove support for sync dedup in sdk

938d311

Updater CHANGELOG

b3f783d

edwinpav self-assigned this May 6, 2026

edwinpav marked this pull request as ready for review May 6, 2026 18:11

edwinpav changed the title ~~[DE-7784] Make SDK image deduplication async~~ [DE-7784] Migrate sdk image deduplication to async mode May 6, 2026

Address greptile

e48300a

edwinpav requested a review from a team May 6, 2026 18:38

luke-e-schaefer approved these changes May 6, 2026

View reviewed changes

edwinpav merged commit 3321deb into master May 6, 2026
9 checks passed

edwinpav deleted the edwinpav/async-dedup branch May 6, 2026 18:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DE-7784] Migrate sdk image deduplication to async mode#459

[DE-7784] Migrate sdk image deduplication to async mode#459
edwinpav merged 6 commits intomasterfrom
edwinpav/async-dedup

edwinpav commented May 6, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

socket-security Bot commented May 6, 2026 •

edited

Loading

Uh oh!

luke-e-schaefer left a comment

Uh oh!

edwinpav commented May 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

edwinpav commented May 6, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Comments Outside Diff (1)

Uh oh!

socket-security Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

luke-e-schaefer left a comment

Choose a reason for hiding this comment

Uh oh!

edwinpav commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

edwinpav commented May 6, 2026 •

edited by greptile-apps Bot

Loading

socket-security Bot commented May 6, 2026 •

edited

Loading

edwinpav commented May 6, 2026 •

edited

Loading