[DE-7784] Migrate sdk image deduplication to async mode#459
Merged
Conversation
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
luke-e-schaefer
approved these changes
May 6, 2026
luke-e-schaefer
left a comment
There was a problem hiding this comment.
assuming nobody using the existing sync endpoints via the client yet right
Contributor
Author
Not that I know of externally. Internally, I'll let ML know about the change. For external, I'll tell Sam to point customers to the updated doc but they won't even have this version of the SDK until ECS updates their venv to be python3.10+. In other words, until a user upgrades specifically to this SDK version, they won't reach issues since the sync REST endpoints are still in monorepo |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Dataset.deduplicate()andDataset.deduplicate_by_ids()now returnDeduplicationJob, withjob.result()used to retrieveDeduplicationResult.Greptile Summary
Dataset.deduplicate()andDataset.deduplicate_by_ids()now POST todeduplicate_asyncand return aDeduplicationJob; callers retrieve results viajob.result(), which blocks viasleep_until_completethen validates all required fields with_require_fieldsbefore constructingDeduplicationResult.dataset_image_syncfixture/tests are removed; integration tests are migrated to the async pattern and two new unit tests cover routing and job-type population viaFakeDeduplicationClient.>=4,<5→>=7,<9, sphinx-autoapi^1.8→>=3,<4, docutils0.17→0.21, furo refreshed to 2025.12.19.Confidence Score: 5/5
PR is safe to merge; the async migration follows established AsyncJob patterns and properly addresses the previous silent-fallback concern.
No P0 or P1 issues found. The new DeduplicationJob correctly uses _require_fields to validate all result and stats fields before access, resolving the prior review comment. The implementation mirrors the existing EmbeddingsExportJob pattern exactly. Tests cover routing, job-type, and end-to-end async result retrieval.
No files require special attention.
Important Files Changed
Comments Outside Diff (1)
nucleus/deduplication.py, line 271-280 (link)stats.get("threshold", 0)silently returns0if the server's job-status message omits thethresholdfield, whilestats["original_count"]andstats["deduplicated_count"]will raise a rawKeyErrorif missing. The previous sync path always echoed back the caller-supplied threshold, soresult.stats.threshold == 64(as asserted in integration tests) now depends entirely on the server including the field. If it ever stops doing so the returned stats will silently report the wrong threshold (e.g.test_deduplicate_threshold_maxwould see0instead of64) rather than failing fast with a clear error.Prompt To Fix With AI
Reviews (2): Last reviewed commit: "Address greptile" | Re-trigger Greptile