
cuda.core: convert peer_accessible_by to a live MutableSet view#2018

Merged
leofang merged 13 commits into NVIDIA:main from Andy-Jost:ajost/peer-accessible-by-set-proxy
May 6, 2026

Conversation


@Andy-Jost Andy-Jost commented May 4, 2026

Summary

DeviceMemoryResource.peer_accessible_by previously returned a sorted tuple[int, ...] backed by a Python-level cache. This is fragile in two ways: the cache can diverge from driver state across multiple wrappers around the same memory pool (#1720), and tuple semantics force callers into bespoke splice/replace patterns instead of standard set operations. This PR replaces the property with a live driver-backed collections.abc.MutableSet view, matching the proxy patterns established by AdjacencySetProxy (graphs) and the in-flight AccessedBySet (managed-memory advice, #1775).

This is a breaking change, captured in the v1.0.0 release notes.

Changes

New: PeerAccessibleBySetProxy

A MutableSet subclass living in cuda_core/cuda/core/_memory/_peer_access_utils.pyx.

  • Reads (__contains__, __iter__, __len__) call cuMemPoolGetAccess.
  • Writes (add, discard, and bulk ops) call cuMemPoolSetAccess.
  • Iteration yields Device objects.
  • add, discard, __contains__ accept Device | int.
  • The owner device is silently filtered (matching the existing planner contract).
  • All bulk operations (update, |=, &=, -=, ^=, clear) issue exactly one cuMemPoolSetAccess call. This matters: peer-access transitions can take seconds per pool because every existing memory mapping is updated, so coalescing into a single driver call lets the toolkit handle the mappings in parallel.

The proxy is constructed fresh on every property access. There is nothing to cache or pickle.
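The live-view pattern described above can be sketched in plain Python. Everything below is a hypothetical stand-in: `FakePool` replaces the driver (`cuMemPoolGetAccess`/`cuMemPoolSetAccess`), and none of these names exist in cuda.core — the point is only that reads always hit the backing store and bulk writes coalesce into one call.

```python
from collections.abc import MutableSet

class LiveSetView(MutableSet):
    """Hypothetical live view over a shared backing store (stand-in for
    PeerAccessibleBySetProxy; no caching, every read queries the store)."""

    def __init__(self, backing):
        self._backing = backing  # stands in for the memory pool

    # Reads query the backing store every time -- nothing is cached.
    def __contains__(self, item):
        return item in self._backing.query()

    def __iter__(self):
        return iter(sorted(self._backing.query()))

    def __len__(self):
        return len(self._backing.query())

    # Single-element writes.
    def add(self, item):
        if item not in self._backing.query():
            self._backing.apply(enable=[item], disable=[])

    def discard(self, item):
        if item in self._backing.query():
            self._backing.apply(enable=[], disable=[item])

    # Bulk ops coalesce the whole diff into one backing-store call.
    def update(self, items):
        new = set(items) - self._backing.query()
        if new:
            self._backing.apply(enable=sorted(new), disable=[])


class FakePool:
    """Counts apply() calls, standing in for cuMemPoolSetAccess."""

    def __init__(self):
        self._ids = set()
        self.apply_calls = 0

    def query(self):
        return set(self._ids)

    def apply(self, enable, disable):
        self.apply_calls += 1
        self._ids |= set(enable)
        self._ids -= set(disable)
```

Because both views share the pool, a change made through one is immediately visible through the other with no refresh step — the shared-wrapper consistency property that motivated dropping the cache.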

Cache removal

The _peer_accessible_by field on DeviceMemoryResource is dropped, along with its initializations in __cinit__, _DMR_init, and from_allocation_handle. The owned/non-owned read split is gone — every read path now queries the driver directly. This eliminates the bug class fixed in #1720 and simplifies the implementation.

Module consolidation

All peer-access internals now live in _peer_access_utils.pyx (promoted from a .py). The cdef inline driver helpers (_query_peer_access_ids, _peer_access_includes, _set_pool_access) and the cpdef replace_peer_accessible_by setter helper moved out of _device_memory_resource.pyx. DeviceMemoryResource now exposes only the public peer_accessible_by property: the getter returns PeerAccessibleBySetProxy(self) and the setter delegates to replace_peer_accessible_by(self, devices). This keeps the DMR surface focused on its memory-management role.

Property setter

mr.peer_accessible_by = [...] is preserved and unchanged in behavior. It still does a single batched cuMemPoolSetAccess via the same shared plan_peer_access_update path the proxy uses for bulk ops.

_query_peer_access_ids GIL optimization

The driver loop that enumerates peer device IDs now runs inside a single nogil block instead of acquiring/releasing the GIL once per device. Per-call data is collected into a libcpp.vector[int], and because range(total) ascends the result is already sorted (so the trailing sorted() is gone). This is a local performance tweak with no behavior change.

Test coverage

The existing integration tests (test_memory_peer_access.py, memory_ipc/test_peer_access.py) migrated from tuple equality to set-of-Device equality. Eight new tests pin the proxy's contract end-to-end, all using the existing mempool_device_x2 fixture so they run on CI:

  • MutableSet conformance. A new assert_single_member_mutable_set_interface(subject, member, non_member) helper exercises every MutableSet method against subjects whose backing store admits at most one insertable element (the peer-access proxy can hold at most {dev1} on a 2-GPU box). The original assert_mutable_set_interface(subject, items) is unchanged for normal multi-element subjects (AdjacencySetProxy).
  • Device/int interchangeability on add/discard/__contains__.
  • Owner-filtering contract on every write path (silent no-op).
  • Error paths: add(out_of_range) and add(non_coercible) raise; discard/__contains__ swallow the same inputs; remove(non_member) raises KeyError.
  • Live driver view: a proxy obtained before another wrapper modifies the pool reflects the change with no refresh step.
  • Iteration order is ascending by device_id; elements are Device instances; __repr__ shape; getter return type.
  • Single-call batching via a monkeypatch spy on _set_pool_access (the thin Python-visible helper extracted from _apply_peer_access_diff purely so tests can spy on the actual driver call). Every bulk op (|=/&=/-=/^=/update/difference_update/clear) and the property setter is asserted to issue exactly one cuMemPoolSetAccess, zero on no-op. This includes the dmr.peer_accessible_by |= {...} augmented-assignment-on-property pattern, where the empty-diff short-circuit prevents a second driver call from the trailing setter write.
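The batching spy can be illustrated with stand-ins (the names below mirror the PR's helpers but nothing here is the real cuda.core code): the diff applier short-circuits empty diffs, and because it looks the helper up by name at call time, rebinding the module-level name — which is what `monkeypatch.setattr` does — intercepts every real call.

```python
driver_calls = []  # records each simulated cuMemPoolSetAccess invocation

def _set_pool_access(pool, enable, disable):
    """Stand-in for the thin wrapper around the actual driver call."""
    driver_calls.append((tuple(enable), tuple(disable)))
    pool.update(enable)
    pool.difference_update(disable)

def _apply_peer_access_diff(pool, enable, disable):
    """Empty-diff short-circuit; otherwise exactly one batched call."""
    if not enable and not disable:
        return  # no-op never reaches the "driver"
    # The name is resolved through module globals at call time, so a test
    # that rebinds _set_pool_access (e.g. via monkeypatch) sees this call.
    _set_pool_access(pool, enable, disable)

pool = set()
_apply_peer_access_diff(pool, [1, 2], [])  # one batched call for two peers
_apply_peer_access_diff(pool, [], [])      # empty diff: zero driver calls
```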

Breaking change

DeviceMemoryResource.peer_accessible_by no longer returns a tuple[int, ...]. Callers must update:

  • Tuple comparisons (mr.peer_accessible_by == (0,)) -> set comparisons (mr.peer_accessible_by == {Device(0)}).
  • Iteration that expected ints -> iteration over Device objects (use [d.device_id for d in mr.peer_accessible_by] if int IDs are needed).

Setter usage (mr.peer_accessible_by = [...]) is unchanged.
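A before/after sketch of the caller migration, using a minimal frozen-dataclass stand-in for `Device` (the real class is much richer; only `device_id`-based equality and hashing matter here):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    """Stand-in: hashable and comparable by device_id, like the real Device."""
    device_id: int

# Before v1.0.0 the property returned a sorted tuple of ints:
old_value = (0,)
assert old_value == (0,)

# After, it is a set-like view of Device objects:
new_value = {Device(0)}
assert new_value == {Device(0)}
assert Device(0) in new_value

# Recovering plain int IDs from the new view when needed:
ids = sorted(d.device_id for d in new_value)
assert ids == [0]
```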

Test plan

  • Existing test_memory_peer_access.py tests pass after migration to set semantics.
  • Existing test_memory_peer_access_utils.py tests still pass (plan_peer_access_update and normalize_peer_access_targets are unchanged).
  • Existing memory_ipc/test_peer_access.py tests pass (IPC-enabled pools, parent-child IPC scenarios).
  • Full cuda_core test suite passes on a single-GPU box (2931 passed, 212 skipped — peer-access tests gated on mempool_device_x2/x3).
  • All new peer-access tests pass on a 2-GPU box.
  • pre-commit run --all-files passes (ruff, format, SPDX, cython-lint, RST checks).

Related work

DeviceMemoryResource.peer_accessible_by previously returned a sorted
tuple[int, ...] backed by a Python-level cache, which was prone to
divergence from driver state across multiple wrappers around the same
memory pool. The setter accepted Device | int and emitted a single
batched cuMemPoolSetAccess covering the diff against the cache.

This commit replaces the property with a live driver-backed view:

- Adds PeerAccessibleBySetProxy in _memory/_peer_access_utils.py, a
  collections.abc.MutableSet whose reads call cuMemPoolGetAccess and
  whose writes call cuMemPoolSetAccess. Iteration yields Device
  objects; add, discard, and __contains__ accept either a Device or a
  device-ordinal int. The proxy is constructed fresh on every property
  access, so there is nothing to cache or pickle.

- Drops the _peer_accessible_by cache field (and its initializations
  in __cinit__, _DMR_init, and from_allocation_handle), eliminating
  the owned/non-owned read split. All pools now share the same code
  path and always query the driver.

- All bulk operations on the proxy (update, |=, &=, -=, ^=, clear,
  pop) issue exactly one cuMemPoolSetAccess call. Peer-access
  transitions can take seconds per pool because every existing memory
  mapping is updated, so coalescing into a single driver call lets the
  toolkit handle the mappings in parallel. The property setter
  (mr.peer_accessible_by = [...]) preserves its original single-call
  behavior via the same shared planner path.

- Single-element add validates can_access_peer through
  plan_peer_access_update, matching the existing setter contract.

This is a breaking change captured in the v1.0.0 release notes.
Callers comparing against tuples must update to set comparisons
(mr.peer_accessible_by == {Device(0)}). Existing tests are migrated;
new tests for set-interface conformance are intentionally deferred to
a follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 4, 2026
@Andy-Jost Andy-Jost added P1 Medium priority - Should do feature New feature or request cuda.core Everything related to the cuda.core module breaking Breaking changes are introduced labels May 4, 2026
@Andy-Jost Andy-Jost self-assigned this May 4, 2026

copy-pr-bot Bot commented May 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.



leofang commented May 4, 2026

FYI: A breaking change must be labeled with P0.

…s.pyx

The previous commit left DeviceMemoryResource carrying three pass-through
def methods (_query_peer_access_ids, _peer_access_includes,
_apply_peer_access_diff) whose only purpose was to give the pure-Python
proxy in _peer_access_utils.py a way to call cdef helpers in
_device_memory_resource.pyx. These methods served no public role and
cluttered the class API.

Promote _peer_access_utils.py to a Cython module so the proxy and the
driver-touching helpers can live together:

- Convert _peer_access_utils.py to _peer_access_utils.pyx. cimports
  cydriver and DeviceMemoryResource from the .pxd; uses nogil and direct
  CUmemAccessDesc packing identically to before.

- Move _DMR_query_peer_access_ids, _DMR_peer_access_includes,
  _DMR_apply_peer_access_diff, and _DMR_replace_peer_accessible_by from
  _device_memory_resource.pyx into the new module as cdef helpers (and a
  cpdef replace_peer_accessible_by entry point used by the property
  setter).

- Drop the three pass-through def methods from DeviceMemoryResource. The
  class is left with the property getter and setter only; everything else
  is module-level in _peer_access_utils.

- The proxy now calls the module-level cdef helpers directly instead of
  routing through methods on mr.

No behavior change. The public surface (PeerAccessibleBySetProxy,
plan_peer_access_update, normalize_peer_access_targets, PeerAccessPlan)
is preserved at the same import paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost added P0 High priority - Must do! and removed P1 Medium priority - Should do labels May 4, 2026
Andy-Jost and others added 5 commits May 4, 2026 16:30
Refactor _query_peer_access_ids so the entire driver loop runs inside a
single nogil block instead of acquiring/releasing the GIL once per
device. The flag query now uses a cached as_cu(mr._h_pool) handle and
fills a libcpp.vector[int]; because range(total) ascends, the result is
already sorted and the trailing sorted() call is dropped.

Also tighten the peer_accessible_by entry in 1.0.0-notes.rst: the
breaking-change blurb only needs to state the type/element change, so
remove the implementation-flavored details about input acceptance and
batched cuMemPoolSetAccess calls.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ge cases

Existing peer-access tests covered the integration path well (real
copies across peers, the full transition matrix, shared-pool consistency)
but only touched ``in``, ``==``, and the property setter on the new set
proxy. After the v1.0.0 break that surfaced ~25 ``MutableSet`` methods,
nothing was pinning the type-coercion contract, the owner-filtering
behavior, the ``KeyError``/value-error paths, or the "one
``cuMemPoolSetAccess`` per bulk op" performance invariant.

Add the following coverage in ``test_memory_peer_access.py``:

- A ``MutableSet`` conformance test using a relaxed
  ``assert_mutable_set_interface`` mode that admits subjects holding at
  most one insertable element. CI maxes at two GPUs (one peer), so the
  multi-element protocol pass cannot run there. The new
  ``support_multi_insert=False`` path takes one insertable item plus two
  non-member sentinels and exercises every ``MutableSet`` method
  (``add``/``discard``/``remove``/``pop``/``clear``/``update``,
  comparisons, isdisjoint, subset/superset, binary and in-place
  operators, ``__iter__``/``__len__``/``__repr__``).
- ``Device``/``int`` interchangeability on ``add``/``discard``/``__contains__``.
- The owner-device filtering contract on every write (silent no-op).
- Error paths: ``add(out_of_range)`` and ``add(non_coercible)`` raise
  while the lenient ``discard``/``__contains__`` paths swallow the same
  inputs; ``remove(non_member)`` raises ``KeyError``.
- "Live driver view" semantics: a proxy obtained before another wrapper
  modifies the pool reflects the change with no refresh step.
- ``__iter__`` ordering is ascending by ``device_id`` and elements are
  ``Device`` instances; ``__repr__`` includes the class name and tracks
  live contents; the getter returns the documented proxy type.
- A batching spy that monkeypatches the module-level
  ``_apply_peer_access_diff`` and asserts that every bulk op
  (``|=``/``&=``/``-=``/``^=``/``update``/``difference_update``/
  ``clear``) and the property setter issues at most one driver call,
  zero when the diff is empty.

To make the spy possible, ``_apply_peer_access_diff`` is now a
Python-visible ``def`` wrapper around a renamed
``_apply_peer_access_diff_cython`` ``cdef inline``. The proxy and the
property setter still call ``_apply_peer_access_diff`` by bare name,
which Cython resolves through the module's globals at runtime, so a
``monkeypatch.setattr(_peer_access_utils, "_apply_peer_access_diff", ...)``
intercepts them. The extra Python-level dispatch is negligible next to
``cuMemPoolSetAccess`` itself.

Co-authored-by: Cursor <cursoragent@cursor.com>
Augmented assignment on the ``peer_accessible_by`` property
(``dmr.peer_accessible_by |= {...}``) is two trips through the
proxy/setter pair, not one: Python fetches the proxy, the proxy mutates
itself in place via ``__ior__``, and Python then assigns the
(already-mutated) proxy back through the setter. That trailing setter
call computes the diff against current driver state, finds it empty,
and short-circuits inside the ``cdef inline`` before issuing any
``cuMemPoolSetAccess`` work — so the *driver-level* contract ("one
batched call per bulk op") still holds, but the wrapper is invoked
twice, which the spy was counting.

Also, the fixture's ``dmr.peer_accessible_by = []`` reset on an already
empty pool is itself an empty-delta wrapper call.

Filter the recorded calls down to those with non-empty deltas (the ones
that translate to real driver work) and switch the bulk-ops test to use
a locally bound proxy so augmented assignment goes through ``__ior__``
once with no extra setter invocation. The setter test stays on
``dmr.peer_accessible_by = ...`` because that is the public API
contract under test there.

Co-authored-by: Cursor <cursoragent@cursor.com>
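The two-trip sequence described in the commit above is plain Python property semantics, reproducible with stand-ins (hypothetical names; `Owner` plays the role of the resource with a `peers` property, `View` the proxy):

```python
class View(set):
    """Mutable-set stand-in for the proxy; |= mutates in place."""

    def __ior__(self, other):
        self.update(other)
        return self  # this same object is then assigned back via the setter

class Owner:
    """Stand-in for the resource owning the property pair."""

    def __init__(self):
        self._store = set()
        self.getter_calls = 0
        self.setter_calls = 0

    @property
    def peers(self):
        self.getter_calls += 1
        return View(self._store)

    @peers.setter
    def peers(self, value):
        self.setter_calls += 1
        self._store = set(value)

o = Owner()
# Augmented assignment on a property is three steps:
# 1. fetch the proxy (getter), 2. __ior__ mutates it, 3. assign back (setter).
o.peers |= {1}
```

The setter still runs with the already-mutated view, which is why the real implementation needs the empty-diff short-circuit to keep the driver-call count at one.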
…hing

Move the actual ``cuMemPoolSetAccess`` invocation (descriptor-array
build + driver call) into a thin Python-visible ``def _set_pool_access``
in ``_peer_access_utils.pyx``. ``_apply_peer_access_diff`` now does only
the empty-diff short-circuit and delegates the work to
``_set_pool_access``, which Cython resolves through the module globals
at runtime so tests can intercept it via ``monkeypatch.setattr``.

Replace the previous internal-wrapper spy with a driver-call spy that
counts every real ``cuMemPoolSetAccess`` invocation. Earlier no-op
layers (e.g. the augmented-assignment-on-property pattern that writes
an already-mutated proxy back through the setter) short-circuit before
reaching ``_set_pool_access``, so the recorded count is exactly the
number of driver calls. The empty-delta filter and the local-binding
workaround in the bulk-ops test are gone; we now also assert that
``dmr.peer_accessible_by |= {...}`` directly on the property is still
exactly one driver call.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the ``support_multi_insert`` flag and ``non_members`` keyword
with two purpose-built helpers:

- ``assert_mutable_set_interface(subject, items)`` keeps the original
  signature and contract: at least five distinct insertable items,
  exercised against a reference set in the standard way. The graph
  ``AdjacencySetProxy`` test continues to use this unchanged.
- ``assert_single_member_mutable_set_interface(subject, member,
  non_member)`` is a focused pass for proxies whose backing store admits
  at most one insertable element at a time (here, the peer-access view
  on a 2-GPU box). It threads a single member and one non-member
  sentinel through every ``MutableSet`` method.

The two helpers share small private utilities (empty-state checks,
``__repr__`` shape) but keep their public surfaces small and linear.
A capacity-one proxy is a meaningfully different contract from a
general mutable set; naming that explicitly in the API reads better
than a flag and avoids forcing call sites to plumb sentinels through.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost
Contributor Author

/ok to test

@Andy-Jost Andy-Jost marked this pull request as ready for review May 5, 2026 00:33
@Andy-Jost Andy-Jost requested a review from rparolin May 5, 2026 00:34
Comment on lines +168 to +177
def assert_single_member_mutable_set_interface(subject, member, non_member):
"""Exercise every MutableSet method on a subject with capacity one.

Use this for proxies whose backing store admits at most one insertable
element at a time (typically because the underlying resource is bounded
by hardware, e.g. a peer-access view on a system with a single valid
peer device). The subject only ever holds ``set()`` or ``{member}``;
*non_member* supplies the right-hand side of comparisons, ``isdisjoint``,
subset/superset, and binary/in-place operators so every ``MutableSet``
method is exercised at least once.
Contributor Author

@rparolin you might be able to use this to test AccessedBySetProxy in #1775.

Member

Let's get this PR merged and apply it in #1775 (assuming we still have time).



leofang commented May 5, 2026

Q: It is a big diff with a very niche use case. Are we sure this is P0?

@Andy-Jost
Copy link
Copy Markdown
Contributor Author

Andy-Jost commented May 5, 2026

Q: It is a big diff with a very niche use case. Are we sure this is P0?

The peer access list is currently a tuple. This change aligns with the set proxies in the graph module (AdjacencySetProxy) and the managed memory hint functions (AccessedBySetProxy). Without this, users would have to remember which interfaces use a proxy and which do not. I think as a matter of course we should favor container proxies; they are more Pythonic.

This change is not functionally as big as it might appear. (1) Most of the bulk is just code movement: I moved the peer access implementation from _device_memory_resource.pyx to _peer_access_utils.pyx just to keep the DMR implementation clean. (2) I had to change _peer_access_utils from a .py to a .pyx file to accommodate the moved code, and git didn't identify it as a move/rename.

Andy-Jost added 4 commits May 5, 2026 11:32
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/docs/source/api_private.rst
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/docs/source/api_private.rst
#	cuda_core/docs/source/release/1.0.0-notes.rst
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/cuda/core/_memory/_device_memory_resource.pyx
@leofang leofang left a comment (Member)

Overall well-structured PR. Cache elimination is the right call, proxy pattern is consistent with AdjacencySetProxy / AccessedBySetProxy, and test coverage is thorough. Comments tagged Critical / Consideration / Nitpick.

Comment thread cuda_core/cuda/core/_memory/_peer_access_utils.pyx
dev_id = _resolve_peer_device_id(value)
except (TypeError, ValueError):
return
self._apply((), [dev_id])
Member

Consideration — double resolution in discard.

discard resolves value to dev_id (int) here, then passes [dev_id] into _apply, which iterates removals and calls _resolve_peer_device_id(value) again on each element (line 381). That second call does Device(int).device_id — an extra Device() construction per element.

For a single-element discard this is one redundant Device() allocation. Not wrong, but since cuda.core cares about O(100 ns)-level overhead, you could either:

  • have _apply accept already-resolved ints for removals (a separate parameter or a sentinel), or
  • pass the raw value through and let _apply do the only resolution.

Contributor Author

My 2c: this level of optimization is not needed for these APIs. Peer access changes iterate through all memory mappings and can take on the order of seconds to complete. I'd favor simpler code here.

Member

I think the problem is going through the driver API... We need to discuss some form of reverting post-1.0. CUDA driver APIs aren't friendly for peer access queries, and this query has a real cost in user apps.

Contributor Author

It looks to me like Device(int) does not call any driver APIs though?

cdef DeviceMemoryResource mr = <DeviceMemoryResource>self._mr
owner_id = mr._dev_id
owner = Device(owner_id)
current = _query_peer_access_ids(mr)
Member

Consideration — every add/discard does a full device scan.

_apply unconditionally calls _query_peer_access_ids(mr) here, which loops over all devices with one cuMemPoolGetAccess per device. For add(dev1), a cheaper path would be:

  1. _peer_access_includes(mr, dev_id) — already granted? return.
  2. validate can_access_peer for the single device.
  3. _set_pool_access(mr, (dev_id,), ()).

That's 1 driver call for the common "already granted" case vs. N+1. For discard, similarly check _peer_access_includes first.

The current approach is correct and consistent (reuses the planner, validates can_access_peer), so this is about whether the per-device scan matters in practice. On a system with many GPUs (DGX-H100: 8 GPUs = 7 cuMemPoolGetAccess calls per add), it could add up.
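The cost difference the reviewer describes can be sketched with a fake pool that counts simulated `cuMemPoolGetAccess` queries (all names here are illustrative stand-ins, not the real helpers):

```python
class CountingPool:
    """Fake pool that counts simulated cuMemPoolGetAccess queries."""

    def __init__(self, n_devices, granted):
        self.n_devices = n_devices
        self.granted = set(granted)
        self.get_access_calls = 0

    def includes(self, dev_id):
        self.get_access_calls += 1               # one targeted query
        return dev_id in self.granted

    def query_all(self):
        self.get_access_calls += self.n_devices  # one query per device
        return set(self.granted)

def add_via_full_scan(pool, dev_id):
    """Original approach: diff against the full current set."""
    if dev_id not in pool.query_all():
        pool.granted.add(dev_id)

def add_via_targeted_check(pool, dev_id):
    """Reviewer's fast path: one membership query, grant only if missing."""
    if not pool.includes(dev_id):
        pool.granted.add(dev_id)
```

On the already-granted case for an 8-device box, the full scan issues one query per device while the targeted check issues exactly one.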

# These tests exercise the ``PeerAccessibleBySetProxy`` surface added in
# v1.0.0. They run against ``mempool_device_x2`` because every CI machine has
# at most 2 GPUs, which means at most one valid peer device. The relaxed
# ``support_multi_insert=False`` path on ``assert_mutable_set_interface``
Member

Nitpick — stale comment. This mentions support_multi_insert=False path on assert_mutable_set_interface, but the actual code below calls assert_single_member_mutable_set_interface — a separate function, not a flag. Looks like a leftover from an earlier iteration.

``CUgraphConditionalHandle`` value. Previously, ``.handle`` had to be
extracted explicitly.

- :attr:`DeviceMemoryResource.peer_accessible_by` now returns a
Member

Nitpick — other breaking-change entries link to their PR/issue (e.g. #1950, #2016). This one could link to #2018 for consistency.


leofang commented May 6, 2026

@Andy-Jost there are merge conflicts that need to be resolved

Andy-Jost and others added 2 commits May 6, 2026 13:17
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>
- Optimize add/discard to check the single target device (1 driver
  call) instead of scanning all devices via _query_peer_access_ids
- Add test for out-of-range int in __contains__ (returns False)
- Fix stale comment referencing removed support_multi_insert flag
- Add NVIDIA#2018 link to release notes entry

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost
Contributor Author

commit 3233538afb addresses:

  • Release notes: added #2018 link to the peer_accessible_by entry for consistency with other breaking-change items
  • __contains__ with out-of-range int: investigated and confirmed not a bug — Device(bad_id) raises ValueError (not CUDAError) at _device.pyx:984, which the existing except (TypeError, ValueError) catches correctly. Added a test for bad_id not in proxy to lock this in.
  • Stale comment in test_memory_peer_access.py: updated to reference assert_single_member_mutable_set_interface instead of the removed support_multi_insert=False flag
  • add/discard full device scan: optimized both to use a single _peer_access_includes call (1 cuMemPoolGetAccess) instead of scanning all devices via _query_peer_access_ids (N calls). Bulk ops are unchanged — they already need the full set for diff computation.

@Andy-Jost Andy-Jost requested a review from leofang May 6, 2026 21:50
@leofang leofang left a comment (Member)

Approving with a caveat (#2018 (comment)).

@leofang leofang enabled auto-merge (squash) May 6, 2026 21:51
@leofang leofang merged commit 3432b28 into NVIDIA:main May 6, 2026
180 of 182 checks passed
@Andy-Jost Andy-Jost deleted the ajost/peer-accessible-by-set-proxy branch May 6, 2026 23:14

Labels

breaking Breaking changes are introduced cuda.core Everything related to the cuda.core module feature New feature or request P0 High priority - Must do!
