Skip to content

cuda.core: Cythonize GraphBuilder and Graph with handle-layer cleanup#2008

Open
Andy-Jost wants to merge 8 commits intoNVIDIA:mainfrom
Andy-Jost:graph-builder-refactor
Open

cuda.core: Cythonize GraphBuilder and Graph with handle-layer cleanup#2008
Andy-Jost wants to merge 8 commits intoNVIDIA:mainfrom
Andy-Jost:graph-builder-refactor

Conversation

@Andy-Jost
Copy link
Copy Markdown
Contributor

@Andy-Jost Andy-Jost commented May 1, 2026

Summary

Convert GraphBuilder and Graph from Python classes (using _MembersNeededForFinalize + weakref.finalize) to Cython cdef class objects backed by typed C++ resource handles.

This does two things. First, it lays groundwork for step 3 of #1330 (graph updates) by giving graph objects the same handle-based ownership pattern as the rest of cuda.core. Second, it clarifies GraphBuilder's state machine: what used to be a tangle of implicit flags and conditional cleanup paths is now two orthogonal enums — _BuilderKind (PRIMARY/FORKED/CONDITIONAL_BODY) describing how the builder was created, and _CaptureState (CAPTURE_NOT_STARTED/CAPTURING/CAPTURE_ENDED) tracking the capture lifecycle. Methods can now check exactly the state they care about, illegal transitions are detectable, and __dealloc__ has a single, well-defined condition for ending capture.

Changes

  • Add GraphExecHandle to the resource-handle layer (_cpp/resource_handles.{hpp,cpp}, _resource_handles.{pxd,pyx}), wrapping CUgraphExec with a cuGraphExecDestroy-based deleter run under GILReleaseGuard.
  • GraphBuilder becomes a cdef class with the explicit _BuilderKind/_CaptureState enums described above. Live-API methods (begin_building, end_building, embed, split, join, etc.) move to nogil cydriver paths where practical, and end-of-capture in __dealloc__ runs against the cached StreamHandle rather than reaching into a possibly-cleared Stream attribute.
  • Graph becomes a cdef class holding GraphExecHandle _h_graph_exec directly; update/upload/launch move to nogil cydriver. weakref.finalize is gone.
  • Device.create_graph_builder and Stream.create_graph_builder cimport GraphBuilder and call its _init factory; quoted forward-reference annotations are removed (clears Cython "Strings should no longer be used for type declarations" warnings).

Related work

Andy-Jost added 3 commits May 1, 2026 14:52
…hine

Refactor GraphBuilder from a Python class using _MembersNeededForFinalize
to a cdef class with explicit _BuilderKind (PRIMARY/FORKED/CONDITIONAL_BODY)
and _CaptureState (NOT_STARTED/CAPTURING/ENDED) tracking. Cleanup moves
into __dealloc__/close, and the builder now uses GraphHandle/StreamHandle
from _resource_handles instead of holding raw driver objects. Drop the
is_stream_owner flag now that StreamHandle owns the lifetime.

End-capture paths in __dealloc__ and close guard on _h_stream so cleanup
is safe even if _init* fails before completing assignment.

Made-with: Cursor
Add a GraphExecHandle to the resource-handle layer (parallel to
GraphHandle) wrapping CUgraphExec with RAII cleanup via
cuGraphExecDestroy on shared_ptr release. Convert Graph from a Python
class using _MembersNeededForFinalize to a cdef class holding a typed
_h_graph_exec attribute, dropping the weakref.finalize machinery.
update/upload/launch move to nogil cydriver paths consistent with the
GraphBuilder rewrite.

Also drop quoted forward-reference annotations on create_graph_builder
and _instantiate_graph/complete now that GraphBuilder is cimported in
_device.pyx and _stream.pyx and Cython accepts the in-module forward
reference to Graph. Clears the related "Strings should no longer be
used for type declarations" warnings.

Made-with: Cursor
The cdef-class member declarations live in the .pxd, so the .pyx does
not need to re-cimport GraphExecHandle, GraphHandle, or StreamHandle.

Made-with: Cursor
@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 1, 2026
@Andy-Jost Andy-Jost added enhancement Any code-related improvements P1 Medium priority - Should do cuda.core Everything related to the cuda.core module labels May 1, 2026
@Andy-Jost Andy-Jost self-assigned this May 1, 2026
… cycle

cimport-ing GraphBuilder at the top of _stream.pyx and _device.pyx made
Cython emit a Python-level import of cuda.core.graph._graph_builder
during _stream module init. That triggered the chain
graph -> _graph_node -> _kernel_arg_handler -> _memory._buffer
-> _device, which then re-entered the still-initializing _stream module
via "from cuda.core._stream import IsStreamT", failing with
ImportError: cannot import name IsStreamT.

Restore the original lazy "import GraphBuilder" inside
create_graph_builder (Stream and Device) and Stream_accept. The return
annotations stay as bare names; "from __future__ import annotations" in
both files defers their evaluation, so they need not resolve at
function-definition time.

Made-with: Cursor
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

@Andy-Jost Andy-Jost marked this pull request as draft May 4, 2026 17:12
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

The previous import-cycle fix changed _stream/_device.create_graph_builder
to a lazy Python "import GraphBuilder" instead of a module-level cimport.
With _init declared as @staticmethod cdef, Python attribute lookup
cannot find it, so every test that builds a graph failed with
"AttributeError: type object 'GraphBuilder' has no attribute '_init'"
at _device.pyx:1376 / _stream.pyx:376.

Convert _init from @staticmethod cdef to @staticmethod def (matches the
Stream._init pattern) and drop the cdef declaration from the .pxd.
_init runs once per builder creation, so the loss of cdef-level
dispatch is irrelevant. Graph._init stays cdef; it is only called
intra-module.

Made-with: Cursor
@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

Andy-Jost and others added 2 commits May 5, 2026 10:35
Every graph-builder test failed with CUDA_ERROR_INVALID_VALUE on the
new ``GraphBuilder.begin_building`` path. The driver rejects
``cuStreamGetCaptureInfo`` when ``captureStatus_out`` is NULL, but the
new ``_get_capture_info`` helper accepted a NULL status pointer and
``begin_building`` was calling it that way (it just wanted the freshly
captured graph handle and assumed the status was implied by the
preceding ``cuStreamBeginCapture``).

Pass a stack-local ``CUstreamCaptureStatus`` and document the helper's
requirement that ``status`` be non-NULL. ``graph`` is still allowed to
be NULL (``is_building`` calls it that way and the driver accepts it).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

@Andy-Jost Andy-Jost requested a review from leofang May 5, 2026 17:55
@Andy-Jost Andy-Jost marked this pull request as ready for review May 5, 2026 17:55
Andy-Jost added a commit to Andy-Jost/cuda-python that referenced this pull request May 5, 2026
Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit
`GraphDefinition` view that shares ownership of the underlying `CUgraph`.
The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the
user-facing surface and locks in the state-guard rules.

State semantics:

- PRIMARY builder: only valid after `end_building()`. Before
  `begin_building()` no graph exists; during capture the driver is the
  sole writer, so explicit access is unsafe.
- CONDITIONAL_BODY builder: valid both before `begin_building()` (the
  body graph is allocated at conditional-node creation time) and after
  `end_building()`. This enables a hybrid flow where a conditional body
  is populated entirely via the explicit API, with no capture at all.
- FORKED builder: never valid. Forked builders share the primary's
  graph; access through the primary instead.

Tests cover the happy path, both hybrid flows on conditional bodies
(populate-via-explicit-API and capture-then-augment), the three error
states (forked, capturing, primary pre-capture), and the
shared-ownership guarantee (the `GraphDefinition` survives the
builder's `close()`).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost force-pushed the graph-builder-refactor branch from f0b2e78 to 3c6c387 Compare May 5, 2026 19:15
Andy-Jost added a commit to Andy-Jost/cuda-python that referenced this pull request May 5, 2026
Completes step 3 of NVIDIA#1330 by exposing the captured graph as an explicit
`GraphDefinition` view that shares ownership of the underlying `CUgraph`.
The handle-layer plumbing landed in PR NVIDIA#2008; this commit wires up the
user-facing surface and locks in the state-guard rules.

State semantics:

- PRIMARY builder: only valid after `end_building()`. Before
  `begin_building()` no graph exists; during capture the driver is the
  sole writer, so explicit access is unsafe.
- CONDITIONAL_BODY builder: valid both before `begin_building()` (the
  body graph is allocated at conditional-node creation time) and after
  `end_building()`. This enables a hybrid flow where a conditional body
  is populated entirely via the explicit API, with no capture at all.
- FORKED builder: never valid. Forked builders share the primary's
  graph; access through the primary instead.

Tests cover the happy path, both hybrid flows on conditional bodies
(populate-via-explicit-API and capture-then-augment), the three error
states (forked, capturing, primary pre-capture), and the
shared-ownership guarantee (the `GraphDefinition` survives the
builder's `close()`).

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P1 Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant