
cuda.core: keep kernel-argument objects alive in graph kernel nodes#2041

Open
Andy-Jost wants to merge 3 commits into NVIDIA:main from Andy-Jost:ajost/graph-kernel-args-lifetime

Conversation

Contributor

@Andy-Jost Andy-Jost commented May 6, 2026

Summary

Closes #2039.

GraphDefinition.launch() did not extend the lifetime of Python kernel-argument objects (e.g. Buffer) to the lifetime of the graph. The ownership represented by a ParamHolder constructed in GN_launch needs to be attached to the graph to avoid the possibility of stale arguments producing memory corruption or a crash on launch.

Changes

  • cuda_core/cuda/core/graph/_graph_node.pyx: in GN_launch, attach the kernel_args tuple to the graph as a CUDA user object, mirroring the existing handling of KernelHandle and EventHandle. Reuses the _py_host_destructor path already used by the host-callback machinery.
  • cuda_core/cuda/core/graph/_utils.pxd: expose _py_host_destructor so the new caller can use it.

The new attachment runs only on the graph-construction path and is paid once per kernel node at build time, not at execution time. It does not affect the regular (non-graph) launch path in _launcher.pyx.
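The mechanism can be modeled in plain Python. The sketch below is illustrative only: `FakeGraph`, `FakeBuffer`, and `attach_user_object` are invented names standing in for the CUDA user-object machinery, not the actual cuda.core internals.

```python
import gc
import weakref


class FakeGraph:
    """Toy model of a CUDA graph: user objects are retained at graph scope."""

    def __init__(self):
        self._user_objects = []  # analogue of cuGraphRetainUserObject

    def attach_user_object(self, obj):
        # A strong reference here mirrors attaching the kernel_args tuple
        # to the graph as a CUDA user object in GN_launch.
        self._user_objects.append(obj)

    def destroy(self):
        # Releasing the graph drops the references, as
        # cuGraphReleaseUserObject would on graph destruction.
        self._user_objects.clear()


class FakeBuffer:
    """Weakref-able stand-in for a Buffer kernel argument."""


graph = FakeGraph()
buf = FakeBuffer()
buf_weak = weakref.ref(buf)

# Analogue of GN_launch: the node itself records only raw pointer values,
# so the Python argument objects must be attached to the graph explicitly.
graph.attach_user_object((buf,))

del buf
gc.collect()
alive_after_del = buf_weak() is not None      # graph keeps the Buffer alive

graph.destroy()
gc.collect()
alive_after_destroy = buf_weak() is not None  # released with the graph
```

The symmetric pair of checks (alive while the graph exists, collectable once it is destroyed) is the same shape the tests below use.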

Test Coverage

Two tests added in cuda_core/tests/graph/test_graph_definition_lifetime.py:

  • test_kernel_args_buffer_kept_alive_through_execution: a Buffer passed as a kernel arg survives del buf + gc.collect() (weakref check) and the graph executes correctly against its memory after instantiation (value check).
  • test_kernel_args_survive_graph_clone: same scenario but via cuGraphClone, which doesn't carry Python-level references — only CUDA user objects can keep the args alive across the clone.

Related Work

@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 6, 2026
@Andy-Jost Andy-Jost added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels May 6, 2026
@Andy-Jost Andy-Jost self-assigned this May 6, 2026
`GraphDefinition.launch()` did not extend the lifetime of the Python
kernel-argument objects to the lifetime of the graph. The `ParamHolder`
built in `GN_launch` held the only references to those objects and was
destroyed when `GN_launch` returned. The driver only stores the raw
pointer values in the kernel node, so a `Buffer` reachable only through
the call could be GC'd before the graph ran, leaving the graph with a
stale device pointer.

Attach the `kernel_args` tuple to the graph as a CUDA user object,
mirroring the existing handling of `KernelHandle` and `EventHandle`.
This reuses the `_py_host_destructor` path already used by the host
callback machinery.

Closes NVIDIA#2039

Co-authored-by: Cursor <cursoragent@cursor.com>
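The failure mode described in the commit message can be illustrated without CUDA: record only a raw pointer-like integer, drop the last Python reference, and the object is collected while the integer survives. This is an illustrative sketch, not the cuda.core code.

```python
import gc
import weakref


class FakeBuffer:
    """Stand-in for a Buffer whose device pointer is snapshotted as an int."""

    def __init__(self):
        self.handle = id(self)  # pretend device pointer


def add_kernel_node(buf):
    # Like the driver's kernel node, only the raw pointer value is stored;
    # no reference to the Python object survives this call.
    return buf.handle


buf = FakeBuffer()
buf_weak = weakref.ref(buf)
node_ptr = add_kernel_node(buf)

del buf
gc.collect()

# The integer lives on, but the object it pointed at has been collected:
# launching a graph built this way would dereference a stale pointer.
buffer_collected = buf_weak() is None
```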
@Andy-Jost Andy-Jost force-pushed the ajost/graph-kernel-args-lifetime branch from e50c99e to 6654645 Compare May 6, 2026 21:29
del buf
gc.collect()
assert buf_weak() is not None # graph kept the Buffer alive
Collaborator
The test proves the buffer is kept alive, but it doesn't validate that it's cleaned up after the graph is released.

Contributor Author

@Andy-Jost Andy-Jost May 6, 2026
I added a test for this. If it is flaky, we might need to adjust the CU_USER_OBJECT_NO_DESTRUCTOR_SYNC flag so that graph destructors cannot be invoked asynchronously.

Update: I confirmed this is not a concern for source graphs. Asynchronous destruction only comes into play for exec graphs.

Contributor Author

This test creates an exec graph, so there is a race. CI for free-threaded Python seems more likely to trigger it. 9f2c8f2 adds polling, but removing the test would also be defensible.
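The polling pattern can be sketched generically. The helper below is hypothetical (not the code in 9f2c8f2): it bounds the wait and runs a GC pass each iteration, so a destructor that fires asynchronously cannot race an eager assertion.

```python
import gc
import threading
import time
import weakref


def wait_for_collection(ref, timeout=5.0, interval=0.01):
    """Poll a weakref until its referent is collected or the timeout expires.

    An asynchronous destructor (such as the DPC fired when an exec graph is
    destroyed) may release the last reference at an arbitrary later time, so
    an eager assertion races; polling with per-iteration GC does not.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        gc.collect()
        if ref() is None:
            return True
        time.sleep(interval)
    return False


class FakeBuffer:
    """Weakref-able stand-in for the object kept alive by the graph."""


holder = [FakeBuffer()]
ref = weakref.ref(holder[0])

# Simulate an asynchronous release: the reference is dropped a little
# later, on another thread, the way a driver DPC would drop it.
threading.Timer(0.05, holder.clear).start()

freed = wait_for_collection(ref)
```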

Collaborator

@rparolin rparolin left a comment

The tests should validate that the buffer is eventually freed once the graph's refcount is decremented.

Addresses review feedback (PR NVIDIA#2041): the existing test only proved the
graph kept the Buffer alive, not that the user-object machinery actually
releases it once the graph is destroyed. Without the symmetric check, a
working attachment is indistinguishable from a permanent leak.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rwgk
Contributor

rwgk commented May 6, 2026

Below are the Cursor GPT-5.4 Extra High Fast findings. It was thinking far longer than I'd have expected for a PR this size.

I'm not sure which of these are actually actionable:

Re 1. Do we care about stream-captured graphs?
Re 2. Could we simply document that we don't protect against explicit release?
Re 3. This seems OK to me? (I.e. I mean we can ignore this finding, or document the behavior?)

@Andy-Jost


  1. High: Stream-captured graphs still use the unfixed launcher path

    launch() still accepts a GraphBuilder and creates a stack-local ParamHolder in cuda_core/cuda/core/_launcher.pyx:23 and cuda_core/cuda/core/_launcher.pyx:47, but the new ownership transfer exists only in GN_launch() at cuda_core/cuda/core/graph/_graph_node.pyx:623. That means launch(gb, ..., buf) can still capture a kernel node whose Buffer or other Python owner is freed after capture, leaving the graph with a stale raw pointer. This is a real public path, not just a hypothetical one; existing tests exercise captured kernel launches with Buffer args in cuda_core/tests/graph/test_graph_memory_resource.py:171. The new regression tests only cover the explicit GraphDefinition path in cuda_core/tests/graph/test_graph_definition_lifetime.py:497 and cuda_core/tests/graph/test_graph_definition_lifetime.py:533.

  2. High: Retaining the Buffer wrapper does not protect against explicit release

    The PR now keeps ker_args.kernel_args alive at cuda_core/cuda/core/graph/_graph_node.pyx:623, but ParamHolder snapshots a Buffer argument as a raw device pointer value in cuda_core/cuda/core/_kernel_arg_handler.pyx:283 and cuda_core/cuda/core/_kernel_arg_handler.pyx:287. If user code explicitly calls buf.close() or exits a context manager, Buffer_close() resets _h_ptr and can free the allocation immediately at cuda_core/cuda/core/_memory/_buffer.pyx:596. At that point the graph still owns only the stale integer pointer copied during node creation; keeping the Python Buffer object alive does not keep the underlying allocation handle alive.

  3. Medium: The new graph-scoped attachment can pin kernel-argument owners after node deletion

    _attach_user_object() creates a CUDA user object and moves it into the graph at cuda_core/cuda/core/graph/_utils.pyx:41, but it does not return any handle that could later be released per node. GraphNode.destroy() only calls cuGraphDestroyNode() at cuda_core/cuda/core/graph/_graph_node.pyx:159. Because CUDA user objects are retained and released at graph scope (cuGraphRetainUserObject / cuGraphReleaseUserObject), deleting or rewiring a kernel node can now leave large kernel-argument owners, including Buffers, pinned until the entire graph is destroyed. Node mutation is already a supported workflow in cuda_core/tests/graph/test_graph_definition_mutation.py:159.
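Finding 2 above can be reproduced in miniature: retaining the Python wrapper does not retain the allocation once close() runs. This is an illustrative model; the real Buffer_close() details are simplified away.

```python
class FakeAllocation:
    """Stand-in for the underlying device allocation."""

    def __init__(self):
        self.freed = False


class FakeBuffer:
    """Stand-in for a Buffer: close() frees the allocation immediately,
    regardless of who still holds the Python wrapper."""

    def __init__(self):
        self._alloc = FakeAllocation()
        self.handle = id(self._alloc)  # raw pointer snapshotted by the node

    def close(self):
        self._alloc.freed = True  # analogue of Buffer_close() freeing memory
        self._alloc = None


buf = FakeBuffer()
alloc = buf._alloc                # peek at the allocation for the check below
graph_user_objects = [(buf,)]     # the graph retains the wrapper...
node_ptr = buf.handle             # ...while the node keeps only the integer

buf.close()

wrapper_alive = graph_user_objects[0][0] is buf  # wrapper is retained
allocation_freed = alloc.freed                   # but the memory is gone
```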

@Andy-Jost
Contributor Author

Andy-Jost commented May 6, 2026

Thanks @rwgk

  1. High: Stream-captured graphs still use the unfixed launcher path

I'll look into this. It might need to be deferred because AFAIK the stream capture path does not create any user objects.

  2. High: Retaining the Buffer wrapper does not protect against explicit release

We have a huge class of possible errors of this type, unfortunately. A better approach than storing the Buffer would be to store the DevicePtrHandle (a std::shared_ptr owning the buffer). Definitely out of scope for this change.

  3. Medium: The new graph-scoped attachment can pin kernel-argument owners after node deletion

I have to rework the whole user object design for #1330 (step 4) and I plan to address this.
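The DevicePtrHandle idea can be sketched in Python. This is a hypothetical design, an assumption modeled on std::shared_ptr semantics: the allocation is freed with the last owner, so buf.close() merely drops one reference while a graph node keeps the handle alive.

```python
import gc


class DevicePtrHandle:
    """Shared-ownership handle: the allocation is freed only when the last
    holder lets go, analogous to a std::shared_ptr owning the memory."""

    def __init__(self, ptr, state):
        self.ptr = ptr
        self._state = state

    def __del__(self):
        self._state["freed"] = True  # free the allocation with the last owner


class FakeBuffer:
    """Buffer-like wrapper that shares ownership of its handle."""

    def __init__(self, handle):
        self._handle = handle

    def close(self):
        # Drop only this wrapper's reference; other owners (e.g. a graph
        # node) keep the allocation alive.
        self._handle = None


state = {"freed": False}
buf = FakeBuffer(DevicePtrHandle(0xDEAD, state))
node_handle = buf._handle  # the graph node retains the shared handle

buf.close()
gc.collect()
freed_after_close = state["freed"]    # False: the node still owns it

del node_handle
gc.collect()
freed_after_release = state["freed"]  # True: last owner released
```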

@leofang
Member

leofang commented May 6, 2026

High: Stream-captured graphs still use the unfixed launcher path

This is a good catch, and it is not fixed with this PR. (Which is why kernel arg update is so messy, as noted during the team sync today). I am fine with this PR only fixing the explicit graph construction path.

The freeing assertion at the end of test_kernel_args_buffer_lifetime
failed on free-threaded Python (py3.14t) because cuGraphExecDestroy
releases its user-object references via an asynchronous DPC, and free-
threaded CPython's deferred ref counting can need an extra GC pass to
settle. Poll the weakref with a bounded timeout and per-iteration GC
instead of asserting eagerly.

Co-authored-by: Cursor <cursoragent@cursor.com>
