cuda.core: keep kernel-argument objects alive in graph kernel nodes #2041
Andy-Jost wants to merge 3 commits into NVIDIA:main from
Conversation
`GraphDefinition.launch()` did not extend the lifetime of the Python kernel-argument objects to the lifetime of the graph. The `ParamHolder` built in `GN_launch` held the only references to those objects and was destroyed when `GN_launch` returned. The driver only stores the raw pointer values in the kernel node, so a `Buffer` reachable only through the call could be GC'd before the graph ran, leaving the graph with a stale device pointer. Attach the `kernel_args` tuple to the graph as a CUDA user object, mirroring the existing handling of `KernelHandle` and `EventHandle`. This reuses the `_py_host_destructor` path already used by the host callback machinery. Closes NVIDIA#2039 Co-authored-by: Cursor <cursoragent@cursor.com>
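The failure mode can be reproduced in spirit without a GPU. Below is a minimal sketch (the `Buffer` class and `build_node` helper are hypothetical stand-ins, not cuda.core APIs) of how an object reachable only through a call can be collected while a node retains nothing but its raw pointer value:

```python
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for a device buffer."""

    def __init__(self):
        self.ptr = id(self)  # pretend this is a device pointer


def build_node(args):
    # Like the driver-side kernel node, store only the raw pointer
    # values, not references to the Python objects themselves.
    return [a.ptr for a in args]


buf = Buffer()
ref = weakref.ref(buf)
node = build_node((buf,))  # the temporary holder of args dies on return
del buf
gc.collect()
assert ref() is None  # the Buffer is gone; node now holds a stale pointer
```

Attaching the `kernel_args` tuple to the graph as a user object gives the graph a strong reference, so the `assert` above would instead see the object still alive for the graph's lifetime.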
force-pushed from e50c99e to 6654645
```python
del buf
gc.collect()
assert buf_weak() is not None  # graph kept the Buffer alive
```
The test proves the buffer is kept alive, but it doesn't validate that it's cleaned up after the graph is released.
I added a test for this. If it is flaky, we might need to adjust the `CU_USER_OBJECT_NO_DESTRUCTOR_SYNC` flag so that graph destructors cannot be invoked asynchronously.
Update: I confirmed this is not a concern for source graphs. Asynchronous destruction only comes into play for exec graphs.
This test creates an exec graph, so there is a race. CI for free-threaded Python seems more likely to trigger it. 9f2c8f2 adds polling, but removing the test would also be defensible.
rparolin left a comment:
The tests should validate that the buffer is eventually freed once the graph's refcount is decremented.
Addresses review feedback (PR NVIDIA#2041): the existing test only proved the graph kept the Buffer alive, not that the user-object machinery actually releases it once the graph is destroyed. Without the symmetric check, a working attachment is indistinguishable from a permanent leak. Co-authored-by: Cursor <cursoragent@cursor.com>
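The symmetric check can be sketched without a GPU. Here `Graph` is a minimal stand-in (not the real cuda.core graph) whose retained-args tuple plays the role of the CUDA user object:

```python
import gc
import weakref


class Buffer:
    """Hypothetical stand-in for a device buffer."""


class Graph:
    """Stand-in: holding the args tuple mimics the user-object attachment."""

    def __init__(self, kernel_args):
        self._retained = kernel_args  # args live as long as the graph


buf = Buffer()
ref = weakref.ref(buf)
graph = Graph((buf,))
del buf
gc.collect()
assert ref() is not None  # the graph keeps the Buffer alive

del graph
gc.collect()
assert ref() is None  # releasing the graph frees the Buffer
```

Without the second half, a working attachment is indistinguishable from a permanent leak, which is exactly the gap the review feedback points at.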
Below are the Cursor GPT-5.4 Extra High Fast findings. It was thinking far longer than I'd have expected for a PR this size. I'm not sure which of these are actually actionable. Re 1: Do we care about stream-captured graphs?
Thanks @rwgk
I'll look into this. It might need to be deferred because AFAIK the stream capture path does not create any user objects.
We have a huge class of possible errors of this type, unfortunately. A better approach than storing the
I have to rework the whole user object design for #1330 (step 4) and I plan to address this.
This is a good catch, and it is not fixed by this PR (which is why kernel-arg update is so messy, as noted during the team sync today). I am fine with this PR only fixing the explicit graph-construction path.
The freeing assertion at the end of test_kernel_args_buffer_lifetime failed on free-threaded Python (py3.14t) because cuGraphExecDestroy releases its user-object references via an asynchronous DPC, and free-threaded CPython's deferred reference counting can need an extra GC pass to settle. Poll the weakref with a bounded timeout and per-iteration GC instead of asserting eagerly. Co-authored-by: Cursor <cursoragent@cursor.com>
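The bounded-polling pattern described in that fix can be sketched as follows; `wait_for_collection` is an illustrative helper name, not necessarily what the test uses:

```python
import gc
import time
import weakref


def wait_for_collection(ref, timeout=5.0, interval=0.05):
    """Poll a weakref until its referent is collected or timeout expires.

    Useful when the final decref happens asynchronously (e.g. a driver
    DPC releasing user-object references) and free-threaded CPython may
    need an extra GC pass before deferred reference counts settle.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        gc.collect()  # per-iteration GC, as the commit message describes
        if ref() is None:
            return True
        time.sleep(interval)
    return ref() is None


class Payload:
    pass


obj = Payload()
ref = weakref.ref(obj)
del obj
assert wait_for_collection(ref)  # collected within the timeout
```

Polling converts an eager assertion that races against asynchronous destruction into one that only fails if the object is still alive after a generous deadline.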
Summary
Closes #2039.
`GraphDefinition.launch()` did not extend the lifetime of Python kernel-argument objects (e.g. `Buffer`) to the lifetime of the graph. The ownership represented by a `ParamHolder` constructed in `GN_launch` needs to be attached to the graph to avoid the possibility of stale arguments producing memory corruption or a crash on launch.

Changes

- `cuda_core/cuda/core/graph/_graph_node.pyx`: in `GN_launch`, attach the `kernel_args` tuple to the graph as a CUDA user object, mirroring the existing handling of `KernelHandle` and `EventHandle`. Reuses the `_py_host_destructor` path already used by the host-callback machinery.
- `cuda_core/cuda/core/graph/_utils.pxd`: expose `_py_host_destructor` so the new caller can use it.

The new attachment runs only on the graph-construction path and is paid once per kernel node at build time, not at execution time. It does not affect the regular (non-graph) launch path in `_launcher.pyx`.

Test Coverage
Two tests added in `cuda_core/tests/graph/test_graph_definition_lifetime.py`:

- `test_kernel_args_buffer_kept_alive_through_execution`: a `Buffer` passed as a kernel arg survives `del buf` + `gc.collect()` (weakref check), and the graph executes correctly against its memory after instantiation (value check).
- `test_kernel_args_survive_graph_clone`: same scenario but via `cuGraphClone`, which doesn't carry Python-level references; only CUDA user objects can keep the args alive across the clone.

Related Work
This PR does not harden `_py_host_destructor` against being invoked after `Py_Finalize`. That is a pre-existing risk (also present on the host-callback path) that this PR inherits but does not introduce or widen.