
cuda.core: keep kernel-argument objects alive in graph kernel nodes#2041

Open
Andy-Jost wants to merge 3 commits into NVIDIA:main from Andy-Jost:ajost/graph-kernel-args-lifetime

Conversation

Contributor

@Andy-Jost Andy-Jost commented May 6, 2026

Summary

Closes #2039.

GraphDefinition.launch() did not extend the lifetime of Python kernel-argument objects (e.g. Buffer) to the lifetime of the graph. The ownership represented by a ParamHolder constructed in GN_launch needs to be attached to the graph to avoid the possibility of stale arguments producing memory corruption or a crash on launch.

Changes

  • cuda_core/cuda/core/graph/_graph_node.pyx: in GN_launch, attach the kernel_args tuple to the graph as a CUDA user object, mirroring the existing handling of KernelHandle and EventHandle. Reuses the _py_host_destructor path already used by the host-callback machinery.
  • cuda_core/cuda/core/graph/_utils.pxd: expose _py_host_destructor so the new caller can use it.

The new attachment runs only on the graph-construction path and is paid once per kernel node at build time, not at execution time. It does not affect the regular (non-graph) launch path in _launcher.pyx.
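The mechanism can be modeled in plain Python. The sketch below is illustrative only: `FakeGraph`, `FakeBuffer`, and `attach_user_object` are invented names standing in for the CUDA user-object machinery, not the actual cuda.core internals.

```python
import gc
import weakref


class FakeGraph:
    """Toy model of a CUDA graph: user objects are retained at graph scope."""

    def __init__(self):
        self._user_objects = []  # analogue of cuGraphRetainUserObject

    def attach_user_object(self, obj):
        # A strong reference here mirrors attaching the kernel_args tuple
        # to the graph as a CUDA user object in GN_launch.
        self._user_objects.append(obj)

    def destroy(self):
        # Releasing the graph drops the references, as
        # cuGraphReleaseUserObject would on graph destruction.
        self._user_objects.clear()


class FakeBuffer:
    """Weakref-able stand-in for a Buffer kernel argument."""


graph = FakeGraph()
buf = FakeBuffer()
buf_weak = weakref.ref(buf)

# Analogue of GN_launch: the node itself records only raw pointer values,
# so the Python argument objects must be attached to the graph explicitly.
graph.attach_user_object((buf,))

del buf
gc.collect()
alive_after_del = buf_weak() is not None      # graph keeps the Buffer alive

graph.destroy()
gc.collect()
alive_after_destroy = buf_weak() is not None  # released with the graph
```

The symmetric pair of checks (alive while the graph exists, collectable once it is destroyed) is the same shape the tests below use.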

Test Coverage

Two tests added in cuda_core/tests/graph/test_graph_definition_lifetime.py:

  • test_kernel_args_buffer_kept_alive_through_execution: a Buffer passed as a kernel arg survives del buf + gc.collect() (weakref check) and the graph executes correctly against its memory after instantiation (value check).
  • test_kernel_args_survive_graph_clone: same scenario but via cuGraphClone, which doesn't carry Python-level references — only CUDA user objects can keep the args alive across the clone.

Related Work

@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 6, 2026
@Andy-Jost Andy-Jost added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels May 6, 2026
@Andy-Jost Andy-Jost self-assigned this May 6, 2026
`GraphDefinition.launch()` did not extend the lifetime of the Python
kernel-argument objects to the lifetime of the graph. The `ParamHolder`
built in `GN_launch` held the only references to those objects and was
destroyed when `GN_launch` returned. The driver only stores the raw
pointer values in the kernel node, so a `Buffer` reachable only through
the call could be GC'd before the graph ran, leaving the graph with a
stale device pointer.

Attach the `kernel_args` tuple to the graph as a CUDA user object,
mirroring the existing handling of `KernelHandle` and `EventHandle`.
This reuses the `_py_host_destructor` path already used by the host
callback machinery.

Closes NVIDIA#2039

Co-authored-by: Cursor <cursoragent@cursor.com>
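The failure mode described in the commit message can be illustrated without CUDA: record only a raw pointer-like integer, drop the last Python reference, and the object is collected while the integer survives. This is an illustrative sketch, not the cuda.core code.

```python
import gc
import weakref


class FakeBuffer:
    """Stand-in for a Buffer whose device pointer is snapshotted as an int."""

    def __init__(self):
        self.handle = id(self)  # pretend device pointer


def add_kernel_node(buf):
    # Like the driver's kernel node, only the raw pointer value is stored;
    # no reference to the Python object survives this call.
    return buf.handle


buf = FakeBuffer()
buf_weak = weakref.ref(buf)
node_ptr = add_kernel_node(buf)

del buf
gc.collect()

# The integer lives on, but the object it pointed at has been collected:
# launching a graph built this way would dereference a stale pointer.
buffer_collected = buf_weak() is None
```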
@Andy-Jost Andy-Jost force-pushed the ajost/graph-kernel-args-lifetime branch from e50c99e to 6654645 Compare May 6, 2026 21:29
del buf
gc.collect()
assert buf_weak() is not None # graph kept the Buffer alive
Collaborator
The test proves the buffer is kept alive, but it doesn't validate that it's cleaned up after the graph is released.

Contributor Author

@Andy-Jost Andy-Jost May 6, 2026
I added a test for this. If it is flaky, we might need to adjust the CU_USER_OBJECT_NO_DESTRUCTOR_SYNC flag so that graph destructors cannot be invoked asynchronously.

Update: I confirmed this is not a concern for source graphs. Asynchronous destruction only comes into play for exec graphs.

Contributor Author

This test creates an exec graph, so there is a race. CI for free-threaded Python seems more likely to trigger it. 9f2c8f2 adds polling, but removing the test would also be defensible.
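The polling pattern can be sketched generically. The helper below is hypothetical (not the code in 9f2c8f2): it bounds the wait and runs a GC pass each iteration, so a destructor that fires asynchronously cannot race an eager assertion.

```python
import gc
import threading
import time
import weakref


def wait_for_collection(ref, timeout=5.0, interval=0.01):
    """Poll a weakref until its referent is collected or the timeout expires.

    An asynchronous destructor (such as the DPC fired when an exec graph is
    destroyed) may release the last reference at an arbitrary later time, so
    an eager assertion races; polling with per-iteration GC does not.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        gc.collect()
        if ref() is None:
            return True
        time.sleep(interval)
    return False


class FakeBuffer:
    """Weakref-able stand-in for the object kept alive by the graph."""


holder = [FakeBuffer()]
ref = weakref.ref(holder[0])

# Simulate an asynchronous release: the reference is dropped a little
# later, on another thread, the way a driver DPC would drop it.
threading.Timer(0.05, holder.clear).start()

freed = wait_for_collection(ref)
```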

Collaborator

@rparolin rparolin left a comment

The tests should validate that the buffer is eventually freed once the graph's refcount is decremented.

Addresses review feedback (PR NVIDIA#2041): the existing test only proved the
graph kept the Buffer alive, not that the user-object machinery actually
releases it once the graph is destroyed. Without the symmetric check, a
working attachment is indistinguishable from a permanent leak.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rwgk
Contributor

rwgk commented May 6, 2026

Below are the Cursor GPT-5.4 Extra High Fast findings. It was thinking far longer than I'd have expected for a PR this size.

I'm not sure which of these are actually actionable:

Re 1. Do we care about stream-captured graphs?
Re 2. Could we simply document that we don't protect against explicit release?
Re 3. This seems OK to me? (I.e. I mean we can ignore this finding, or document the behavior?)

@Andy-Jost


  1. High: Stream-captured graphs still use the unfixed launcher path

    launch() still accepts a GraphBuilder and creates a stack-local ParamHolder in cuda_core/cuda/core/_launcher.pyx:23 and cuda_core/cuda/core/_launcher.pyx:47, but the new ownership transfer exists only in GN_launch() at cuda_core/cuda/core/graph/_graph_node.pyx:623. That means launch(gb, ..., buf) can still capture a kernel node whose Buffer or other Python owner is freed after capture, leaving the graph with a stale raw pointer. This is a real public path, not just a hypothetical one; existing tests exercise captured kernel launches with Buffer args in cuda_core/tests/graph/test_graph_memory_resource.py:171. The new regression tests only cover the explicit GraphDefinition path in cuda_core/tests/graph/test_graph_definition_lifetime.py:497 and cuda_core/tests/graph/test_graph_definition_lifetime.py:533.

  2. High: Retaining the Buffer wrapper does not protect against explicit release

    The PR now keeps ker_args.kernel_args alive at cuda_core/cuda/core/graph/_graph_node.pyx:623, but ParamHolder snapshots a Buffer argument as a raw device pointer value in cuda_core/cuda/core/_kernel_arg_handler.pyx:283 and cuda_core/cuda/core/_kernel_arg_handler.pyx:287. If user code explicitly calls buf.close() or exits a context manager, Buffer_close() resets _h_ptr and can free the allocation immediately at cuda_core/cuda/core/_memory/_buffer.pyx:596. At that point the graph still owns only the stale integer pointer copied during node creation; keeping the Python Buffer object alive does not keep the underlying allocation handle alive.

  3. Medium: The new graph-scoped attachment can pin kernel-argument owners after node deletion

    _attach_user_object() creates a CUDA user object and moves it into the graph at cuda_core/cuda/core/graph/_utils.pyx:41, but it does not return any handle that could later be released per node. GraphNode.destroy() only calls cuGraphDestroyNode() at cuda_core/cuda/core/graph/_graph_node.pyx:159. Because CUDA user objects are retained and released at graph scope (cuGraphRetainUserObject / cuGraphReleaseUserObject), deleting or rewiring a kernel node can now leave large kernel-argument owners, including Buffers, pinned until the entire graph is destroyed. Node mutation is already a supported workflow in cuda_core/tests/graph/test_graph_definition_mutation.py:159.
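Finding 2 above can be reproduced in miniature: retaining the Python wrapper does not retain the allocation once close() runs. This is an illustrative model; the real Buffer_close() details are simplified away.

```python
class FakeAllocation:
    """Stand-in for the underlying device allocation."""

    def __init__(self):
        self.freed = False


class FakeBuffer:
    """Stand-in for a Buffer: close() frees the allocation immediately,
    regardless of who still holds the Python wrapper."""

    def __init__(self):
        self._alloc = FakeAllocation()
        self.handle = id(self._alloc)  # raw pointer snapshotted by the node

    def close(self):
        self._alloc.freed = True  # analogue of Buffer_close() freeing memory
        self._alloc = None


buf = FakeBuffer()
alloc = buf._alloc                # peek at the allocation for the check below
graph_user_objects = [(buf,)]     # the graph retains the wrapper...
node_ptr = buf.handle             # ...while the node keeps only the integer

buf.close()

wrapper_alive = graph_user_objects[0][0] is buf  # wrapper is retained
allocation_freed = alloc.freed                   # but the memory is gone
```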

@Andy-Jost
Contributor Author

Andy-Jost commented May 6, 2026

Thanks @rwgk

  1. High: Stream-captured graphs still use the unfixed launcher path

I'll look into this. It might need to be deferred because AFAIK the stream capture path does not create any user objects.

  2. High: Retaining the Buffer wrapper does not protect against explicit release

We have a huge class of possible errors of this type, unfortunately. A better approach than storing the Buffer would be to store the DevicePtrHandle (a std::shared_ptr owning the buffer). Definitely out of scope for this change.

  3. Medium: The new graph-scoped attachment can pin kernel-argument owners after node deletion

I have to rework the whole user object design for #1330 (step 4) and I plan to address this.
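The DevicePtrHandle idea can be sketched in Python. This is a hypothetical design, an assumption modeled on std::shared_ptr semantics: the allocation is freed with the last owner, so buf.close() merely drops one reference while a graph node keeps the handle alive.

```python
import gc


class DevicePtrHandle:
    """Shared-ownership handle: the allocation is freed only when the last
    holder lets go, analogous to a std::shared_ptr owning the memory."""

    def __init__(self, ptr, state):
        self.ptr = ptr
        self._state = state

    def __del__(self):
        self._state["freed"] = True  # free the allocation with the last owner


class FakeBuffer:
    """Buffer-like wrapper that shares ownership of its handle."""

    def __init__(self, handle):
        self._handle = handle

    def close(self):
        # Drop only this wrapper's reference; other owners (e.g. a graph
        # node) keep the allocation alive.
        self._handle = None


state = {"freed": False}
buf = FakeBuffer(DevicePtrHandle(0xDEAD, state))
node_handle = buf._handle  # the graph node retains the shared handle

buf.close()
gc.collect()
freed_after_close = state["freed"]    # False: the node still owns it

del node_handle
gc.collect()
freed_after_release = state["freed"]  # True: last owner released
```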

@leofang
Member

leofang commented May 6, 2026

High: Stream-captured graphs still use the unfixed launcher path

This is a good catch, and it is not fixed with this PR. (Which is why kernel arg update is so messy, as noted during the team sync today). I am fine with this PR only fixing the explicit graph construction path.

The freeing assertion at the end of test_kernel_args_buffer_lifetime
failed on free-threaded Python (py3.14t) because cuGraphExecDestroy
releases its user-object references via an asynchronous DPC, and free-
threaded CPython's deferred ref counting can need an extra GC pass to
settle. Poll the weakref with a bounded timeout and per-iteration GC
instead of asserting eagerly.

Co-authored-by: Cursor <cursoragent@cursor.com>
