Summary
GraphDefinition.launch() does not extend the lifetime of Python kernel-argument objects to the lifetime of the graph. Inside GN_launch, a ParamHolder is built from the call's args tuple and holds the only strong references to those objects; when GN_launch returns, the ParamHolder is destroyed and its references drop.
The CUDA driver records the raw pointer values into the kernel node, but it does not (and cannot) retain the Python objects those pointers refer to. So if a kernel argument is reachable only through the .launch() call - for example, a Buffer passed inline - it can be garbage-collected before the graph runs. The graph then holds a stale device pointer, and instantiating/launching it produces undefined behavior: typically memory corruption or a crash.
KernelHandle and EventHandle are already retained via _attach_user_object for exactly this reason. Kernel arguments need the same treatment.
Reproducer
def test_kernel_args_buffer_not_kept_alive():
"""Buffer passed as a kernel arg is not kept alive by the graph.
The ParamHolder built in GN_launch holds the only references to the
kernel args and is destroyed when GN_launch returns. With no other
reference to the Buffer, its device memory is freed and the graph
is left with a stale pointer.
"""
import ctypes
import gc
import weakref
from cuda.bindings import driver as drv
from cuda.core import Device, DeviceMemoryResource, LaunchConfig, Program, ProgramOptions
from cuda.core.graph import GraphDefinition
dev = Device()
dev.set_current()
stream = dev.create_stream()
mr = DeviceMemoryResource(dev)
arch = "".join(f"{i}" for i in dev.compute_capability)
prog = Program(
"__global__ void add_one(int *a) { *a += 1; }",
code_type="c++",
options=ProgramOptions(std="c++17", arch=f"sm_{arch}"),
)
add_one = prog.compile("cubin", name_expressions=("add_one",)).get_kernel("add_one")
config = LaunchConfig(grid=1, block=1)
buf = mr.allocate(ctypes.sizeof(ctypes.c_int), stream=stream)
stream.sync()
buf_weak = weakref.ref(buf)
drv.cuMemcpyHtoD(int(buf.handle), (ctypes.c_int * 1)(0), ctypes.sizeof(ctypes.c_int))
g = GraphDefinition()
g.launch(config, add_one, buf)
del buf
gc.collect()
assert buf_weak() is None # nothing in the graph kept the Buffer alive
g.instantiate().launch(stream)
stream.sync() # crash or memory corruption
Proposed fix
In GN_launch, attach the ParamHolder (or just its kernel_args tuple) to the graph as a CUDA user object, mirroring the existing handling of KernelHandle. That ties the Python-side argument lifetimes to the graph itself.
Summary
GraphDefinition.launch()does not extend the lifetime of Python kernel-argument objects to the lifetime of the graph. InsideGN_launch, aParamHolderis built from the call'sargstuple and holds the only strong references to those objects; whenGN_launchreturns, theParamHolderis destroyed and its references drop.The CUDA driver records the raw pointer values into the kernel node, but it does not (and cannot) retain the Python objects those pointers refer to. So if a kernel argument is reachable only through the
.launch()call - for example, aBufferpassed inline - it can be garbage-collected before the graph runs. The graph then holds a stale device pointer, and instantiating/launching it produces undefined behavior: typically memory corruption or a crash.KernelHandleandEventHandleare already retained via_attach_user_objectfor exactly this reason. Kernel arguments need the same treatment.Reproducer
Proposed fix
In
GN_launch, attach theParamHolder(or just itskernel_argstuple) to the graph as a CUDA user object, mirroring the existing handling ofKernelHandle. That ties the Python-side argument lifetimes to the graph itself.