Skip to content

Graph kernel nodes don't keep kernel argument objects alive #2039

@Andy-Jost

Description

@Andy-Jost

Summary

GraphDefinition.launch() does not extend the lifetime of Python kernel-argument objects to the lifetime of the graph. Inside GN_launch, a ParamHolder is built from the call's args tuple and holds the only strong references to those objects; when GN_launch returns, the ParamHolder is destroyed and its references drop.

The CUDA driver records the raw pointer values into the kernel node, but it does not (and cannot) retain the Python objects those pointers refer to. So if a kernel argument is reachable only through the .launch() call - for example, a Buffer passed inline - it can be garbage-collected before the graph runs. The graph then holds a stale device pointer, and instantiating/launching it produces undefined behavior: typically memory corruption or a crash.

KernelHandle and EventHandle are already retained via _attach_user_object for exactly this reason. Kernel arguments need the same treatment.

Reproducer

def test_kernel_args_buffer_not_kept_alive():
    """Buffer passed as a kernel arg is not kept alive by the graph.

    The ParamHolder built in GN_launch holds the only references to the
    kernel args and is destroyed when GN_launch returns. With no other
    reference to the Buffer, its device memory is freed and the graph
    is left with a stale pointer.
    """
    import ctypes
    import gc
    import weakref

    from cuda.bindings import driver as drv
    from cuda.core import Device, DeviceMemoryResource, LaunchConfig, Program, ProgramOptions
    from cuda.core.graph import GraphDefinition

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()
    mr = DeviceMemoryResource(dev)

    arch = "".join(f"{i}" for i in dev.compute_capability)
    prog = Program(
        "__global__ void add_one(int *a) { *a += 1; }",
        code_type="c++",
        options=ProgramOptions(std="c++17", arch=f"sm_{arch}"),
    )
    add_one = prog.compile("cubin", name_expressions=("add_one",)).get_kernel("add_one")
    config = LaunchConfig(grid=1, block=1)

    buf = mr.allocate(ctypes.sizeof(ctypes.c_int), stream=stream)
    stream.sync()
    buf_weak = weakref.ref(buf)
    drv.cuMemcpyHtoD(int(buf.handle), (ctypes.c_int * 1)(0), ctypes.sizeof(ctypes.c_int))

    g = GraphDefinition()
    g.launch(config, add_one, buf)

    del buf
    gc.collect()
    assert buf_weak() is None  # nothing in the graph kept the Buffer alive

    g.instantiate().launch(stream)
    stream.sync()  # crash or memory corruption

Proposed fix

In GN_launch, attach the ParamHolder (or just its kernel_args tuple) to the graph as a CUDA user object, mirroring the existing handling of KernelHandle. That ties the Python-side argument lifetimes to the graph itself.

Metadata

Metadata

Assignees

Labels

P0High priority - Must do!bugSomething isn't workingcuda.coreEverything related to the cuda.core module

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions