Add persistent program cache for Program.compile#1912
Merged
leofang merged 34 commits intoNVIDIA:mainfrom May 6, 2026
Merged
Conversation
de57bd8 to
ac38a68
Compare
This comment has been minimized.
This comment has been minimized.
f1ae40e to
b27ed2c
Compare
The reader test only checked that every read returned non-None; a half-written file with non-empty bytes would pass. Carry the seeded payload into the worker, count exact-byte mismatches, and require zero. The eviction-race test wrote a final uncontested cache[key] = payload + b"final" after the churner exited, so the post-race endswith assertion would pass even with a broken stat-guard. Drop the final write and assert the entry survives carrying a rewriter payload prefix -- if the stat-guarded eviction path is broken, the in-race write is the one that vanishes.
… overwrite test InMemoryProgramCache claims thread-safety via the RLock that wraps every method but had no concurrent-thread coverage. Add a stress test with 4 writers + 4 readers x 200 ops against a size-capped cache that verifies no exceptions, no deadlocks (RLock reentrance through __setitem__ -> _evict_to_caps -> popitem), and that internal accounting (_total_bytes, len(_entries), len(cache)) stays consistent under contention. FileStreamProgramCache had no overwrite test analogous to test_inmemory_cache_overwrite_replaces_value_and_updates_size. Add one that writes a key twice and asserts the second value reads back, len stays at 1, and exactly one entry file lives on disk -- so a leaked entry from a botched os.replace would surface here.
max_size_bytes=0 used to slip past the >=0 guard but turned the cache into a black hole: every write was immediately evicted on its own size-cap pass. There is no legitimate use for that, so tighten the guard to >0 (or None for unbounded) and update both backends and the matching tests.
The (st_ino, st_size, st_mtime_ns) triple was open-coded in _touch_atime (fd-based and path-based fallbacks), _prune_if_stat_unchanged, and _enforce_size_cap. Centralise the fingerprint as _stat_key(st) so all four readers compare the same fields and the invariant has one place to read.
Replace, stat-and-read, and unlink each carried their own copy of the _REPLACE_RETRY_DELAYS / sleep / try-op / PermissionError loop. Centralise the loop as _with_sharing_retry(op, on_exhausted=...) and let each caller plug in its own success-on-success and exhausted-budget behaviour. Net behaviour is unchanged (exhaustion semantics for each public helper are preserved via the on_exhausted callback).
FileStreamProgramCache shards cache files into entries/<digest[:2]>/<digest[2:]>, so the overwrite test's iterdir filter on entries/ saw only the digest-prefix subdir (no is_file() match) and reported 0 entries. Switch to rglob so the assertion counts actual entry files. CI from the previous push caught this.
…iendly
`{code!r}` doubles every backslash on Windows, so the error string holds
`'C:\\Users\\...\\file'` while `str(src)` only has single
backslashes. Naive `str(path) in str(exc)` checks (and the
`test_filestream_cache_path_backed_object_code_missing_file_message`
assertion) failed on win-64 as a result. Drop the `!r` so the path
appears verbatim; the message still quotes it via the surrounding
sentence punctuation.
Avoid the per-write directory walk in `_enforce_size_cap` by maintaining a running byte total. The tracker is seeded once from `_compute_total_size` at open time, updated on `__setitem__` (net delta from the old entry's size), `__delitem__` (subtract the unlinked file's size), and `clear` (re-derive from the post-clear state). When a write doesn't push the running total above `max_size_bytes`, eviction stays a no-op -- writes become O(1) instead of O(n) in the cache size. Cross-process drift (other writers/deleters working on the same root) self-corrects: any time `_enforce_size_cap` actually runs its scan, it reseeds `_tracked_size_bytes` from the observed disk total, so overestimates trigger one extra scan and then settle. Skipped entirely when `max_size_bytes is None` (no cap, no need for a tracker). Mutations are guarded by `_size_lock` so multi-threaded writers in the same process don't interleave the read-modify-write on the int. Also switch `_iter_entry_paths` from `Path.iterdir` to `os.scandir`: the dirent type cache lets `is_dir`/`is_file` answer without a separate `stat` syscall on filesystems that report it (ext4, NTFS, ...). Behaviourally equivalent (the cache layout never creates symlinks, and we pass `follow_symlinks=False` to match `pathlib`'s previous no-symlink-confusion default for our paths). Tests cover three guarantees: (1) writes that stay under the cap don't call `_enforce_size_cap`, (2) writes that cross the cap do call it, and (3) external deletion behind the cache's back is reconciled by the next eviction pass.
0364eeb to
08f9f56
Compare
The test parsed `_program.pyx` to extract `SUPPORTED_TARGETS` and compare against `_SUPPORTED_TARGETS_BY_CODE_TYPE` in the cache. Source parsing for a duplication check is not worth the maintenance cost -- a reviewer eyeballing both definitions catches drift just as well, and the test was already broken once by upstream's StrEnum migration.
08f9f56 to
4aaca63
Compare
`os.scandir` returns an iterator that holds an OS directory handle (POSIX dirent fd, Windows FindFirstFileW); the context manager closes that handle deterministically. Using `with os.scandir(...) as outer` is the conventional spelling -- splitting the call from the `with` to land the FileNotFoundError catch on the call alone added one binding for no behavioural difference. Wrap each `with` in its own `try/except FileNotFoundError` instead. The inner loop is also a plain filter-and-yield over the DirEntry iterator, which is exactly what `yield from <generator-expr>` expresses.
The same scandir-then-stat-for-size loop appeared in three places (`_compute_total_size`, `_sweep_stale_tmp_files`, `_enforce_size_cap`). `_iter_tmp_entries` covers the scandir + is_file filter + context-managed cleanup that all three need. `_sum_tmp_sizes` covers the size accumulation that both `_compute_total_size` and `_enforce_size_cap` need. Net result: each callsite now reads as one obvious line instead of seven, and a future change to the temp-walk discipline (e.g. switching to a different scan strategy) lands in one place.
leofang
reviewed
May 6, 2026
Member
leofang
left a comment
There was a problem hiding this comment.
Re-review after the latest round of updates (33 commits). All 21 previous review comments have been addressed — nice work.
One new finding on the incremental size tracker (_tracked_size_bytes) that can affect correctness: a __delitem__ racing with _enforce_size_cap's re-seed can make the tracker go permanently negative, which disables all future eviction. Fix is a one-line max(0, ...) clamp. See inline comment for details.
Everything else looks solid. The per-instance _decide_driver() caching is correct, the key derivation is collision-free, the os.scandir migration is clean, and the test coverage has improved substantially.
`__delitem__` could walk `_tracked_size_bytes` negative under a race with `_enforce_size_cap`'s reseed: if the eviction scan runs AFTER this delete unlinks (so its reseed value excludes the deleted entry) but BEFORE this delete's subtract, the subtract undercounts by `size`. Repeated under contention, the tracker crosses zero -- and once negative, the `tracker > cap` check that gates eviction never fires again, so the cache grows without bound and there is no self-healing path (the only reseed point is the function that no longer runs). Clamp `tracker = max(0, tracker - size)` so the tracker can't enter the permanently-broken state. Worst case after the race is undercounting reality, which the next eviction's reseed corrects; that's the same self-healing path the existing tests already exercise. Reported by leofang in PR review.
c347f3c to
19d7cf7
Compare
leofang
approved these changes
May 6, 2026
Member
leofang
left a comment
There was a problem hiding this comment.
Thanks a lot, Phillip. Merging.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #176.
Closes #177.
Closes #178.
Closes #179.
Adds a persistent on-disk cache for
cuda.core.Program.compileoutputs. The high-level integration is one keyword onProgram.compile:A second invocation with the same inputs short-circuits the entire NVRTC compile —
cache.get(key)(one stat + one read) and anObjectCode._initfrom the bytes. NoProgram_compileis invoked. This is the fast path the cache exists to provide:Public API
Program.compile(target_type, *, cache=...)— convenience wrapper. Derives the key, returns a freshObjectCodeon hit, stores the compile output on miss.cuda.core.utils.ProgramCacheResource— abstract bytes-in / bytes-out interface for custom backends. Providesget,update(Mapping or pairs),clear, and the mapping mutators (__getitem__/__setitem__/__delitem__/__len__).__contains__is intentionally omitted:cache.get(key)is the recommended idiom because the two-callif key in cache: cache[key]pattern is racy across processes.cuda.core.utils.InMemoryProgramCache— single-process LRU onOrderedDict,threading.RLock, size-only cap. For "compile once, look up many" workflows that don't need persistence.cuda.core.utils.FileStreamProgramCache— directory of atomic per-entry files. Safe across processes viaos.replace+ Windows sharing-violation retries onos.replace/ read /unlink.cuda.core.utils.make_program_cache_key— escape hatch when the compile inputs require anextra_digest(include_path,pre_include,pch,use_pch,pch_dir, NVVMuse_libdevice=True, NVRTCoptions.namewith a directory component).Program.compile(cache=...)rejects those compiles with aValueErrorpointing here.On-disk format
Each entry is the raw compiled binary verbatim — cubin / PTX / LTO-IR — with no pickle, JSON, length prefix, or framing of any kind. Cache files are directly consumable by external NVIDIA tools (
cuobjdump,nvdisasm,cuda-gdb).ObjectCode.symbol_mappingfromname_expressionsis not preserved across a cache round-trip; the wrapper rejectsProgram.compile(name_expressions=..., cache=...)outright so the first-call-works/second-call-breaks footgun can't surface. Callers that needget_kernel(name_expression)should compile withoutcache=.FileStreamProgramCache
tmp/, fsync,os.replaceintoentries/<2char>/<hash>. Concurrent readers never observe partial writes. Windowsos.replaceretries onERROR_ACCESS_DENIED/ERROR_SHARING_VIOLATION/ERROR_LOCK_VIOLATION(winerrors 5/32/33) within a bounded backoff (~185 ms); after the budget, the write is dropped and the next call recompiles. The same retry covers reads andpath.unlinkso eviction doesn't crash the writer that triggered it on win-64._is_windows_sharing_violation(exc)filtersEACCESonly whenwinerroris absent — non-sharing winerrors are real config errors and propagate. Off-WindowsPermissionErroralways propagates.cache[key] = value(andcache.update({key: value, ...})) accept raw bytes, bytearray, memoryview, or anyObjectCode(path-backed too — the file is read at write time so the cached entry is the binary content, not a path that could move). Reads return the same bytes that went in.max_size_bytesis the only knob — no element-count cap.Nonemeans unbounded.os.utime(fd-based on Linux/macOS viaos.supports_fd, path-based on Windows) to bumpst_atimeregardless of mount options orNtfsDisableLastAccessUpdate. Eviction sorts by oldestst_atimefirst. The atime touch is stat-guarded so a racing rewriter's freshly-replaced file never has its mtime rolled back.clear(),_enforce_size_cap(), and the atime touch all snapshot(ino, size, mtime_ns)per entry and refuse to unlink / overwrite stamps if a writer replaced the file mid-operation.make_program_cache_key): a backend-strategy pattern with one class percode_type(_NvrtcBackend/_LinkerBackend/_NvvmBackend). Each owns its own validate / encode_code / option_fingerprint / encode_name_expressions / hash_version_probe / hash_extra_payload. The orchestrator validatescode_type/target_type, dispatches to the right backend, and assembles the digest in fixed order. Adding a new backend is one new class, not a five-place edit.options.namewith a directory component: rejected withoutextra_digestbecause NVRTC resolves quoted#includedirectives relative to that directory — neighbour-header changes wouldn't invalidate the cache otherwise.RuntimeWarningthe uncached path emits — loadability depends on the driver, not on whether the bytes were freshly compiled.pathis omitted, resolves viaplatformdirs.user_cache_path("cuda-python", appauthor=False, opinion=False) / "program-cache":\$XDG_CACHE_HOME/cuda-python/program-cache(default~/.cache/cuda-python/program-cache)~/Library/Caches/cuda-python/program-cache%LOCALAPPDATA%\\cuda-python\\program-cachetmp/self-heal: if something deletestmp/after the cache is opened, the next write recreates it rather than crashing withFileNotFoundError.Test plan
tests/test_program_cache.py— abstract-class contract,updateaccepts mapping or pairs, transparent input-form equivalence (bytes / bytearray / memoryview / bytes-backedObjectCode/ path-backedObjectCodeall round-trip to the same on-disk bytes),make_program_cache_keysemantics (deterministic, supported-target matrix mirrorsProgram.compile, backend probe failures fail closed but stable, env-version changes don't perturb the key on the wrong backends, options-fingerprint canonicalization for the linker path, side-effect / external-content / NVRTCoptions.name-dir-component guards, schema version mixing), filestream CRUD, atomic-write race coverage, stat-guarded prune / atime-touch / clear / size-cap, atime LRU promotes recently-read, default-dir usesplatformdirs,_is_windows_sharing_violationpredicate's truth table including the regression case (non-sharing winerror plus EACCES propagates),tmp/recreation after external wipe.tests/test_program_cache_multiprocess.py— concurrent writers same key, distinct keys, reader-vs-writer torn-file safety, size-cap eviction race (rewriter vs. churner) under stat-guarded eviction.tests/test_program_compile_cache.py—Program.compile(cache=...)miss/hit/error paths against a recording stub,name_expressionsrejection,extra_digest-required / side-effect / NVRTCoptions.name-dir-component rejection, PTX loadability warning on cache hit (positive + negative), real-NVRTC end-to-end roundtrip across reopen.