CPU overhead optimizations for te autocast #2957
vthumbe1503 wants to merge 3 commits into NVIDIA:main
Conversation
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Greptile Summary
This PR reduces CPU overhead on Grace/GB200 systems by caching recipe string representations and replacing the generator-based autocast context manager with a class-based one.
Confidence Score: 5/5
The changes are straightforward performance optimizations with no functional regressions for the normal usage pattern. Both changes (repr caching and class-based context manager) are behaviorally equivalent to the code they replace for all standard single-use patterns. No files require special attention beyond the minor issues flagged below.
Sequence Diagram
sequenceDiagram
participant User
participant autocast
participant FP8GlobalStateManager
participant Recipe
User->>autocast: __init__(enabled, recipe, ...)
note over autocast: stores args, _fp8_state=None
User->>autocast: __enter__()
autocast->>Recipe: __repr__() [if needed for key]
Recipe-->>autocast: cached _cached_repr
autocast->>FP8GlobalStateManager: get_autocast_state()
FP8GlobalStateManager-->>autocast: fp8_state snapshot
autocast->>FP8GlobalStateManager: autocast_enter(enabled, recipe, ...)
autocast-->>User: self
note over User: training step
User->>autocast: __exit__(exc_type, exc_val, exc_tb)
autocast->>FP8GlobalStateManager: set_autocast_state(fp8_state)
autocast->>FP8GlobalStateManager: autocast_exit(enabled, _graph)
Reviews (2): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
    recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
    if recipe_repr is None:
        recipe_repr = str(recipe)
    group_id = id(group) if group is not None else 0
    return f"{recipe_repr}|{group_id}"
Key format change could produce ambiguous keys
The new key format f"{recipe_repr}|{group_id}" uses | as a separator without escaping. If a future recipe's __repr__ ever emits a | character, two distinct (recipe, group) pairs could map to the same string. The old str(tuple) format was unambiguous because it quoted the recipe repr. A safer pattern uses a separator that cannot appear in repr output, or encodes the parts deterministically.
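One way to encode the parts unambiguously (a minimal sketch with a hypothetical helper name, not code from the PR) is to let repr() quote and escape the recipe string, restoring the property the old str(tuple) key had:

    def _make_autocast_key(recipe=None, group=None):
        # repr() of a str wraps it in quotes and escapes special characters,
        # so a literal "|" inside the recipe repr can never be confused with
        # the "|" used here as the field separator.
        group_id = id(group) if group is not None else 0
        return f"{str(recipe)!r}|{group_id}"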
Suggested change:
-    recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
-    if recipe_repr is None:
-        recipe_repr = str(recipe)
-    group_id = id(group) if group is not None else 0
-    return f"{recipe_repr}|{group_id}"
+    group_id = id(group) if group is not None else None
+    return f"recipe=({str(recipe)}),group={group_id}"
    def __enter__(self) -> "autocast":
        if self._enabled:
            check_recipe_support(self._recipe)
        # Save current state so we always restore it on exit.
        self._fp8_state = FP8GlobalStateManager.get_autocast_state()
        FP8GlobalStateManager.autocast_enter(
            enabled=self._enabled,
            calibrating=self._calibrating,
            fp8_recipe=self._recipe,
            fp8_group=self._amax_reduction_group,
            _graph=self._graph,
        )
        return self

    FP8GlobalStateManager.autocast_enter(
        enabled=enabled,
        calibrating=calibrating,
        fp8_recipe=recipe,
        fp8_group=amax_reduction_group,
        _graph=_graph,
    )
    try:
        yield
    finally:
        FP8GlobalStateManager.set_autocast_state(fp8_state)
        FP8GlobalStateManager.autocast_exit(enabled, _graph=_graph)

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        FP8GlobalStateManager.set_autocast_state(self._fp8_state)
        FP8GlobalStateManager.autocast_exit(self._enabled, _graph=self._graph)
        # Do not suppress exceptions.
        return None
Nested reuse of the same instance silently corrupts state
The old generator-based implementation raised RuntimeError: generator already executing if you tried to enter the same context manager object twice concurrently. The new class-based implementation silently accepts nested reuse, but the second __enter__ call overwrites self._fp8_state with the state captured inside the first context, so the outer __exit__ restores the wrong state permanently.
    ctx = autocast(enabled=True, recipe=recipe)
    with ctx:          # _fp8_state = pre_context_state
        with ctx:      # _fp8_state = state_inside_first_block ← overwrites!
            pass       # inner __exit__: restores state_inside_first_block
        # _fp8_state is now state_inside_first_block
    # outer __exit__: restores state_inside_first_block, NOT pre_context_state ← bug

Adding a guard in __enter__ would preserve the old safety behavior:
    def __enter__(self) -> "autocast":
        if self._fp8_state is not None:
            raise RuntimeError("autocast context manager cannot be entered more than once concurrently")
        ...
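For this guard to still allow sequential (non-nested) reuse of the same instance, __exit__ would also have to clear the saved state after restoring it; a hypothetical continuation of the same sketch:

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        FP8GlobalStateManager.set_autocast_state(self._fp8_state)
        FP8GlobalStateManager.autocast_exit(self._enabled, _graph=self._graph)
        # Clear the marker so a later, non-concurrent `with` on this
        # instance passes the guard in __enter__.
        self._fp8_state = None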
    def __repr__(self) -> str:
        cached = self.__dict__.get("_cached_repr")
        if cached is not None:
            return cached
        result = f"MMParams(use_split_accumulator={self.use_split_accumulator})"
        object.__setattr__(self, "_cached_repr", result)
        return result
_cached_repr stored outside declared dataclass fields
MMParams is @dataclass(frozen=True). Storing _cached_repr via object.__setattr__ bypasses the frozen guard correctly in CPython, but _cached_repr is not a declared dataclass field — it won't appear in dataclasses.fields(), dataclasses.asdict(), dataclasses.astuple(), or copy.replace(). If downstream code serializes or copies an MMParams instance, the cached repr would be lost silently. Documenting this with a comment or declaring it as field(init=False, repr=False, compare=False) would make the intent clearer. The same applies to QParams.
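A tiny standalone demonstration of the pitfall (hypothetical, using a stripped-down MMParams): attributes set via object.__setattr__ are invisible to the dataclasses machinery, so copies silently drop the cache:

    import dataclasses

    @dataclasses.dataclass(frozen=True)
    class MMParams:
        use_split_accumulator: bool = True

    p = MMParams()
    object.__setattr__(p, "_cached_repr", "MMParams(use_split_accumulator=True)")
    print(dataclasses.asdict(p))        # {'use_split_accumulator': True} -- no cache
    q = dataclasses.replace(p)          # fresh instance built from declared fields only
    print(hasattr(q, "_cached_repr"))   # False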
I see that this is why we're doing the funny accesses with __dict__. I agree that bypassing frozen=True is iffy, so I wonder if we could set _cached_repr in __post_init__? If the class is frozen, its repr must also be frozen and I don't see a benefit in lazy evaluation.
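A minimal sketch of that __post_init__ approach (illustrative only, on a stripped-down MMParams): declaring the cache as a real field keeps it visible to the dataclasses machinery, and object.__setattr__ is the documented way to assign inside __post_init__ of a frozen dataclass:

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class MMParams:
        use_split_accumulator: bool = True
        _cached_repr: str = field(init=False, repr=False, compare=False)

        def __post_init__(self) -> None:
            object.__setattr__(
                self,
                "_cached_repr",
                f"MMParams(use_split_accumulator={self.use_split_accumulator})",
            )

        def __repr__(self) -> str:
            return self._cached_repr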
    # changes. This makes repeated ``str(recipe)`` calls (e.g. on the hot
    # path in ``FP8GlobalStateManager.get_unique_autocast_key``) essentially
    # free after the first call.
    _cached_repr: Optional[str] = None
Three problems:
- _cached_repr is being set as a class attr, not an instance attr.
- Accessing _cached_repr via __dict__ is non-standard and bug-prone.
- Splitting the cache logic between the base class and child classes results in code duplication and more risk of bugs, especially if it involves non-standard __dict__ accesses.
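A quick standalone illustration of the first two problems (hypothetical names, independent of the PR): a class-level default never lives in the instance __dict__, which is why the cache lookup needs the non-standard __dict__.get and silently misses until an assignment shadows the class attribute:

    class Recipe:
        _cached_repr = None  # class attribute, shared by every instance

    r = Recipe()
    print(r._cached_repr)                   # None, found on the class
    print(r.__dict__.get("_cached_repr"))   # None -- nothing instance-level yet
    r._cached_repr = "DelayedScaling(...)"  # assignment creates an instance attr
    print(r.__dict__.get("_cached_repr"))   # 'DelayedScaling(...)'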
What if we concentrated the caching logic in the base class:
    class Recipe:
        def __init__(self) -> None:
            self._cached_repr: Optional[str] = None

        @abc.abstractmethod
        def _make_repr(self) -> str:
            ...

        def __repr__(self) -> str:
            if self._cached_repr is None:
                self._cached_repr = self._make_repr()
            return self._cached_repr

    ...

    class DelayedScaling(Recipe):
        def _make_repr(self) -> str:
            return f"..."

    # directly getting the cached repr is about 40 ns faster than str(recipe)
    # on grace systems.
This is good to mention in the PR description, but not that useful in the code itself. Profiling becomes outdated once we move on to the next architecture.
    # Class-based context manager (instead of ``@contextmanager`` from contextlib)
    # to avoid the ~0.5us / invocation overhead of contextlib's generator-driven
    # ``GeneratorContextManager``. ``__slots__`` further avoids per-instance
    # dict allocation.
Why are we mentioning the context manager here? It makes sense for this PR, but once the code is merged it will be completely random. This comment should explain what we are doing with __slots__, and we should explain the custom context manager logic in __enter__ and __exit__.
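The gap the comment describes can be checked with a standalone micro-benchmark (a sketch, not part of the PR; absolute numbers vary by machine and Python version):

    import timeit
    from contextlib import contextmanager

    @contextmanager
    def gen_cm():
        yield

    class ClassCM:
        __slots__ = ()

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            return None

    def use_gen():
        with gen_cm():
            pass

    def use_cls():
        with ClassCM():
            pass

    n = 1_000_000
    print("generator CM per call:", timeit.timeit(use_gen, number=n) / n)
    print("class CM per call:    ", timeit.timeit(use_cls, number=n) / n)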
    # Do not suppress exceptions.
    return None
Nit: The function already returns None implicitly, and the comment is trivially true (any Python code outside a try statement does not suppress exceptions).
Suggested change (delete):
-    # Do not suppress exceptions.
-    return None
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
for more information, see https://pre-commit.ci
Description
te autocast has quite a bit of CPU overhead on Grace systems.
Here are the results on GB200 after the optimizations:
Without Optimizations

Optimization 1 --> Cache the recipe string representation used to build the unique autocast key. Also, directly accessing the cached representation of the recipe is ~40 ns faster than going through str(recipe).
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: