
Incorrect GroupNormalization result of TensorRT 10.16.1.11 when running ONNX GroupNormalization(num_groups=1) on GPU #4756

@ALinrunrun

Description


TensorRT appears to handle ONNX GroupNormalization with num_groups=1 incorrectly.

For GroupNormalization(num_groups=1), the normalization should be computed across all channels and spatial dimensions in each sample, matching PyTorch torch.nn.functional.group_norm(..., num_groups=1) and ONNX Runtime.

However, TensorRT produces an output that matches per-channel instance normalization instead. In other words, TensorRT seems to normalize each channel independently, as if num_groups=C, rather than treating all channels as one group.
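The difference between the two interpretations is easy to see in plain NumPy. This is a minimal sketch (independent of the repro script below) contrasting the expected `num_groups=1` behavior with per-channel instance normalization; the shapes and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4, 2, 2)).astype(np.float32)  # N=1, C=4, H=W=2
eps = 1e-5

# num_groups=1: one mean/var over all channels and spatial positions per sample
group_out = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / np.sqrt(
    x.var(axis=(1, 2, 3), keepdims=True) + eps
)

# num_groups=C (instance norm): one mean/var per channel per sample
inst_out = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(
    x.var(axis=(2, 3), keepdims=True) + eps
)

print(np.max(np.abs(group_out - inst_out)))  # nonzero for generic inputs
```

TensorRT's output matches `inst_out`, while PyTorch and ONNX Runtime match `group_out`.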

The same ONNX model runs correctly in ONNX Runtime and matches PyTorch.

Environment

TensorRT Version: 10.16.1.11

NVIDIA GPU: N/A / not detected by nvidia-smi

NVIDIA Driver Version: N/A / nvidia-smi failed

CUDA Version: N/A / nvcc not found

CUDNN Version: N/A / torch.backends.cudnn.version() returned None

Operating System: Linux 6.17.0-20-generic x86_64, glibc 2.39

Python Version (if applicable): Python 3.11.15

Tensorflow Version (if applicable): N/A

PyTorch Version (if applicable): 2.11.0+cpu

Baremetal or Container (if so, version): Baremetal / non-Docker environment (/proc/1/cgroup: 0::/init.scope)

Additional package versions:

ONNX Version: 1.21.0
ONNX Runtime Version: 1.25.1

Relevant Files

Model link: N/A

The ONNX model is generated inline by the minimal reproducible script below.

Steps To Reproduce

import numpy as np
import onnx
from onnx import helper, TensorProto
import torch
import onnxruntime as ort
# _trt_helper is a local helper module (not shown) wrapping TensorRT engine build/inference
from _trt_helper import build_engine_from_onnx, run_engine

C, H, W = 8, 4, 4
X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, C, H, W])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, C, H, W])

scale = helper.make_tensor("scale", TensorProto.FLOAT, [C], np.ones(C, np.float32))
bias = helper.make_tensor("bias", TensorProto.FLOAT, [C], np.zeros(C, np.float32))

node = helper.make_node(
    "GroupNormalization",
    ["X", "scale", "bias"],
    ["Y"],
    num_groups=1,
    epsilon=1e-5,
)

g = helper.make_graph([node], "g", [X], [Y], initializer=[scale, bias])
m = helper.make_model(g, opset_imports=[helper.make_opsetid("", 21)])
m.ir_version = 10
onnx.checker.check_model(m)
onnx_bytes = m.SerializeToString()

x = np.random.default_rng(0).standard_normal((1, C, H, W)).astype(np.float32)

eng, _ = build_engine_from_onnx(onnx_bytes, fp16=False)
trt_out = run_engine(
    eng,
    {"X": x},
    ["Y"],
    [(1, C, H, W)],
    [np.float32],
)["Y"]

torch_out = torch.nn.functional.group_norm(
    torch.from_numpy(x),
    1,
    weight=torch.ones(C),
    bias=torch.zeros(C),
    eps=1e-5,
).numpy()

ort_out = ort.InferenceSession(
    onnx_bytes,
    providers=["CPUExecutionProvider"],
).run(["Y"], {"X": x})[0]

instance_out = np.zeros_like(x)
for c in range(C):
    instance_out[0, c] = (
        x[0, c] - x[0, c].mean()
    ) / np.sqrt(x[0, c].var() + 1e-5)

print("TRT[0,0,0,:4]:     ", trt_out[0, 0, 0, :4])
print("torch[0,0,0,:4]:   ", torch_out[0, 0, 0, :4])
print("ORT[0,0,0,:4]:     ", ort_out[0, 0, 0, :4])
print("instance[0,0,0,:4]:", instance_out[0, 0, 0, :4])
print("max|TRT - torch|:", float(np.max(np.abs(trt_out - torch_out))))
print("max|TRT - ORT|:  ", float(np.max(np.abs(trt_out - ort_out))))
print("max|TRT - instance|:", float(np.max(np.abs(trt_out - instance_out))))

assert np.max(np.abs(trt_out - torch_out)) > 1e-2
assert np.max(np.abs(trt_out - instance_out)) < 1e-4

Commands or scripts: run the Python script above.

Have you tried the latest release?: Yes, reproduced with TensorRT 10.16.1.11.

Attach the captured .json and .bin files from TensorRT's API Capture tool if you're on an x86_64 Unix system: Not attached. The issue is reproducible from the self-contained Python script above.

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes. ONNX Runtime and PyTorch agree with each other. TensorRT differs from both and instead matches per-channel instance normalization.

Actual output:

TRT[0,0,0,:4]:      [0.4456725  0.15238671 1.031132   0.4219784 ]
torch[0,0,0,:4]:    [ 0.06682756 -0.2042495   0.6079537   0.04492765]
ORT[0,0,0,:4]:      [ 0.06682757 -0.20424952  0.6079538   0.04492767]
instance[0,0,0,:4]: [0.4456726  0.15238675 1.0311322  0.4219785 ]
max|TRT - torch|: 0.6626157760620117
max|TRT - ORT|:   0.6626157760620117
max|TRT - instance|: 2.384185791015625e-07

This suggests TensorRT is treating GroupNormalization(num_groups=1) like instance normalization instead of normalizing over the single group containing all channels.

Metadata

Labels: Module:ONNX (Issues relating to ONNX usage and import)