
Helm chart pins ClickHouse 25.6.1, which has a memory-tracker overflow that breaks the dashboard until pod restart #3520

@brentshulman-silkline

Description


Hey guys... this one's a doozy, I'm sorry. We're trying to push ourselves to use AI more (as this issue's write-up clearly shows), and frankly I never would have found or understood this one otherwise. Erring on the side of too much info rather than too little.

Summary

The official Helm chart (oci://ghcr.io/triggerdotdev/charts/trigger) pins the Bitnami ClickHouse subchart to clickhouse-9.3.7, which in turn pins bitnami/clickhouse:25.6.1-debian-12-r0. Under sustained ingest, ClickHouse 25.6.1 hits a memory-tracker accounting bug that causes the global memory counter to overflow to ~7 EiB (just under 2^63 bytes), at which point every query — reads and writes — is rejected by OvercommitTracker until the pod is restarted. The trigger.dev webapp dashboard surfaces this as "Unable to load your task runs", and event/run telemetry stops being persisted.

The condition is not self-clearing. The host has plenty of free memory (RSS ~2 GiB out of a 21.6 GiB limit), but ClickHouse's internal accounting is wedged. Only an in-process restart resets it.

Environment

  • Trigger.dev Helm chart: 4.0.5 (also reproduced on 4.4.5 since both pin the same subchart — see below)
  • App image: ghcr.io/triggerdotdev/trigger.dev:v4.4.4
  • Kubernetes: EKS, single-shard ClickHouse statefulset (trigger-clickhouse-shard0-0), ~20 GiB PVC
  • ClickHouse resources: requests 13.5Gi / limits 24Gi (chart values block)
  • Workload: typical Trigger.dev v4 ingest — task runs + trace events flowing through task_runs_v2, task_events_v2, and the raw_task_runs_payload_v1 staging table

Reproduction

This was a production incident, not an isolated synthetic repro, but the trigger seems to be sustained write pressure with concurrent background merges on raw_task_runs_payload_v1. Once the merge tasks start failing with MEMORY_LIMIT_EXCEEDED, retries pile up and never recover.

What we observed

ClickHouse pod logs (representative, repeating thousands of times per minute):

Code: 241. DB::Exception: (total) memory limit exceeded:
  would use 7.00 EiB (attempt to allocate chunk of 4.00 MiB bytes),
  current RSS: 2.12 GiB, maximum: 21.60 GiB.
  OvercommitTracker decision: Query was selected to stop by OvercommitTracker.
  (MEMORY_LIMIT_EXCEEDED)
... while reading from part .../raw_task_runs_payload_v1/...
... in query: INSERT INTO trigger_dev.task_runs_v2 ...
... in query: INSERT INTO trigger_dev.task_events_v2 ...

The 7.00 EiB figure is the giveaway — that's just under 2^63 bytes (the signed 64-bit limit), i.e. a signed-integer overflow in ClickHouse's global memory tracker. The actual RSS is ~2 GiB.

Webapp logs (during the event):

EventRepo.DynamicFlushScheduler  Error attempting to flush batch
  consecutiveFailures: 19438
  table: trigger_dev.task_events_v2
  error: InsertError: (total) memory limit exceeded ... 7.00 EiB ...

The webapp accumulated ~1.75M backlogged events that had to drain after we restarted ClickHouse. The UI showed "Unable to load your task runs" because the dashboard's reads against task_runs_v2 were rejected by the same tracker.

Root cause (best assessment)

This is a known class of ClickHouse memory-tracker accounting bug: a free() is double-counted (or an alloc() is missed) in one of the hot paths (background merges, async inserts, materialized views), the global atomic counter goes negative, subsequent accounting arithmetic interprets it as an absurdly large value (the observed ~7 EiB), and OvercommitTracker rejects every subsequent allocation. There are multiple upstream ClickHouse commits in 25.7+ touching memory-tracker accuracy and overflow handling.

Mitigation (workaround)

kubectl rollout restart statefulset/trigger-clickhouse-shard0 -n trigger

Recovery is immediate — error rate goes from ~11k/min to ~0 within seconds of the pod coming back up. Webapp consecutiveFailures drops from ~47k to single digits as soon as ClickHouse is reachable again.
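For diagnosis, the wedged counter should also be visible directly in ClickHouse's system tables. A quick check might look like the following (pod and namespace names match our deployment above; `MemoryTracking` is the system.metrics name for the global tracker):

```shell
# Compare ClickHouse's internal memory accounting against actual RSS.
# In the wedged state the tracked value is absurd (~EiB) while RSS is normal.
kubectl exec -n trigger trigger-clickhouse-shard0-0 -- \
  clickhouse-client -q "
    SELECT metric, formatReadableSize(value)
    FROM system.metrics
    WHERE metric = 'MemoryTracking'"
```

Note that in the fully wedged state even this read may be rejected with MEMORY_LIMIT_EXCEEDED, which is itself confirmation — restart the statefulset as above.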

Suggested resolution (chart-side)

The trigger Helm chart's Bitnami CH subchart pin (charts/clickhouse-9.3.7 → bitnami/clickhouse:25.6.1-debian-12-r0) hasn't moved since trigger@4.0.5 and is still the same in trigger@4.4.5. Bumping the Bitnami subchart pin in hosting/k8s/helm/Chart.yaml to a current version (the latest Bitnami CH chart on main ships 25.7.5-debian-12-r0) would pull in upstream tracker fixes for everyone running self-hosted, without each operator having to override clickhouse.image.tag themselves.
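Concretely, the bump would be a one-line change to the subchart pin. Assuming the dependency stanza follows the usual Helm layout (field values here are illustrative; the actual Chart.yaml may differ), it would look roughly like:

```yaml
# hosting/k8s/helm/Chart.yaml (sketch; check the actual file layout)
dependencies:
  - name: clickhouse
    version: "9.3.7"   # current pin -> bitnami/clickhouse:25.6.1-debian-12-r0
                       # bump to the latest Bitnami chart shipping 25.7.x
    repository: oci://registry-1.docker.io/bitnamicharts
```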

Operators currently on trigger@4.0.5 through trigger@4.4.5 are exposed to this regardless of chart version, since the subchart pin is unchanged.

Suggested resolution (operational hardening, optional)

A few small additions would make this much less painful even if the underlying bug isn't fully fixed:

  1. A liveness probe that exercises the query path (e.g. clickhouse-client -q "SELECT 1") on the bundled CH StatefulSet. The current TCP/HTTP probe stays green when the pod is wedged — a query-based probe would let Kubernetes auto-restart the pod within a minute. Today an operator has to notice the user-visible failure first.
  2. Webapp circuit breaker / backlog cap on EventRepo.DynamicFlushScheduler. When consecutiveFailures crosses some threshold, sample/drop trace events instead of accumulating millions of items in memory. Losing 5 minutes of partial trace data is preferable to a 1.75M-item backlog that takes hours to drain post-recovery and increases the chance of re-tripping the tracker.
  3. Document the failure mode and recovery in the self-hosting docs, so other operators recognize "Unable to load your task runs" + MEMORY_LIMIT_EXCEEDED ... 7 EiB as a single condition with a known one-line fix.
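For (1), the query-based probe could be expressed as a values override. This is a sketch assuming the Bitnami subchart exposes its usual `customLivenessProbe` hook — key names should be verified against the pinned chart version:

```yaml
# values.yaml override (sketch; assumes the Bitnami subchart's
# customLivenessProbe hook, which overrides the default TCP probe)
clickhouse:
  customLivenessProbe:
    exec:
      command:
        - /bin/sh
        - -c
        - clickhouse-client -q "SELECT 1"
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3
```

One caveat: a trivial `SELECT 1` may allocate little enough to survive the wedged tracker, so the probe query may need to touch a real table (e.g. a cheap read against task_runs_v2) to reliably trip in this failure mode.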

Happy to test a chart bump on our end and report back, or open a small PR against hosting/k8s/helm/ for the subchart bump if helpful.

— Filed by an operator running self-hosted Trigger.dev v4 on EKS
