
Helm chart pins ClickHouse 25.6.1, which has a memory-tracker overflow that breaks the dashboard until pod restart #3520

@brentshulman-silkline

Description


Hey guys... this one's a doozy, I'm sorry. We're trying to push ourselves to use AI more (as this issue's write-up clearly shows), and frankly I never would have found or understood this one otherwise. Erring on the side of too much info rather than too little.

Summary

The official Helm chart (oci://ghcr.io/triggerdotdev/charts/trigger) pins the Bitnami ClickHouse subchart to clickhouse-9.3.7, which in turn pins bitnami/clickhouse:25.6.1-debian-12-r0. Under sustained ingest, ClickHouse 25.6.1 hits a memory-tracker accounting bug that causes the global memory counter to overflow to ~7 EiB (just under 2^63 bytes), at which point every query — reads and writes — is rejected by OvercommitTracker until the pod is restarted. The trigger.dev webapp dashboard surfaces this as "Unable to load your task runs", and event/run telemetry stops being persisted.

The condition is not self-clearing. The host has plenty of free memory (RSS ~2 GiB out of a 21.6 GiB limit), but ClickHouse's internal accounting is wedged. Only an in-process restart resets it.

Environment

  • Trigger.dev Helm chart: 4.0.5 (also reproduced on 4.4.5 since both pin the same subchart — see below)
  • App image: ghcr.io/triggerdotdev/trigger.dev:v4.4.4
  • Kubernetes: EKS, single-shard ClickHouse statefulset (trigger-clickhouse-shard0-0), ~20 GiB PVC
  • ClickHouse resources: requests 13.5Gi / limits 24Gi (chart values block)
  • Workload: typical Trigger.dev v4 ingest — task runs + trace events flowing through task_runs_v2, task_events_v2, and the raw_task_runs_payload_v1 staging table

Reproduction

This was a production incident, not an isolated synthetic repro, but the trigger seems to be sustained write pressure with concurrent background merges on raw_task_runs_payload_v1. Once the merge tasks start failing with MEMORY_LIMIT_EXCEEDED, retries pile up and never recover.

What we observed

ClickHouse pod logs (representative, repeating thousands of times per minute):

Code: 241. DB::Exception: (total) memory limit exceeded:
  would use 7.00 EiB (attempt to allocate chunk of 4.00 MiB bytes),
  current RSS: 2.12 GiB, maximum: 21.60 GiB.
  OvercommitTracker decision: Query was selected to stop by OvercommitTracker.
  (MEMORY_LIMIT_EXCEEDED)
... while reading from part .../raw_task_runs_payload_v1/...
... in query: INSERT INTO trigger_dev.task_runs_v2 ...
... in query: INSERT INTO trigger_dev.task_events_v2 ...

The 7.00 EiB figure is the giveaway — that's just under 2^63 bytes (the signed 64-bit limit), i.e. a signed-integer overflow in ClickHouse's global memory tracker. The actual RSS is ~2 GiB.

Webapp logs (during the event):

EventRepo.DynamicFlushScheduler  Error attempting to flush batch
  consecutiveFailures: 19438
  table: trigger_dev.task_events_v2
  error: InsertError: (total) memory limit exceeded ... 7.00 EiB ...

The webapp accumulated ~1.75M backlogged events that had to drain after we restarted ClickHouse. The UI showed "Unable to load your task runs" because the dashboard's reads against task_runs_v2 were rejected by the same tracker.

Root cause (best assessment)

This is a known class of ClickHouse memory-tracker accounting bug: a free() is double-counted (or an alloc() is missed) in one of the hot paths (background merges, async inserts, materialized views), the global atomic counter goes negative, subsequent accounting arithmetic interprets it as an absurdly large value (the observed ~7 EiB), and OvercommitTracker rejects every subsequent allocation. There are multiple upstream ClickHouse commits in 25.7+ touching memory-tracker accuracy and overflow handling.

Mitigation (workaround)

kubectl rollout restart statefulset/trigger-clickhouse-shard0 -n trigger

Recovery is immediate — error rate goes from ~11k/min to ~0 within seconds of the pod coming back up. Webapp consecutiveFailures drops from ~47k to single digits as soon as ClickHouse is reachable again.
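For diagnosis, the wedged counter should also be visible directly in ClickHouse's system tables. A quick check might look like the following (pod and namespace names match our deployment above; `MemoryTracking` is the system.metrics name for the global tracker):

```shell
# Compare ClickHouse's internal memory accounting against actual RSS.
# In the wedged state the tracked value is absurd (~EiB) while RSS is normal.
kubectl exec -n trigger trigger-clickhouse-shard0-0 -- \
  clickhouse-client -q "
    SELECT metric, formatReadableSize(value)
    FROM system.metrics
    WHERE metric = 'MemoryTracking'"
```

Note that in the fully wedged state even this read may be rejected with MEMORY_LIMIT_EXCEEDED, which is itself confirmation — restart the statefulset as above.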

Suggested resolution (chart-side)

The trigger Helm chart's Bitnami CH subchart pin (charts/clickhouse-9.3.7 → bitnami/clickhouse:25.6.1-debian-12-r0) hasn't moved since trigger@4.0.5 and is still the same in trigger@4.4.5. Bumping the Bitnami subchart pin in hosting/k8s/helm/Chart.yaml to a current version (the latest Bitnami CH chart on main ships 25.7.5-debian-12-r0) would pull in upstream tracker fixes for everyone running self-hosted, without each operator having to override clickhouse.image.tag themselves.
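Concretely, the bump would be a one-line change to the subchart pin. Assuming the dependency stanza follows the usual Helm layout (field values here are illustrative; the actual Chart.yaml may differ), it would look roughly like:

```yaml
# hosting/k8s/helm/Chart.yaml (sketch; check the actual file layout)
dependencies:
  - name: clickhouse
    version: "9.3.7"   # current pin -> bitnami/clickhouse:25.6.1-debian-12-r0
                       # bump to the latest Bitnami chart shipping 25.7.x
    repository: oci://registry-1.docker.io/bitnamicharts
```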

Operators currently on trigger@4.0.5 through trigger@4.4.5 are exposed to this regardless of chart version, since the subchart pin is unchanged.

Suggested resolution (operational hardening, optional)

A few small additions would make this much less painful even if the underlying bug isn't fully fixed:

  1. A liveness probe that exercises the query path (e.g. clickhouse-client -q "SELECT 1") on the bundled CH StatefulSet. The current TCP/HTTP probe stays green when the pod is wedged — a query-based probe would let Kubernetes auto-restart the pod within a minute. Today an operator has to notice the user-visible failure first.
  2. Webapp circuit breaker / backlog cap on EventRepo.DynamicFlushScheduler. When consecutiveFailures crosses some threshold, sample/drop trace events instead of accumulating millions of items in memory. Losing 5 minutes of partial trace data is preferable to a 1.75M-item backlog that takes hours to drain post-recovery and increases the chance of re-tripping the tracker.
  3. Document the failure mode and recovery in the self-hosting docs, so other operators recognize "Unable to load your task runs" + MEMORY_LIMIT_EXCEEDED ... 7 EiB as a single condition with a known one-line fix.
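For (1), the query-based probe could be expressed as a values override. This is a sketch assuming the Bitnami subchart exposes its usual `customLivenessProbe` hook — key names should be verified against the pinned chart version:

```yaml
# values.yaml override (sketch; assumes the Bitnami subchart's
# customLivenessProbe hook, which overrides the default TCP probe)
clickhouse:
  customLivenessProbe:
    exec:
      command:
        - /bin/sh
        - -c
        - clickhouse-client -q "SELECT 1"
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3
```

One caveat: a trivial `SELECT 1` may allocate little enough to survive the wedged tracker, so the probe query may need to touch a real table (e.g. a cheap read against task_runs_v2) to reliably trip in this failure mode.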

Happy to test a chart bump on our end and report back, or open a small PR against hosting/k8s/helm/ for the subchart bump if helpful.

— Filed by an operator running self-hosted Trigger.dev v4 on EKS
