fix(orchestrator): capture sidecar stderr + raise default boot timeout (closes #1) #2

Open
evannadeau wants to merge 1 commit into SpawnBox-dev:main from evannadeau:fix/sidecar-boot-timeout-and-log-capture

@evannadeau

Closes #1.

Three changes against mcp/server.ts, plus the rebuilt dist/server.js:

  1. Sidecar stdout/stderr → <pluginRoot>/.sidecar.log (truncated each boot). trySpawn accepts a log FD; replaces stdio: "ignore". FD opened once in startSidecar, shared across all uvx → python → python3 fallback attempts. Falls back to "ignore" if openSync throws — preserves prior behavior on open failure.
  2. ORCH_SIDECAR_BOOT_TIMEOUT_MS env override + uvx default 60 s → 180 s. Eliminates the residential-link failure described in #1 (silent sidecar boot failure on first install_embeddings, where cold-start downloads exceed the 60 s timeout). Python/python3 fallback timeouts honor the env var but default unchanged at 30 s (assumes uvx already cached deps).
  3. system_status and install_embeddings now reference the log path and, when the bge-m3 model isn't yet cached, suggest the timeout env var. install_embeddings(check) adds a bge-m3 model cache: present (~10s boot expected) | not yet downloaded (~2 GB on first boot) line that disambiguates first-run from broken-run.
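For reviewers, the timeout resolution in item 2 can be sketched roughly as below. This is an illustration, not the code in mcp/server.ts: the function name `resolveBootTimeoutMs`, the `Launcher` type, and the rule that invalid or non-positive values fall back to the default are all assumptions. Only the env var name and the 180 s / 30 s defaults come from this PR.

```typescript
// Defaults stated in this PR: uvx 180 s, python/python3 fallbacks 30 s.
const DEFAULT_BOOT_TIMEOUT_MS = {
  uvx: 180_000,
  python: 30_000,
  python3: 30_000,
} as const;

type Launcher = keyof typeof DEFAULT_BOOT_TIMEOUT_MS;

// Hypothetical helper: pick the boot timeout for one spawn attempt.
function resolveBootTimeoutMs(
  launcher: Launcher,
  env: Record<string, string | undefined>,
): number {
  const raw = env["ORCH_SIDECAR_BOOT_TIMEOUT_MS"];
  const parsed = raw === undefined ? NaN : Number(raw);
  // Assumption: unset, empty, non-numeric, or non-positive values
  // are ignored and the per-launcher default applies.
  if (Number.isFinite(parsed) && parsed > 0) {
    return Math.floor(parsed);
  }
  return DEFAULT_BOOT_TIMEOUT_MS[launcher];
}
```

With this shape, `resolveBootTimeoutMs("uvx", process.env)` yields 180000 unless the user exports ORCH_SIDECAR_BOOT_TIMEOUT_MS, in which case the same override applies to every launcher.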

.gitignore picks up .sidecar.log. dist/server.js rebuilt via bun run build.

Tested

  • bun run typecheck — clean.
  • bun test — 330 pass / 1 fail. The single failure is in tests/hooks/hooks.test.ts (session-activity-nudge text mismatch) and reproduces against upstream/main without this patch — pre-existing, not introduced here.
  • Cold-start verified end-to-end on residential broadband (~100 Mbps, WSL2). Reconstructed timing from process lstart + HF cache blob mtimes:
    • 10:00:41 — cold sidecar boot started (HF cache empty)
    • 10:01:54 — 2.27 GB model.onnx_data finished downloading (73 s in)
    • ~10:02:02 — Model ready, port file written, sidecar healthy (~80 s total)
    • The prior 60 s budget would have killed the spawn at 10:01:41, mid-download.
  • .sidecar.log captured the previously-discarded boot diagnostics (download progress, HF rate-limit warning, ONNX load, port write, "Listening on …").
  • install_embeddings(check) with cache populated returned the expected bge-m3 model cache: present (~10s boot expected) line.
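The timeline arithmetic in the cold-start bullet can be double-checked with a few lines. The timestamps are the ones reported above; the helper names (`toSeconds`, `survivesBudget`) are made up for this sketch.

```typescript
// Convert an "HH:MM:SS" wall-clock string to seconds since midnight.
function toSeconds(hms: string): number {
  const parts = hms.split(":").map(Number);
  return (parts[0] ?? 0) * 3600 + (parts[1] ?? 0) * 60 + (parts[2] ?? 0);
}

const bootStart = toSeconds("10:00:41");    // cold sidecar boot started
const downloadDone = toSeconds("10:01:54"); // model.onnx_data finished
const healthy = toSeconds("10:02:02");      // approximate "sidecar healthy" time

const downloadElapsedS = downloadDone - bootStart; // seconds until download finished
const totalElapsedS = healthy - bootStart;         // seconds until healthy

// Would a given boot budget have let this particular boot finish?
function survivesBudget(budgetMs: number): boolean {
  return totalElapsedS * 1000 <= budgetMs;
}

console.log({ downloadElapsedS, totalElapsedS });
console.log("60 s budget:", survivesBudget(60_000));
console.log("180 s budget:", survivesBudget(180_000));
```

The download completes 73 s after boot start, so the old 60 s budget expires 13 s before the model file is fully on disk, while the 180 s default leaves roughly 100 s of headroom on this link.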

Known minor limitation

.sidecar.log opens in "w" mode per startSidecar call, so a second startSidecar (e.g. on install_embeddings(install) after an automatic session-start spawn) truncates the prior attempt's log. Strictly better than stdio: "ignore", but consider switching to "a" (append) or rotate-on-open if maintainers want multi-attempt forensics. Happy to follow up.
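If maintainers prefer rotate-on-open, one possible shape is sketched below. It is not part of this patch: `openLogWithRotation` and the `.1` suffix are hypothetical, and the demo writes to a scratch directory rather than <pluginRoot>. The simpler alternative is to pass "a" instead of "w" to openSync.

```typescript
import {
  closeSync, existsSync, mkdtempSync, openSync,
  readFileSync, renameSync, writeSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical rotate-on-open: keep the previous attempt as "<log>.1",
// then open a fresh, truncated log for the current attempt.
function openLogWithRotation(logPath: string): number {
  if (existsSync(logPath)) {
    try {
      renameSync(logPath, `${logPath}.1`); // replaces any older ".1"
    } catch {
      // Assumption: if rotation fails (e.g. the file is locked),
      // fall through and truncate, matching the current behavior.
    }
  }
  return openSync(logPath, "w");
}

// Usage: two simulated boot attempts in a scratch directory.
const dir = mkdtempSync(join(tmpdir(), "sidecar-log-"));
const logPath = join(dir, ".sidecar.log");
for (const attempt of ["first boot", "second boot"]) {
  const fd = openLogWithRotation(logPath);
  writeSync(fd, attempt);
  closeSync(fd);
}
const currentLog = readFileSync(logPath, "utf8");
const previousLog = readFileSync(`${logPath}.1`, "utf8");
console.log({ currentLog, previousLog });
```

This keeps exactly one prior attempt, which bounds disk use while still preserving the session-start spawn's log when install_embeddings(install) triggers a second startSidecar.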

Not changed

  • No version bump — leaving that to maintainer judgment.
  • No new tests — adding integration tests for spawn/log behavior would be heavy. Existing suite + manual cold-start cover the change.
  • No README/docs changes — the env var surfaces in install_embeddings(install) failure output, which is where users hit it.

fix(orchestrator): capture sidecar stderr + raise default boot timeout (closes SpawnBox-dev#1)

Three changes to mcp/server.ts make the sidecar boot path self-diagnosing
on slow connections and configurable for users who need more time:

1. Capture spawned-sidecar stderr to <pluginRoot>/.sidecar.log instead of
   stdio: "ignore". File opened (truncated) once per startSidecar call,
   shared across all uvx/python/python3 fallback attempts. Falls back to
   "ignore" if openSync throws — preserves prior behavior on open failure.

2. Raise default uvx boot timeout 60s → 180s; expose
   ORCH_SIDECAR_BOOT_TIMEOUT_MS env override applied to all spawn attempts.
   Eliminates the residential-broadband failure where downloading
   ~2 GB of bge-m3 model + onnxruntime wheels exceeds the prior budget.
   Fast-link users see no behavior change (boot still completes well inside
   180s on gigabit).

3. system_status and install_embeddings now reference the sidecar log path
   and, when the bge-m3 model isn't yet cached, suggest the timeout env var.
   install_embeddings(check) adds a "bge-m3 model cache: present /
   not yet downloaded" line that disambiguates first-run from broken-run.
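The log capture in item 1 might look roughly like the following. This is a sketch under stated assumptions, not the code in mcp/server.ts: `openSidecarLog`, `sidecarStdio`, and `spawnWithLog` are hypothetical names, and routing both stdout and stderr to the same FD follows the PR description. The calls themselves (`openSync`, `spawn` with a numeric FD in `stdio`) are standard Node.js APIs.

```typescript
import { spawn } from "node:child_process";
import { openSync } from "node:fs";
import { join } from "node:path";

type SidecarStdio = "ignore" | Array<"ignore" | number>;

// Hypothetical helper: open <pluginRoot>/.sidecar.log once, truncating it.
// Returns null if the open fails so callers can keep the old behavior.
function openSidecarLog(pluginRoot: string): number | null {
  try {
    return openSync(join(pluginRoot, ".sidecar.log"), "w");
  } catch {
    return null;
  }
}

// stdin is ignored; stdout and stderr share the log FD when one exists,
// otherwise everything falls back to "ignore".
function sidecarStdio(logFd: number | null): SidecarStdio {
  return logFd === null ? "ignore" : ["ignore", logFd, logFd];
}

// Hypothetical spawn wrapper: every fallback attempt (uvx, python, python3)
// would be passed the same logFd so all output lands in one file.
function spawnWithLog(cmd: string, args: string[], logFd: number | null) {
  return spawn(cmd, args, { stdio: sidecarStdio(logFd) });
}
```

Opening the FD once in the caller, rather than inside each spawn attempt, is what lets a failed uvx attempt and a later python attempt appear back to back in the same log.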

.gitignore picks up .sidecar.log. dist/server.js rebuilt via `bun run build`.

bun run typecheck: clean.
bun test: 330 pass / 1 fail (pre-existing — hooks.test.ts session-activity
nudge text mismatch reproduces against upstream/main without this patch).

Development

Successfully merging this pull request may close these issues.

orchestrator: silent sidecar boot failure on first install_embeddings — 60s timeout exceeded by cold-start downloads