Fix WSLC exec hang on fast runc failure (e.g. invalid user/group) #40550

Merged
benhillis merged 1 commit into microsoft:master from benhillis:fix/wslc-exec-pid-zero-hang
May 15, 2026

@benhillis
Member

Summary

Fixes a hang in wslc container exec when the exec'd process fails before runc forks it (e.g. wslc container exec -u root:badgid id). The wslc client hangs forever instead of returning exit code 126 and the "unable to find group badgid" error.

Root cause

WSLCContainerImpl::Exec polls Docker's exec inspect endpoint after StartExec to learn whether the user process is running or has already failed. The Running branch was guarded by state.Pid.has_value():

if (state.Running && state.Pid.has_value()) {
    control->SetPid(state.Pid.value());
    break;
}

This guard is meaningless because Docker's wire schema (backend.ExecInspect in moby) declares Pid as a non-nullable Go int that is 0 until runc forks the user process — so the JSON response always contains "Pid": 0 and nlohmann always deserializes that as optional<int>(0) with has_value() == true.
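A minimal standalone sketch of why the guard filters nothing, for illustration only and not the actual WSLC code. `InspectExecBefore`, `PreForkResponse`, and `OldGuard` are hypothetical names, and the struct is populated by hand to model what the JSON described above deserializes to, rather than going through nlohmann.

```cpp
#include <cassert>   // used by the checks mentioned in the note below
#include <optional>

// Hypothetical stand-in for the pre-fix InspectExec shape.
// Only the fields discussed in this PR are modeled.
struct InspectExecBefore {
    bool Running = false;
    std::optional<int> Pid;       // engaged whenever the "Pid" key is present
    std::optional<int> ExitCode;  // nullable on the wire, so optional fits
};

// Models the result of deserializing
// {"Running": true, "Pid": 0, "ExitCode": null}.
inline InspectExecBefore PreForkResponse() {
    InspectExecBefore state;
    state.Running = true;
    state.Pid = 0;  // key present with value 0, so the optional is engaged
    return state;   // ExitCode stays disengaged (null)
}

// The original guard: it only asks whether the optional is engaged,
// not whether the PID it holds belongs to a real process.
inline bool OldGuard(const InspectExecBefore& state) {
    return state.Running && state.Pid.has_value();
}
```

Under these assumptions, `assert(OldGuard(PreForkResponse()))` holds even though no process was ever forked, which is the state that leads to `SetPid(0)`.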

When runc fails before forking (invalid user/group, missing binary, etc.), Docker briefly reports {"Running": true, "Pid": 0, "ExitCode": null} in the small window between logging the error and running its deferred cleanup that sets Running=false, ExitCode=126. The polling loop:

  1. Accepts Pid=0 as a valid PID
  2. Calls SetPid(0)
  3. Breaks out of the loop
  4. Returns the process handle to wslc
  5. wslc waits on the exit event forever, because Docker never emits an exec_die event when the user process never spawned (containerd's process-exit event stream never fires)

Forensics

Confirmed via process dumps and ETL trace from a failing CloudTest run:

  • DockerExecProcessControl in the wslcsession dump has m_pid = 0 (with has_value() == true) and m_exitedCode unset; its exit event was never signaled.
  • ETL trace contains exactly four events for the failing exec id: exec_create, exec_start, dockerd ERROR "unable to find group badgid", and one GET /exec/{id}/json returning 200. No exec_die ever.
  • Compare the previous test (NameGroupRoot) where the user process actually runs: exec_die is emitted normally.

Fix

Change InspectExec.Pid from std::optional<int> to plain int to match the wire format (Go int is non-nullable), and check state.Pid > 0 at the call site. With this change the loop continues polling on Pid=0; on the next 100ms iteration Docker has settled state and the existing ExitCode branch fires with the correct exit code 126.

ExitCode remains std::optional<int> because moby's backend.ExecInspect.ExitCode is *int (genuinely nullable).
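The effect of the change can be sketched as a small simulation, assuming a scripted list of inspect responses in place of real calls to Docker. `InspectExec` here mirrors only the fields discussed above; `PollResult`, `PollExec`, and `FastFailureScript` are hypothetical names, and the branch order is an assumption rather than a copy of the real loop (which also waits about 100ms between iterations).

```cpp
#include <cassert>   // used by the checks mentioned in the note below
#include <cstddef>
#include <optional>
#include <vector>

// Post-fix shape as described in this PR: Pid is a plain int,
// ExitCode stays optional because it is nullable on the wire.
struct InspectExec {
    bool Running = false;
    int Pid = 0;
    std::optional<int> ExitCode;
};

// Hypothetical summary of what the polling loop decided.
struct PollResult {
    std::optional<int> Pid;       // set if a real process was observed
    std::optional<int> ExitCode;  // set if the exec already finished
    std::size_t Polls = 0;        // number of responses consumed
};

// Walks scripted responses one per iteration.
inline PollResult PollExec(const std::vector<InspectExec>& responses) {
    PollResult result;
    for (const InspectExec& state : responses) {
        ++result.Polls;
        if (state.Running && state.Pid > 0) {  // Pid == 0 means "not forked yet"
            result.Pid = state.Pid;
            break;
        }
        if (state.ExitCode.has_value()) {      // exec failed or already exited
            result.ExitCode = state.ExitCode;
            break;
        }
        // Otherwise keep polling.
    }
    return result;
}

// The fast-failure sequence from the root cause section: a transient
// Running=true/Pid=0 response, then the settled state with exit code 126.
inline std::vector<InspectExec> FastFailureScript() {
    std::vector<InspectExec> script;
    script.push_back(InspectExec{true, 0, std::nullopt});
    script.push_back(InspectExec{false, 0, 126});
    return script;
}
```

With this script the sketch skips the transient response and reports exit code 126 on the second poll, without ever recording a PID.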

Validation

The existing E2E test WSLCE2EContainerExecTests::WSLCE2E_Container_Exec_UserOption_InvalidGroup_Fails is the regression test for this bug. With the fix it should pass reliably; without it, it hangs until the test host is killed.

WSLCContainerImpl::Exec polls Docker's exec inspect endpoint after
StartExec to learn whether the user process is running or has already
failed. The Running branch was guarded by `state.Pid.has_value()`,
which is meaningless because Docker's wire schema declares Pid as a
non-nullable Go int that is 0 until runc forks the user process - so
the JSON always contains `"Pid": 0` and nlohmann always deserializes
that as `optional<int>(0)` with `has_value() == true`.

When runc fails before forking (e.g. `-u root:badgid`), Docker
briefly reports `{Running: true, Pid: 0, ExitCode: null}` in the
window between logging the error and running its deferred cleanup that
sets `Running=false, ExitCode=126`. The polling loop accepted Pid=0
as a valid PID, called SetPid(0), broke out, and returned the process
to wslc. wslc then waited on the exit event forever, because Docker
never emits an `exec_die` event when the user process never spawned.

Change InspectExec.Pid from `std::optional<int>` to `int` to match
the wire format, and check `state.Pid > 0` at the call site. With
this change the loop continues polling on Pid=0; on the next iteration
Docker has settled state and the existing ExitCode branch fires with
the correct exit code (126).

Verified against the failing test
WSLCE2EContainerExecTests::WSLCE2E_Container_Exec_UserOption_InvalidGroup_Fails,
which is the regression test for this bug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@benhillis benhillis requested a review from a team as a code owner May 15, 2026 02:13
@benhillis
Member Author

Hit this test failure in the release pipeline - I suspect it's just a somewhat tight race window.

Collaborator

@OneBlue OneBlue left a comment

Great catch! We might also want to add an assert that Pid > 0 in SetPid() to avoid hanging if we ever hit something similar in the future.

@benhillis benhillis merged commit 5b7206b into microsoft:master May 15, 2026
9 checks passed
benhillis added a commit that referenced this pull request May 16, 2026
Defense-in-depth follow-up to PR #40550. The exec polling loop in
WSLCContainerImpl::Exec now correctly filters Pid > 0 before calling
SetPid, but a future caller that bypassed that check would silently
hang the process wait (because Docker never emits exec_die for a
process that never spawned). Assert at the lowest level so any such
regression fires loudly in Debug builds.

Suggested by @OneBlue in the PR #40550 review.

Co-authored-by: Ben Hillis <benhill@ntdev.microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
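
A rough sketch of the kind of assertion the follow-up commit describes, not the actual change. `ExecProcessControlSketch` is a hypothetical, simplified stand-in for DockerExecProcessControl, and the standard `assert` macro is used here as an assumption in place of whatever assertion helper the codebase really uses.

```cpp
#include <cassert>
#include <optional>

// Hypothetical, simplified stand-in for the process-control object
// that stores the PID of the exec'd process.
class ExecProcessControlSketch {
public:
    void SetPid(int pid) {
        // Active in Debug builds; compiled out when NDEBUG is defined.
        // A PID of 0 means the process never spawned, so no exit event
        // would ever arrive for it.
        assert(pid > 0 && "SetPid requires the PID of a process that spawned");
        m_pid = pid;
    }

    std::optional<int> Pid() const { return m_pid; }

private:
    std::optional<int> m_pid;  // unset until a real PID is recorded
};
```

In this sketch a call such as `SetPid(0)` aborts in a Debug build instead of leaving a waiter blocked on an exit event that will never be signaled.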