Fix WSLC exec hang on fast runc failure (e.g. invalid user/group) by benhillis · Pull Request #40550 · microsoft/WSL

benhillis · 2026-05-15T02:13:50Z

Summary

Fixes a hang in wslc container exec when the exec'd process fails before runc forks it (e.g. wslc container exec -u root:badgid id). The wslc client hangs forever instead of returning exit code 126 and the "unable to find group badgid" error.

Root cause

WSLCContainerImpl::Exec polls Docker's exec inspect endpoint after StartExec to learn whether the user process is running or has already failed. The Running branch was guarded by state.Pid.has_value():

if (state.Running && state.Pid.has_value()) {
    control->SetPid(state.Pid.value());
    break;
}

This guard is meaningless because Docker's wire schema (backend.ExecInspect in moby) declares Pid as a non-nullable Go int that is 0 until runc forks the user process — so the JSON response always contains "Pid": 0 and nlohmann always deserializes that as optional<int>(0) with has_value() == true.
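A minimal, stdlib-only sketch of why the guard cannot filter anything. `OldInspectExecSketch`, `StateBeforeFork`, and `OldGuardAccepts` are hypothetical names made up for illustration; the real type is populated by the JSON deserializer, which is modeled here by assigning the fields directly.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the old deserialized exec-inspect state.
// Field names follow the description above, not the actual WSL headers.
struct OldInspectExecSketch {
    bool Running = false;
    std::optional<int> Pid;      // old declaration: optional, yet always present on the wire
    std::optional<int> ExitCode; // genuinely nullable on the wire
};

// Models the result of deserializing {"Running": true, "Pid": 0, "ExitCode": null}:
// the "Pid" key is present, so the optional is engaged and holds 0.
inline OldInspectExecSketch StateBeforeFork() {
    OldInspectExecSketch state;
    state.Running = true;
    state.Pid = 0;
    return state;
}

// The old guard from the polling loop.
inline bool OldGuardAccepts(const OldInspectExecSketch& state) {
    return state.Running && state.Pid.has_value();
}

// Usage: the guard passes even though no user process exists yet.
inline void DemonstrateOldGuard() {
    const OldInspectExecSketch state = StateBeforeFork();
    assert(state.Pid.has_value());
    assert(state.Pid.value() == 0);
    assert(OldGuardAccepts(state));
}
```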

When runc fails before forking (invalid user/group, missing binary, etc.), Docker briefly reports {"Running": true, "Pid": 0, "ExitCode": null} in the small window between logging the error and running its deferred cleanup that sets Running=false, ExitCode=126. The polling loop:

1. Accepts Pid=0 as a valid PID
2. Calls SetPid(0)
3. Breaks out of the loop
4. Returns the process handle to wslc
5. wslc waits on the exit event forever, because Docker never emits an exec_die event when the user process never spawned (containerd's process-exit event stream never fires)
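The sequence above can be replayed against a scripted list of inspect responses. This is only a sketch: `BuggyExecState`, `PollWithOldGuard`, and `FastFailureResponses` are hypothetical, the 100ms sleep and the HTTP call are omitted, and the order of the two branches is an assumption rather than something taken from the actual source.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical model of one exec-inspect response, using the old optional Pid.
struct BuggyExecState {
    bool Running;
    std::optional<int> Pid;
    std::optional<int> ExitCode;
};

enum class BuggyPollKind { TreatedAsRunning, Exited, StillPolling };

struct BuggyPollOutcome {
    BuggyPollKind kind;
    int value;                     // PID or exit code, depending on kind
    std::size_t responsesConsumed; // how many inspect responses were read
};

// Replays the polling loop with the old guard over scripted responses.
inline BuggyPollOutcome PollWithOldGuard(const std::vector<BuggyExecState>& responses) {
    std::size_t consumed = 0;
    for (const BuggyExecState& state : responses) {
        ++consumed;
        if (state.Running && state.Pid.has_value()) {
            // Pid is always engaged, so this fires even for Pid == 0.
            return {BuggyPollKind::TreatedAsRunning, *state.Pid, consumed};
        }
        if (state.ExitCode.has_value()) {
            return {BuggyPollKind::Exited, *state.ExitCode, consumed};
        }
    }
    return {BuggyPollKind::StillPolling, 0, consumed};
}

// Responses Docker would give for a fast runc failure, per the description:
// a transient Running/Pid=0 state, then the settled state with exit code 126.
inline std::vector<BuggyExecState> FastFailureResponses() {
    return {BuggyExecState{true, 0, std::nullopt},
            BuggyExecState{false, 0, 126}};
}

// Usage: the loop stops on the first response and never sees exit code 126.
inline void DemonstrateOldLoop() {
    const BuggyPollOutcome outcome = PollWithOldGuard(FastFailureResponses());
    assert(outcome.kind == BuggyPollKind::TreatedAsRunning);
    assert(outcome.value == 0);
    assert(outcome.responsesConsumed == 1);
}
```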

Forensics

Confirmed via process dumps and ETL trace from a failing CloudTest run:

DockerExecProcessControl in the wslcsession dump has m_pid = 0 (with has_value() == true) and m_exitedCode unset; its exit event was never signaled.
ETL trace contains exactly four events for the failing exec id: exec_create, exec_start, dockerd ERROR "unable to find group badgid", and one GET /exec/{id}/json returning 200. No exec_die ever.
Compare the previous test (NameGroupRoot) where the user process actually runs: exec_die is emitted normally.

Fix

Change InspectExec.Pid from std::optional<int> to plain int to match the wire format (Go int is non-nullable), and check state.Pid > 0 at the call site. With this change the loop continues polling on Pid=0; on the next 100ms iteration Docker has settled state and the existing ExitCode branch fires with the correct exit code 126.

ExitCode remains std::optional<int> because moby's backend.ExecInspect.ExitCode is *int (genuinely nullable).
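A sketch of the fixed shape, under the same assumptions as the description: `FixedExecState`, `PollWithFixedGuard`, and `FastFailureSequence` are hypothetical names, the responses are scripted rather than fetched from Docker, and the sleep between iterations is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical model of the fixed schema: Pid is a plain int (0 until the
// user process is forked), ExitCode stays optional because it is nullable.
struct FixedExecState {
    bool Running;
    int Pid;
    std::optional<int> ExitCode;
};

enum class FixedPollKind { Running, Exited, StillPolling };

struct FixedPollOutcome {
    FixedPollKind kind;
    int value;                     // PID or exit code, depending on kind
    std::size_t responsesConsumed; // how many inspect responses were read
};

// Replays the polling loop with the Pid > 0 check over scripted responses.
inline FixedPollOutcome PollWithFixedGuard(const std::vector<FixedExecState>& responses) {
    std::size_t consumed = 0;
    for (const FixedExecState& state : responses) {
        ++consumed;
        if (state.Running && state.Pid > 0) {
            return {FixedPollKind::Running, state.Pid, consumed};
        }
        if (state.ExitCode.has_value()) {
            return {FixedPollKind::Exited, *state.ExitCode, consumed};
        }
        // Running with Pid == 0: not settled yet, keep polling.
    }
    return {FixedPollKind::StillPolling, 0, consumed};
}

// Transient state followed by the settled failure, as in the description.
inline std::vector<FixedExecState> FastFailureSequence() {
    return {FixedExecState{true, 0, std::nullopt},
            FixedExecState{false, 0, 126}};
}

// Usage: the transient state is skipped and the exit code is reported.
inline void DemonstrateFixedLoop() {
    const FixedPollOutcome failed = PollWithFixedGuard(FastFailureSequence());
    assert(failed.kind == FixedPollKind::Exited);
    assert(failed.value == 126);
    assert(failed.responsesConsumed == 2);
}
```

A response with a real PID, for example `FixedExecState{true, 4321, std::nullopt}`, still takes the Running branch on the first iteration, so the success path is unchanged in this model.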

Validation

The existing E2E test WSLCE2EContainerExecTests::WSLCE2E_Container_Exec_UserOption_InvalidGroup_Fails is the regression test for this bug. With the fix it should pass reliably; without it, it hangs until the test host is killed.

WSLCContainerImpl::Exec polls Docker's exec inspect endpoint after StartExec to learn whether the user process is running or has already failed. The Running branch was guarded by `state.Pid.has_value()`, which is meaningless because Docker's wire schema declares Pid as a non-nullable Go int that is 0 until runc forks the user process - so the JSON always contains `"Pid": 0` and nlohmann always deserializes that as `optional<int>(0)` with `has_value() == true`.

When runc fails before forking (e.g. `-u root:badgid`), Docker briefly reports `{Running: true, Pid: 0, ExitCode: null}` in the window between logging the error and running its deferred cleanup that sets `Running=false, ExitCode=126`. The polling loop accepted Pid=0 as a valid PID, called SetPid(0), broke out, and returned the process to wslc. wslc then waited on the exit event forever, because Docker never emits an `exec_die` event when the user process never spawned.

Change InspectExec.Pid from `std::optional<int>` to `int` to match the wire format, and check `state.Pid > 0` at the call site. With this change the loop continues polling on Pid=0; on the next iteration Docker has settled state and the existing ExitCode branch fires with the correct exit code (126).

Verified against the failing test WSLCE2EContainerExecTests::WSLCE2E_Container_Exec_UserOption_InvalidGroup_Fails, which is the regression test for this bug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

benhillis · 2026-05-15T02:14:33Z

Hit this test failure in the release pipeline - I suspect it's just a somewhat tight race window.

OneBlue

Great catch! We might also want to add an assert that Pid > 0 in SetPid() to avoid hanging if we ever hit something similar in the future.

@OneBlue

Defense-in-depth follow-up to PR #40550. The exec polling loop in WSLCContainerImpl::Exec now correctly filters Pid > 0 before calling SetPid, but a future caller that bypassed that check would silently hang the process wait (because Docker never emits exec_die for a process that never spawned). Assert at the lowest level so any such regression fires loudly in Debug builds.

Suggested by @OneBlue in the PR #40550 review.

Co-authored-by: Ben Hillis <benhill@ntdev.microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
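A rough sketch of what such an assert could look like. `ExecProcessControlSketch` is a hypothetical class, not the real DockerExecProcessControl, and the standard `assert` macro stands in for whatever assertion macro the repository actually uses; only the `m_pid` member name is taken from the forensics notes above.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the process-control object that records the PID.
class ExecProcessControlSketch {
public:
    // Rejects PID 0 (and negatives) in builds where assert is active, so a
    // caller that skipped the Pid > 0 check fails loudly instead of hanging.
    void SetPid(int pid) {
        assert(pid > 0 && "SetPid requires the PID of a process that actually spawned");
        m_pid = pid;
    }

    std::optional<int> Pid() const { return m_pid; }

private:
    std::optional<int> m_pid; // unset until a real PID is reported
};
```

As with any `assert`, the check compiles away when `NDEBUG` is defined, which matches the "fires loudly in Debug builds" intent: the call-site `Pid > 0` filter remains the actual protection in release builds.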

benhillis requested a review from a team as a code owner May 15, 2026 02:13

benhillis mentioned this pull request May 15, 2026

wslc: align docker_schema with bundled dockerd v25.0.3 (API v1.44) #40552

Open

dkbennett approved these changes May 15, 2026


OneBlue approved these changes May 15, 2026


benhillis merged commit 5b7206b into microsoft:master May 15, 2026
9 checks passed

benhillis mentioned this pull request May 15, 2026

wslc: assert Pid > 0 in DockerExecProcessControl::SetPid #40567

Merged

benhillis merged 1 commit into microsoft:master from benhillis:fix/wslc-exec-pid-zero-hang
