Version
stepsecurity-dev-machine-guard v1.11.0 (linux_amd64)
OS
Fedora Linux 42 (Cloud Edition), kernel 6.19.12-100.fc42.x86_64
Command Run
./stepsecurity-dev-machine-guard install
(invoked via the Linux loader script's install path; reproduced both as the target user and via sudo with SUDO_USER privilege drop.)
Expected Behavior
A clean first install: timer registered, initial telemetry uploaded once, no errors recorded in ~/.stepsecurity/agent.error.log.
Actual Behavior
Every fresh install on Linux leaves the following two errors in ~/.stepsecurity/agent.error.log even though the install itself appears to succeed and a telemetry upload eventually returns HTTP 200:
==========================================
StepSecurity Device Agent v1.11.0
==========================================
[scanning] run-status[failed]: HTTP 400 (terminal, no retry)
[error] acquiring lock: another instance is already running (PID <X>)
…where PID <X> is the PID of the process that ran the inline post-install telemetry from the binary's install command. The lock contender is a second concurrent invocation of the binary that the user did not explicitly start.
Output / Error Messages
Sequence observed during a clean install on Fedora 42 (timestamps abbreviated):
18:43:24 [loader] Running binary install...
18:43:25 [binary] systemd user timer configuration completed successfully
18:43:25 [binary] Service: ~/.config/systemd/user/stepsecurity-dev-machine-guard.service
18:43:25 [binary] Timer: ~/.config/systemd/user/stepsecurity-dev-machine-guard.timer
18:43:25 [binary] Installation complete!
18:43:25 [binary] Sending initial telemetry...
18:43:25 [binary] Lock acquired (PID: 76249) <-- inline post-install telemetry
18:43:25 [error] run-status[failed]: HTTP 400 (terminal, no retry)
18:43:25 [error] acquiring lock: another instance is already running (PID 76249) <-- racing process
18:43:31 [binary] Telemetry collection completed successfully
18:43:31 [binary] Lock released (PID: 76249)
Root cause (suspected)
The Linux install path enables and immediately starts the timer, then runs initial telemetry inline:
- internal/systemd/systemd.go:81: systemctl --user enable --now stepsecurity-dev-machine-guard.timer
- Timer unit (same file, ~lines 181-183):
  OnBootSec=5min
  Persistent=true
- cmd/stepsecurity-dev-machine-guard/main.go:132-137: after the systemd.Install() call returns, the binary calls telemetry.Run(...) inline.
With OnBootSec=5min, on any host whose uptime exceeds 5 minutes (i.e. effectively every install in the wild), enabling the timer with --now activates it after its deadline has already passed, so systemd elapses the timer and starts the service immediately. (Persistent=true is documented to affect only OnCalendar= timers, so it is probably not what causes the immediate trigger here.) That timer-triggered service runs send-telemetry and tries to acquire the singleton lock at the same moment main.go's inline telemetry.Run() is doing the same. They race, and whoever loses prints the "acquiring lock: another instance is already running" error and exits non-zero (the systemd-launched one in this case, since the inline call started a fraction earlier).
This appears to be Linux-specific. The macOS path (launchd.Install) and Windows path (schtasks.Install) presumably don't have an equivalent "fire immediately on register" behavior, hence no equivalent race.
Suggested fixes (any one is sufficient)
- Drop --now from the enable call on Linux (systemd.go:81) so the inline telemetry.Run() in main.go:134 is the only initial run; the timer will naturally fire on its next scheduled tick.
- Skip the inline telemetry.Run() on Linux in the install case in main.go, and rely on enable --now to trigger the first scan via the timer.
- Hold the singleton lock around the entire install command so the timer-triggered service blocks until the inline run releases it (rather than failing fast with "another instance is already running"). This also fixes the misleading error in agent.error.log.
The first option is the smallest, most local change.
Additional Context
- Reproduced on a fresh Fedora 42 VM with no prior agent state (after rm -rf ~/.stepsecurity and removing ~/.config/systemd/user/stepsecurity-*).
- Reproduces both when the loader is invoked as the target user and as sudo (with the loader correctly dropping privileges via runuser + XDG_RUNTIME_DIR/DBUS_SESSION_BUS_ADDRESS).
- Telemetry uploads still succeed end-to-end (HTTP 200), so this is a "correctness of error log" issue rather than a functional install failure. Still, the persistent [error] acquiring lock: another instance is already running message in agent.error.log is alarming for operators reviewing logs and looks like a real concurrency bug rather than a self-inflicted race.
- Separately worth noting (not the same bug, but a related symptom amplifier): the Linux loader template that's served from the dashboard runs binary install and then also runs binary send-telemetry immediately after, which produces a third back-to-back telemetry invocation for every install. With the timer-fires-immediately behavior above, that's effectively three telemetry runs queued up at install time. Worth removing the redundant send-telemetry from the loader template, but the race in this issue is reproducible without the loader's extra call too.