[Bug]: Lock contention on Linux install — enable --now timer races with inline post-install telemetry #62

@swarit-stepsecurity

Version

stepsecurity-dev-machine-guard v1.11.0 (linux_amd64)

OS

Fedora Linux 42 (Cloud Edition), kernel 6.19.12-100.fc42.x86_64

Command Run

./stepsecurity-dev-machine-guard install

(invoked via the Linux loader script's install path; reproduced both as the target user and via sudo with SUDO_USER privilege drop.)

Expected Behavior

A clean first install: timer registered, initial telemetry uploaded once, no errors recorded in ~/.stepsecurity/agent.error.log.

Actual Behavior

Every fresh install on Linux leaves the following two errors in ~/.stepsecurity/agent.error.log even though the install itself appears to succeed and a telemetry upload eventually returns HTTP 200:

==========================================
StepSecurity Device Agent v1.11.0
==========================================

[scanning] run-status[failed]: HTTP 400 (terminal, no retry)
[error] acquiring lock: another instance is already running (PID <X>)

…where PID <X> is the PID of the process that ran the inline post-install telemetry from the binary's install command. The lock contender is a second concurrent invocation of the binary that the user did not explicitly start.

Output / Error Messages

Sequence observed during a clean install on Fedora 42 (timestamps abbreviated):

18:43:24  [loader] Running binary install...
18:43:25  [binary] systemd user timer configuration completed successfully
18:43:25  [binary]   Service: ~/.config/systemd/user/stepsecurity-dev-machine-guard.service
18:43:25  [binary]   Timer:   ~/.config/systemd/user/stepsecurity-dev-machine-guard.timer
18:43:25  [binary] Installation complete!
18:43:25  [binary] Sending initial telemetry...
18:43:25  [binary] Lock acquired (PID: 76249)         <-- inline post-install telemetry
18:43:25  [error]  run-status[failed]: HTTP 400 (terminal, no retry)
18:43:25  [error]  acquiring lock: another instance is already running (PID 76249)   <-- racing process
18:43:31  [binary] Telemetry collection completed successfully
18:43:31  [binary] Lock released (PID: 76249)

Root cause (suspected)

The Linux install path enables and immediately starts the timer, then runs initial telemetry inline:

  • internal/systemd/systemd.go:81 runs systemctl --user enable --now stepsecurity-dev-machine-guard.timer
  • Timer unit (same file, ~lines 181-183):
    OnBootSec=5min
    Persistent=true
  • cmd/stepsecurity-dev-machine-guard/main.go:132-137 — after the systemd.Install() call returns, the binary calls telemetry.Run(...) inline.

With OnBootSec=5min, on any host whose uptime exceeds 5 minutes (i.e. effectively every install in the wild), the timer's deadline has already passed by the time enable --now starts it, so systemd treats the timer as elapsed and fires the service immediately. (Persistent=true is documented as affecting only OnCalendar= timers, so it is probably not what triggers the immediate run here.) That timer-triggered service runs send-telemetry and tries to acquire the singleton lock at the same moment as main.go's inline telemetry.Run(). The two race, and whichever loses prints the acquiring lock: another instance is already running error and exits non-zero. In this case the loser is the systemd-launched process, since the inline call started a fraction earlier.
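To make the suspected sequence concrete, here is a minimal sketch of a fail-fast singleton lock and the ordering described above. It is not the agent's actual lock code: acquireLock, the agent.lock file name, and the second PID (76250) are hypothetical and used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// acquireLock is a hypothetical fail-fast singleton lock: it creates the lock
// file exclusively and records the caller's PID. If the file already exists,
// it returns an error naming the PID stored by the current holder.
func acquireLock(path string, pid int) (func(), error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if errors.Is(err, os.ErrExist) {
		holder, _ := os.ReadFile(path)
		return nil, fmt.Errorf("acquiring lock: another instance is already running (PID %s)",
			strings.TrimSpace(string(holder)))
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if _, err := f.WriteString(strconv.Itoa(pid)); err != nil {
		os.Remove(path)
		return nil, err
	}
	return func() { os.Remove(path) }, nil
}

// simulateInstall replays the suspected ordering in a temp directory: the
// inline post-install run takes the lock first, then the timer-triggered run
// tries to take it while it is still held. It returns the second run's error text.
func simulateInstall() (string, error) {
	dir, err := os.MkdirTemp("", "lock-race")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	lock := filepath.Join(dir, "agent.lock") // hypothetical lock file name

	release, err := acquireLock(lock, 76249) // inline post-install telemetry
	if err != nil {
		return "", err
	}
	defer release()

	release2, timerErr := acquireLock(lock, 76250) // timer-triggered service (made-up PID)
	if timerErr == nil {
		release2()
		return "", nil
	}
	return timerErr.Error(), nil
}

func main() {
	msg, err := simulateInstall()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("[error]", msg)
}
```

The second caller fails immediately instead of waiting, which matches the error line that ends up in agent.error.log.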

This appears to be Linux-specific. The macOS path (launchd.Install) and Windows path (schtasks.Install) presumably don't have an equivalent "fire immediately on register" behavior, hence no equivalent race.

Suggested fixes (any one is sufficient)

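  1. Drop --now from the enable call on Linux (systemd.go:81) so the inline telemetry.Run() in main.go:134 is the only initial run. Note that enable without --now only registers the timer and does not start it, so the timer would need to be started once the inline run has finished; otherwise it first becomes active at the next login or boot.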
  2. Skip the inline telemetry.Run() on Linux in the install case in main.go, and rely on enable --now to trigger the first scan via the timer.
  3. Hold the singleton lock around the entire install command so the timer-triggered service blocks until the inline run releases it (rather than failing fast with "another instance is already running"). This also fixes the misleading error in agent.error.log.
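For option 3, the timer-triggered run only blocks if its acquire path waits instead of failing fast. A rough sketch of such a bounded wait follows. ErrLocked, memLock, and acquireWithWait are hypothetical names, and the in-memory mutex stands in for whatever cross-process lock the agent really uses.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrLocked is a hypothetical sentinel for "lock is held by another run".
var ErrLocked = errors.New("another instance is already running")

// memLock is an in-memory stand-in for the agent's singleton lock.
type memLock struct{ mu sync.Mutex }

// TryAcquire fails fast with ErrLocked when the lock is already held.
func (l *memLock) TryAcquire() (func(), error) {
	if !l.mu.TryLock() {
		return nil, ErrLocked
	}
	return l.mu.Unlock, nil
}

// acquireWithWait retries a fail-fast acquire until it succeeds, a
// non-contention error occurs, or the timeout expires.
func acquireWithWait(try func() (func(), error), poll, timeout time.Duration) (func(), error) {
	deadline := time.Now().Add(timeout)
	for {
		release, err := try()
		if err == nil {
			return release, nil
		}
		if !errors.Is(err, ErrLocked) || time.Now().After(deadline) {
			return nil, err
		}
		time.Sleep(poll)
	}
}

// demo shows both outcomes: a waiter that gets the lock once the holder
// releases it, and a waiter that gives up when the lock stays held.
func demo() (waited bool, timedOut bool) {
	lock := &memLock{}
	release, _ := lock.TryAcquire() // plays the inline post-install run
	go func() {
		time.Sleep(20 * time.Millisecond)
		release()
	}()

	held, err := acquireWithWait(lock.TryAcquire, 5*time.Millisecond, 2*time.Second)
	waited = err == nil

	// The lock is still held by the first waiter, so a short timeout expires.
	_, err = acquireWithWait(lock.TryAcquire, 5*time.Millisecond, 30*time.Millisecond)
	timedOut = errors.Is(err, ErrLocked)

	if held != nil {
		held()
	}
	return waited, timedOut
}

func main() {
	waited, timedOut := demo()
	fmt.Println("waiter acquired after release:", waited)
	fmt.Println("short waiter timed out:", timedOut)
}
```

A bounded wait keeps a stuck holder from hanging the timer-triggered service indefinitely. The timeout would need to exceed a typical telemetry run, which took about six seconds in the log above.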

Option 1 is the smallest, most local change.

Additional Context

  • Reproduced on a fresh Fedora 42 VM with no prior agent state (after rm -rf ~/.stepsecurity and removing ~/.config/systemd/user/stepsecurity-*).
  • Reproduces both when the loader is invoked as the target user and as sudo (with the loader correctly dropping privileges via runuser + XDG_RUNTIME_DIR/DBUS_SESSION_BUS_ADDRESS).
  • Telemetry uploads still succeed end-to-end (HTTP 200), so this is a "correctness of error log" issue rather than a functional install failure — but the persistent [error] acquiring lock: another instance is already running message in agent.error.log is alarming for operators reviewing logs and looks like a real concurrency bug rather than a self-inflicted race.
  • Separately worth noting (not the same bug, but related symptom amplifier): the Linux loader template that's served from the dashboard runs binary install and then also runs binary send-telemetry immediately after, which produces a third back-to-back telemetry invocation for every install. With the timer-fires-immediately behavior above, that's effectively three telemetry runs queued up at install time. Worth removing the redundant send-telemetry from the loader template, but the race in this issue is reproducible without the loader's extra call too.
