Skip to content

Auto-evict stale queue_processes rows on worker startup#493

Merged
dereuromark merged 2 commits into
masterfrom
feature/heartbeat-aware-pid-slot
May 8, 2026
Merged

Auto-evict stale queue_processes rows on worker startup#493
dereuromark merged 2 commits into
masterfrom
feature/heartbeat-aware-pid-slot

Conversation

@dereuromark
Copy link
Copy Markdown
Owner

Summary

  • New config: Queue.staleHeartbeatThreshold (default 90s).
  • QueueProcessesTable::add() now evicts a stale (pid, server) row whose modified is older than the threshold before attempting the insert.
  • Fixes the crash loop on container restarts (FrankenPHP / Docker): killed workers leave orphan rows and the new worker — getting the same recycled PID — fails the unique-index insert.

Why a separate threshold (not defaultRequeueTimeout)

defaultRequeueTimeout (default 10 min) is the threshold for treating an in-flight job as abandoned. Deliberately long, to avoid double-execution. Workers heartbeat every loop iteration — far more often. A row not refreshed in ~90s almost certainly belongs to a dead worker, but waiting 10 min before evicting it makes container redeploys painful.

Caught vs. uncaught exception path

Worth highlighting: the unique (pid, server) collision throws Cake\\Database\\Exception\\QueryException at the DB level — not PersistenceFailedException. The existing catch in Processor::run() only handles the latter, so without this PR the worker crashes hard. The new test testAddDoesNotEvictRecentRowOnSamePidServer asserts that exact exception type so any future swap of the constraint mechanism is loud.

Relationship to other PRs

  • PR Add --force flag to queue worker clean #492 (--force flag on queue worker clean) is a complementary operator escape hatch.
  • A follow-up draft will explore making workerkey the canonical identity and dropping the (pid, server) unique index entirely; that would make this eviction logic redundant.

Adds Queue.staleHeartbeatThreshold config (default 90s).
QueueProcessesTable::add() now evicts a stale (pid, server) row whose
modified is older than the threshold *before* attempting the insert.

Fixes the crash loop seen on container restarts (FrankenPHP / Docker),
where killed workers leave orphan rows and the new worker - getting the
same recycled PID - fails the unique-index insert with a QueryException
that the existing PersistenceFailedException catch in Processor::run()
doesn't handle.

The threshold is intentionally separate from defaultRequeueTimeout
(default 10min): that one governs in-flight job requeueing and is
deliberately long; workers heartbeat far more often, so a row not
refreshed in 90s almost certainly belongs to a dead worker.
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 8, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.58%. Comparing base (1f88ba5) to head (226de43).
⚠️ Report is 1 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #493      +/-   ##
============================================
+ Coverage     77.51%   77.58%   +0.07%     
- Complexity      971      972       +1     
============================================
  Files            45       45              
  Lines          3255     3266      +11     
============================================
+ Hits           2523     2534      +11     
  Misses          732      732              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ware-pid-slot

# Conflicts:
#	tests/TestCase/Model/Table/QueueProcessesTableTest.php
@dereuromark dereuromark merged commit 85dbb81 into master May 8, 2026
16 checks passed
@dereuromark dereuromark deleted the feature/heartbeat-aware-pid-slot branch May 8, 2026 15:03
dereuromark added a commit that referenced this pull request May 8, 2026
…494)

PID is not a stable worker identity. The OS recycles low PIDs across
container restarts, and the unique index on (pid, server) collides
whenever a containerized worker is replaced before its row is cleaned.
workerkey is already a UUID-style hash generated per process - the
correct identity, and already uniquely indexed on its own.

Schema:
- New migration drops the unique index on (pid, server) and re-adds
  it as a non-unique index for query performance.
- Test fixture mirrors the change.

Behavior:
- update() and remove() now look up by workerkey.
- Processor stores the workerkey alongside the pid and passes it on
  heartbeat / shutdown.
- endProcess() picks the most recently heartbeated row when multiple
  rows share a PID. terminateProcess() already wipes by pid which
  remains correct - the OS guarantees only one live process per pid
  at any moment.

Open questions for review:
- Should worker end / kill accept --workerkey as an alternative to
  PID for precise targeting?
- Can the eviction logic added in PR #493 be removed once this lands,
  or kept as belt-and-braces?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants