Skip to content

fix: thread-safety hardening of Configuration (cache and in-place mutation races)#620

Open
fgmacedo wants to merge 4 commits into
developfrom
fix/configuration-states-cache-race
Open

fix: thread-safety hardening of Configuration (cache and in-place mutation races)#620
fgmacedo wants to merge 4 commits into
developfrom
fix/configuration-states-cache-race

Conversation

@fgmacedo
Copy link
Copy Markdown
Owner

@fgmacedo fgmacedo commented May 15, 2026

Summary

Fixes two thread-safety races in Configuration introduced indirectly by the
caching / no-copy fast paths added in 3.1.0 (#592). The sync engine is
documented as supporting concurrent reads of .configuration while another
thread sends events; both races violate that contract.

Race A — cached states could return None

Configuration.states checked self._cached is not None and then returned
self._cached. Another thread invalidating between the check and return
caused None to be returned and a TypeError in list(machine.configuration).
Fixed by snapshotting _cached and _cached_value locally before the
freshness check (commit d2ddf1e).

Race B — in-place mutation of the model's OrderedSet

Configuration.add() / discard() mutated the same OrderedSet reference
the model holds, in place, then rewrote the same ref. A concurrent reader
iterating that reference inside Configuration.states could crash with
RuntimeError: Set changed size during iteration, and the cache identity
check could briefly hand back a stale resolved set. Fixed by copy-on-write
in add and discard (commit 38f8eef).

Race B is only reachable when atomic_configuration_update=False
(the StateChart default, where parallel regions need it). The atomic path
used by StateMachine was never affected.

Tests

  • tests/test_threading.py::TestThreadSafety::test_concurrent_send_and_read_configuration
    covers Race A.
  • New test_concurrent_parallel_region_send_and_read exercises a parallel
    StateChart under concurrent writes and reads.
  • New test_add_discard_produce_fresh_orderedset is a deterministic
    contract pin for copy-on-write (fails on pre-fix code without timing).

100% branch coverage on statemachine/configuration.py preserved.

Performance

Copy-on-write in add/discard reintroduces an O(n) shallow copy of the
active configuration on every state entry and exit. Median results
(macOS / Python 3.14, pytest-benchmark), 3.1.0 → 3.1.1:

Benchmark 3.1.0 3.1.1 Δ
test_parallel_region_events 175.2 μs 184.5 μs +5.3%
test_many_transitions_reset 125.9 μs 139.5 μs +10.9%
test_guarded_transitions 70.0 μs 75.7 μs +8.2%
test_history_pause_resume 88.4 μs 91.4 μs +3.4%
test_many_transitions_full_cycle 156.9 μs 162.1 μs +3.3%
test_flat_self_transition 38.7 μs 39.1 μs +1.0%

Overall 4.7x–7.7x event throughput vs 3.0.0 (declared in 3.1.0) is
unchanged.

Release

Lands as 3.1.1. See docs/releases/3.1.1.md.

Notes for reviewers

… reads

Snapshot the cache fields locally before the freshness check. Without
the snapshot, another thread invalidating the cache between the
``self._cached is not None`` check and the ``return self._cached`` could
cause the getter to return ``None``, leading to ``TypeError`` in callers
that iterate the result (e.g., ``list(machine.configuration)``).

Surfaced intermittently by tests/test_threading.py
::TestThreadSafety::test_concurrent_send_and_read_configuration on
slower CI runners.

Signed-off-by: Fernando Macedo <fgmacedo@gmail.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (847f1d5) to head (b242a3b).

Additional details and impacted files
@@            Coverage Diff            @@
##           develop      #620   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           42        42           
  Lines         5027      5029    +2     
  Branches       812       812           
=========================================
+ Hits          5027      5029    +2     
Flag Coverage Δ
unittests 100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

fgmacedo added 3 commits May 15, 2026 18:27
``Configuration.add()`` and ``discard()`` previously mutated the
``OrderedSet`` stored on the model in place, then rewrote the same
reference. Two thread-safety problems followed:

1. A concurrent reader iterating ``.configuration`` (cache miss path
   inside the ``states`` getter) could crash with ``RuntimeError: Set
   changed size during iteration``.
2. Because ``setattr`` re-stored the *same* OrderedSet ref, the cache
   identity check ``self._cached_value is raw`` stayed True after the
   mutation, so a reader could briefly receive the stale cached
   ``OrderedSet[State]`` missing the new state.

Both ``add`` and ``discard`` now copy the underlying set before
mutating, producing a fresh ref each call. Readers either continue
iterating the prior ref (no concurrent mutation) or recompute against
the new ref (cache identity check naturally fails on the new object).

Affects only ``StateChart``-style configurations where
``atomic_configuration_update=False`` (the default for parallel
regions). The atomic update path used by ``StateMachine`` was never
affected because it always builds a fresh OrderedSet.

Regression tests in ``tests/test_threading.py::TestThreadSafety``:
``test_concurrent_parallel_region_send_and_read`` and the
deterministic contract pin ``test_add_discard_produce_fresh_orderedset``.

Signed-off-by: Fernando Macedo <fgmacedo@gmail.com>
Two cache-related races in ``Configuration`` introduced indirectly by
the 3.1.0 performance optimisation are fixed in 3.1.1. Includes a
performance comparison table; the 4.7x-7.7x speedup vs 3.0.0 declared
in 3.1.0 release notes is unchanged.

Signed-off-by: Fernando Macedo <fgmacedo@gmail.com>
- Bump version in pyproject.toml, ``statemachine.__version__``, and uv.lock
- Update Project-Id-Version in locale catalogs (en, pt_BR, hi_IN, zh_CN)

Signed-off-by: Fernando Macedo <fgmacedo@gmail.com>
@fgmacedo fgmacedo changed the title fix: race in Configuration.states cache under concurrent reads fix: thread-safety hardening of Configuration (cache and in-place mutation races) May 15, 2026
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant