Router: API key auth middleware, LRU cache, and v2 metrics (NET-371/373/375) #58
Merged
NET-371 + NET-373 + NET-375. Validates Bearer sqd_data_<id>_<rnd> on
GET /network/:dataset/:start_block/worker against an in-process LRU
(60s exists / 15s deleted, three explicit states UNDEFINED/EXISTS/DELETED
plus a brief FailedRecently sentinel) backed by POST /internal/validate.
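A minimal sketch of the three-state lookup with asymmetric TTLs, under stated assumptions: `KeyCache`, `record`, and `lookup` are hypothetical names, a plain `HashMap` stands in for the real LRU (no capacity eviction), time is passed in as an offset from an injected clock, and the FailedRecently sentinel is left out.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// The three explicit cache states named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyState {
    /// Never seen, or the entry expired: the caller must ask the Network API.
    Undefined,
    Exists,
    Deleted,
}

const EXISTS_TTL: Duration = Duration::from_secs(60);
const DELETED_TTL: Duration = Duration::from_secs(15);

/// Hypothetical cache; `now` is an offset on a caller-supplied clock,
/// which keeps TTL tests deterministic.
pub struct KeyCache {
    entries: HashMap<String, (KeyState, Duration)>,
}

impl KeyCache {
    pub fn new() -> Self {
        KeyCache { entries: HashMap::new() }
    }

    /// Stores the upstream answer with the TTL that matches its state.
    pub fn record(&mut self, key_id: &str, exists: bool, now: Duration) {
        let (state, ttl) = if exists {
            (KeyState::Exists, EXISTS_TTL)
        } else {
            (KeyState::Deleted, DELETED_TTL)
        };
        self.entries.insert(key_id.to_owned(), (state, now + ttl));
    }

    /// Returns the cached state, downgrading expired entries to Undefined.
    pub fn lookup(&mut self, key_id: &str, now: Duration) -> KeyState {
        match self.entries.get(key_id).copied() {
            Some((state, expires_at)) if now < expires_at => state,
            Some(_) => {
                self.entries.remove(key_id);
                KeyState::Undefined
            }
            None => KeyState::Undefined,
        }
    }
}
```

Keeping Undefined distinct from Deleted means an expired or unknown key triggers a fresh validation instead of being denied from stale data.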
Enforcement scope is configurable via ENFORCE_V2_AUTH_FOR_IPS — a CIDR
list whose source IPs trigger 403 on missing/invalid keys; everyone
else fails open. Empty (default) = observe-only canary mode; `*` =
enforce for all sources (expands to 0.0.0.0/0 + ::/0 internally);
specific CIDRs = canary scope. One knob — no separate boolean. Parsing,
cache, and Network API calls run regardless of scope, so observation-mode
metrics show what would happen under a flip.
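The scope knob could be parsed roughly as follows. This is a sketch only: `Cidr`, `parse_scope`, and `should_enforce` are hypothetical names, the comma separator is an assumption, and so is silently skipping malformed entries.

```rust
use std::net::IpAddr;

/// Hypothetical CIDR type: a network address plus prefix length.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Cidr {
    network: IpAddr,
    prefix_len: u8,
}

impl Cidr {
    pub fn parse(s: &str) -> Option<Cidr> {
        let (addr, len) = s.trim().split_once('/')?;
        let network: IpAddr = addr.parse().ok()?;
        let prefix_len: u8 = len.parse().ok()?;
        let max = if network.is_ipv4() { 32 } else { 128 };
        (prefix_len <= max).then_some(Cidr { network, prefix_len })
    }

    pub fn contains(&self, ip: IpAddr) -> bool {
        match (self.network, ip) {
            (IpAddr::V4(net), IpAddr::V4(ip)) => {
                // A /0 prefix needs special handling: shifting by the full width overflows.
                let mask = if self.prefix_len == 0 {
                    0
                } else {
                    u32::MAX << (32 - u32::from(self.prefix_len))
                };
                (u32::from(net) & mask) == (u32::from(ip) & mask)
            }
            (IpAddr::V6(net), IpAddr::V6(ip)) => {
                let mask = if self.prefix_len == 0 {
                    0
                } else {
                    u128::MAX << (128 - u32::from(self.prefix_len))
                };
                (u128::from(net) & mask) == (u128::from(ip) & mask)
            }
            // Address families differ: never a match.
            _ => false,
        }
    }
}

/// Empty input yields an empty scope (observe-only); `*` expands to both
/// catch-all networks; anything else is read as a comma-separated CIDR list.
pub fn parse_scope(raw: &str) -> Vec<Cidr> {
    let raw = raw.trim();
    if raw == "*" {
        return ["0.0.0.0/0", "::/0"].iter().filter_map(|s| Cidr::parse(s)).collect();
    }
    raw.split(',').filter_map(Cidr::parse).collect()
}

/// True when a missing or invalid key from `source` should produce a 403.
pub fn should_enforce(scope: &[Cidr], source: IpAddr) -> bool {
    scope.iter().any(|cidr| cidr.contains(source))
}
```

With a single list there is no state where a boolean says "enforce" while the list says "nobody": an empty scope is the observe-only mode.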
Source-IP bypass via INTERNAL_ALLOWLIST + TRUSTED_IPS lets internal
traffic skip the Bearer path entirely. Real client IP is resolved by
walking X-Original-Forwarded-For rightmost-first, stripping TRUSTED_IPS
hops; the first non-trusted IP from the right is the source the upstream
LB observed at TCP handshake. This is robust against leftmost-XFF
spoofing (a real risk with ingress-nginx default proxy-real-ip-cidr=
0.0.0.0/0 — empirically reproduced before settling on this header).
Falls back to ConnectInfo<SocketAddr> when no XOFF is present (direct
ClusterIP path).
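The rightmost-first walk might look like the sketch below. `resolve_client_ip` and the `is_trusted` predicate are hypothetical stand-ins for the TRUSTED_IPS check; stopping at an unparsable hop and falling back to the socket peer when every hop is trusted are assumptions, not behaviour confirmed by the PR.

```rust
use std::net::IpAddr;

/// Walks the header value from the right, skipping trusted proxy hops.
/// The first untrusted address is what the outermost trusted hop saw
/// at TCP handshake, so entries to its left cannot influence the result.
pub fn resolve_client_ip(
    xoff: Option<&str>,
    peer: IpAddr,
    is_trusted: impl Fn(IpAddr) -> bool,
) -> IpAddr {
    if let Some(header) = xoff {
        for hop in header.rsplit(',') {
            match hop.trim().parse::<IpAddr>() {
                Ok(ip) if is_trusted(ip) => continue,
                Ok(ip) => return ip,
                // Assumption: a garbled hop ends the walk instead of
                // trusting anything further to the left.
                Err(_) => break,
            }
        }
    }
    // No header, or nothing usable in it: use the socket peer address.
    peer
}
```

For a header of `1.2.3.4, 203.0.113.7, 10.0.0.1` with `10.0.0.0/8` trusted, the walk returns `203.0.113.7`; the client-supplied `1.2.3.4` on the left is never consulted.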
DISABLE_V2_AUTH global kill switch short-circuits at the very top of
the middleware: every request is allowed without parsing headers,
touching the cache, calling the Network API, or running the latency
timer. Counted as sqd_v2_auth_total{result="disabled"} so the dashboard
surfaces that the switch is engaged.
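To illustrate the ordering, here is a hedged sketch of a decision function in which the kill switch is checked before anything else. `decide`, `Validation`, and `Decision` are hypothetical; the result strings are the `sqd_v2_auth_total` label values, and mapping an unavailable upstream to `fail_open` is an assumption based on the fail-open paths described in this message.

```rust
/// Outcome of asking the cache / Network API about a token (hypothetical).
pub enum Validation {
    Valid,
    Invalid,
    Unavailable,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Decision {
    pub allow: bool,
    /// Value for the `result` label on sqd_v2_auth_total.
    pub result: &'static str,
}

pub fn decide(
    kill_switch: bool,
    in_scope: bool,
    token: Option<&str>,
    validate: impl FnOnce(&str) -> Validation,
) -> Decision {
    // Checked first: nothing below runs, so `validate` is never called.
    if kill_switch {
        return Decision { allow: true, result: "disabled" };
    }
    let Some(token) = token else {
        // Out-of-scope sources fail open but are still counted as missing.
        return Decision { allow: !in_scope, result: "missing" };
    };
    match validate(token) {
        Validation::Valid => Decision { allow: true, result: "ok" },
        Validation::Invalid => Decision { allow: !in_scope, result: "invalid" },
        Validation::Unavailable => Decision { allow: true, result: "fail_open" },
    }
}
```

Because the label is computed independently of `in_scope`, observe-only traffic still reports `missing` and `invalid`, which is what lets the dashboard predict the effect of widening the scope.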
Network API client: 250ms timeout, half-open circuit breaker (50
errors / 30s window / one probe). Probe admission returns a RAII Permit
that records failure on Drop unless explicitly resolved, so a cancelled
probe re-opens the breaker instead of leaving it stuck HalfOpen.
Singleflight via DashMap of per-key tokio mutexes coalesces concurrent
misses to one upstream call; FailedRecently sentinel (1s TTL) drains
the queue without each waiter re-issuing a 250ms timeout.
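The probe permit can be sketched with a `Drop` impl, as below. This is a simplified illustration rather than the PR's code: `Breaker`, `Permit`, and `try_probe` are hypothetical names, and the 50-error / 30s window accounting and the singleflight layer are omitted.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

#[derive(Clone)]
pub struct Breaker {
    state: Arc<Mutex<State>>,
}

/// Handed to the single probe request while the breaker is HalfOpen.
pub struct Permit {
    breaker: Breaker,
    resolved: bool,
}

impl Breaker {
    pub fn new(initial: State) -> Self {
        Breaker { state: Arc::new(Mutex::new(initial)) }
    }

    pub fn state(&self) -> State {
        *self.state.lock().unwrap()
    }

    /// Open -> HalfOpen, admitting exactly one probe at a time.
    pub fn try_probe(&self) -> Option<Permit> {
        let mut state = self.state.lock().unwrap();
        if *state != State::Open {
            return None;
        }
        *state = State::HalfOpen;
        Some(Permit { breaker: self.clone(), resolved: false })
    }
}

impl Permit {
    /// Probe succeeded: close the breaker.
    pub fn success(mut self) {
        self.resolved = true;
        *self.breaker.state.lock().unwrap() = State::Closed;
    }

    /// Probe failed: re-open the breaker.
    pub fn failure(mut self) {
        self.resolved = true;
        *self.breaker.state.lock().unwrap() = State::Open;
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        // A permit dropped without a verdict (for example a cancelled
        // future) counts as a failure, so HalfOpen can never get stuck.
        if !self.resolved {
            *self.breaker.state.lock().unwrap() = State::Open;
        }
    }
}
```

Tying the failure path to `Drop` covers cancellation without any extra bookkeeping at the call site: whichever way the probe future ends, the breaker leaves HalfOpen.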
Metrics: sqd_v2_auth_total{result=ok|missing|invalid|fail_open|disabled},
sqd_v2_auth_latency_seconds, sqd_v2_cache_{hit,miss}_total,
sqd_v2_validate_call_total, sqd_v2_requests_by_key (top-100 sketch
with explicit eviction so Prometheus cardinality stays bounded),
sqd_router_worker_urls_handed_total (Worker:Router ratio canary).
key_id is the only token-derived value in logs/labels; the full token
is never logged. Bypass requests use user_id="internal:<ip>" for
attribution.
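One way to keep the per-key series bounded is a Space-Saving style counter that reports which key it evicted, so the caller can drop that label from the metric. The PR does not say which sketch it uses; the algorithm choice and the names `TopKeys` and `record` are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Tracks at most `capacity` keys (Space-Saving style, an assumption).
pub struct TopKeys {
    capacity: usize,
    counts: HashMap<String, u64>,
}

impl TopKeys {
    pub fn new(capacity: usize) -> Self {
        TopKeys { capacity, counts: HashMap::new() }
    }

    pub fn len(&self) -> usize {
        self.counts.len()
    }

    pub fn count(&self, key_id: &str) -> Option<u64> {
        self.counts.get(key_id).copied()
    }

    /// Records one request. Returns the evicted key, if any, so the caller
    /// can remove the matching label series from the metrics registry.
    pub fn record(&mut self, key_id: &str) -> Option<String> {
        if let Some(count) = self.counts.get_mut(key_id) {
            *count += 1;
            return None;
        }
        let mut evicted = None;
        let mut inherited = 0;
        if self.counts.len() >= self.capacity {
            // Evict the current minimum; the newcomer inherits its count,
            // which over-estimates but never under-estimates a key.
            let (victim, min) = self
                .counts
                .iter()
                .min_by_key(|(_, count)| **count)
                .map(|(key, count)| (key.clone(), *count))?;
            self.counts.remove(&victim);
            inherited = min;
            evicted = Some(victim);
        }
        self.counts.insert(key_id.to_owned(), inherited + 1);
        evicted
    }
}
```

With a capacity of 100, the number of `key_id` label values exported can never exceed 100, provided the caller removes the evicted key's series each time `record` returns one.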
90 tests covering cache states + TTL clamp, breaker half-open + probe
cancellation, fail-open paths + sentinel queue drain, singleflight
under concurrency + cleanup TOCTOU, log redaction, top-keys count
semantics, bounded label cardinality, IP-bypass + walk-rightmost XFF
anti-spoof (T1-T4 derived from empirical prod debugging), kill switch
short-circuit (no decide / no metrics drift), and canary-scope
enforcement (in-scope, out-of-scope, wildcard, narrow without
ConnectInfo, XOFF integration, allowlist precedence).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tmcgroul approved these changes on May 2, 2026
Summary
- `GET /network/:dataset/:start_block/worker`: parses `Authorization: Bearer sqd_data_<id>_<rnd>`, validates via cache → Network API, returns 403 with `CREDENTIALS_INVALID` body + `WWW-Authenticate: Bearer realm="sqd-archive"`. Gated by `ENFORCE_V2_AUTH` (default off): the flag changes only the action; parsing, cache, and Network API calls run either way so adoption metrics stay clean.
- LRU cache (`moka` + own `Clock` trait for deterministic time tests) with three explicit states `UNDEFINED`/`EXISTS`/`DELETED`, asymmetric TTLs (60s / 15s), `expires_at` clamp, and singleflight (`DashMap` of per-key `tokio::sync::Mutex`) so concurrent misses for the same key produce one upstream call.
- Metrics (`sqd_v2_auth_total`, `sqd_v2_auth_latency_seconds`, `sqd_v2_cache_{hit,miss}_total`, `sqd_v2_validate_call_total`, `sqd_v2_requests_by_key` with bounded top-100 sketch + explicit `remove_label_values` on eviction so cardinality stays bounded) plus `sqd_router_worker_urls_handed_total` for the Worker:Router ratio canary. `key_id` is the only token-derived value in logs/labels; the full token is never logged.
Resilience details
- […] `DELETED` still denying.
- […] `_`, non-alphanumeric chars per Tech Plan §2.1 alphabet).
Wiring
- `cli.rs`: `--enforce-v2-auth` / `ENFORCE_V2_AUTH`, `--network-api-url` / `NETWORK_API_URL`.
- `server.rs`: middleware mounted only on the worker route via `.route(...).layer(from_fn(auth))`; `/ping`, `/network/:dataset/height`, `/metrics` stay open.
- `main.rs`: `NetworkApiClient::disabled()` if `NETWORK_API_URL` unset (dev-friendly: every request fails open).
Test plan
- `cargo +1.89 test -p router`: 51 passed, 0 failed
- […] `expires_at` clamp, already-expired downgrade, capacity eviction, `UNDEFINED ≠ DELETED` regression
- […] `Token:` fallback, cheap-reject, enforce on/off, bad-key flood (1 call/15s), timeout fail-open without cache write, cached `Deleted` denies during outage, breaker open passes through, `key_id` in extensions, constant-time mismatch, miss→hit, latency histogram, concurrent miss flood (32 racers → 1 API call), strict parser (extra `_`, missing rnd, non-alnum), malformed-body doesn't poison cache
- […] `key_id` only
- […] `sqd_v2_validate_call_total{result="fail_open"}` after deploy
🤖 Generated with Claude Code