Router: API key auth middleware, LRU cache, and v2 metrics (NET-371/373/375)#58

Merged: mo4islona merged 1 commit into arrowsquid from net-371-router-auth-middleware on May 3, 2026
@mo4islona

Summary

  • NET-371: auth middleware on GET /network/:dataset/:start_block/worker — parses Authorization: Bearer sqd_data_<id>_<rnd>, validates via cache → Network API, returns 403 with CREDENTIALS_INVALID body + WWW-Authenticate: Bearer realm="sqd-archive". Gated by ENFORCE_V2_AUTH (default off): the flag changes only the action — parsing, cache, and Network API calls run either way so adoption metrics stay clean.
  • NET-373: in-process LRU cache (moka + own Clock trait for deterministic time tests) with three explicit states UNDEFINED / EXISTS / DELETED, asymmetric TTLs (60s / 15s), expires_at clamp, and singleflight (DashMap of per-key tokio::sync::Mutex) so concurrent misses for the same key produce one upstream call.
  • NET-375: Prometheus metrics (sqd_v2_auth_total, sqd_v2_auth_latency_seconds, sqd_v2_cache_{hit,miss}_total, sqd_v2_validate_call_total, sqd_v2_requests_by_key with bounded top-100 sketch + explicit remove_label_values on eviction so cardinality stays bounded) plus sqd_router_worker_urls_handed_total for the Worker:Router ratio canary. key_id is the only token-derived value in logs/labels; the full token is never logged.
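The singleflight behaviour in the NET-373 item can be illustrated with a minimal sketch. It is not the PR's code: it uses only the standard library (blocking `std::sync::Mutex` and threads rather than DashMap and `tokio::sync::Mutex`), and `Validator`, `call_upstream`, and `race` are hypothetical names. Gate entries are never removed here, whereas the PR tests that its map does not leak.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical validator: a verdict cache plus one gate (mutex) per key.
pub struct Validator {
    cache: Mutex<HashMap<String, bool>>,
    gates: Mutex<HashMap<String, Arc<Mutex<()>>>>,
    pub upstream_calls: AtomicUsize,
}

impl Validator {
    pub fn new() -> Self {
        Validator {
            cache: Mutex::new(HashMap::new()),
            gates: Mutex::new(HashMap::new()),
            upstream_calls: AtomicUsize::new(0),
        }
    }

    fn cached(&self, key_id: &str) -> Option<bool> {
        self.cache.lock().unwrap().get(key_id).copied()
    }

    /// Stand-in for the Network API call (assumption: always answers "valid").
    fn call_upstream(&self, _key_id: &str) -> bool {
        self.upstream_calls.fetch_add(1, Ordering::SeqCst);
        thread::sleep(Duration::from_millis(20)); // simulate network latency
        true
    }

    pub fn validate(&self, key_id: &str) -> bool {
        if let Some(hit) = self.cached(key_id) {
            return hit;
        }
        // Fetch (or create) the gate for this key, then wait on it.
        let gate = self.gates.lock().unwrap().entry(key_id.to_string()).or_default().clone();
        let _guard = gate.lock().unwrap();
        // Re-check: a racer that held the gate before us may have filled the cache.
        if let Some(hit) = self.cached(key_id) {
            return hit;
        }
        let verdict = self.call_upstream(key_id);
        self.cache.lock().unwrap().insert(key_id.to_string(), verdict);
        verdict
    }
}

/// Spawn `racers` threads that all miss on the same key; return upstream call count.
pub fn race(racers: usize) -> usize {
    let v = Arc::new(Validator::new());
    let handles: Vec<_> = (0..racers)
        .map(|_| {
            let v = Arc::clone(&v);
            thread::spawn(move || v.validate("k1"))
        })
        .collect();
    for h in handles {
        assert!(h.join().unwrap());
    }
    v.upstream_calls.load(Ordering::SeqCst)
}
```

The second cache lookup under the gate is what makes the coalescing work: only the first holder reaches `call_upstream`, and every later holder finds the verdict already cached.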

Resilience details

  • Network API client: 250ms timeout, half-open circuit breaker (50 consecutive errors → open 30s → exactly one probe admitted; probe failure re-opens for another full window).
  • Fail-open on outage with cached DELETED still denying.
  • Strict token parser (rejects extra _, non-alphanumeric chars per Tech Plan §2.1 alphabet).
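A strict parser for the `sqd_data_<id>_<rnd>` format might look like the sketch below. The alphabet is an assumption (ASCII alphanumerics for both parts, since Tech Plan §2.1 is not quoted here), and `ParsedToken` and `parse_bearer` are hypothetical names. The sketch also requires the exact `Bearer ` prefix, which may be stricter than the real middleware.

```rust
/// The two token-derived parts of a well-formed credential.
#[derive(Debug, PartialEq, Eq)]
pub struct ParsedToken<'a> {
    pub key_id: &'a str,
    pub rnd: &'a str,
}

/// Parse the value of an `Authorization` header.
/// Returns `None` for anything that is not exactly `Bearer sqd_data_<id>_<rnd>`.
pub fn parse_bearer(header: &str) -> Option<ParsedToken<'_>> {
    let token = header.strip_prefix("Bearer ")?;
    let rest = token.strip_prefix("sqd_data_")?;

    // Exactly two underscore-separated parts; a third part means an extra `_`.
    let mut parts = rest.split('_');
    let key_id = parts.next()?;
    let rnd = parts.next()?;
    if parts.next().is_some() {
        return None;
    }

    // Assumed alphabet: non-empty ASCII alphanumerics only.
    let ok = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric());
    if ok(key_id) && ok(rnd) {
        Some(ParsedToken { key_id, rnd })
    } else {
        None
    }
}
```

Because malformed tokens are rejected before any cache or network lookup, this kind of check is cheap to run on every request.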

Wiring

  • cli.rs: --enforce-v2-auth / ENFORCE_V2_AUTH, --network-api-url / NETWORK_API_URL.
  • server.rs: middleware mounted only on the worker route via .route(...).layer(from_fn(auth)); /ping, /network/:dataset/height, /metrics stay open.
  • main.rs: NetworkApiClient::disabled() if NETWORK_API_URL unset (dev-friendly: every request fails open).

Test plan

  • cargo +1.89 test -p router: 51 passed, 0 failed
    • Cache (9): three states, TTL defaults, expires_at clamp, already-expired downgrade, capacity eviction, UNDEFINED ≠ DELETED regression
    • Client (12): 200/404/500/timeout/connection-error, malformed-success-body, missing-user_id, breaker open / probe / reset / half-open admits exactly 1 / probe failure re-opens
    • Middleware (20): Bearer extraction, no Token: fallback, cheap-reject, enforce on/off, bad-key flood (1 call/15s), timeout fail-open without cache write, cached Deleted denies during outage, breaker open passes through, key_id in extensions, constant-time mismatch, miss→hit, latency histogram, concurrent miss flood (32 racers → 1 API call), strict parser (extra _, missing rnd, non-alnum), malformed-body doesn't poison cache
    • Logging (2): full token never logged, warn carries key_id only
    • Top-keys (3): bounded cardinality, high-count keys retained, eviction signal
    • Singleflight (3): per-key serialisation, distinct keys parallel, no leak
    • Server (1): worker URL counter doesn't increment on 503
  • Local smoke test against a wiremock standing in for Network API
  • Staging soak: watch sqd_v2_validate_call_total{result="fail_open"} after deploy

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 28, 2026 15:02

@mo4islona mo4islona force-pushed the net-371-router-auth-middleware branch 9 times, most recently from 866dd7c to 1d4bbe1 Compare April 30, 2026 10:22
NET-371 + NET-373 + NET-375. Validates Bearer sqd_data_<id>_<rnd> on
GET /network/:dataset/:start_block/worker against an in-process LRU
(60s exists / 15s deleted, three explicit states UNDEFINED/EXISTS/DELETED
plus a brief FailedRecently sentinel) backed by POST /internal/validate.
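
As an illustration of the cache semantics above, here is a minimal standard-library sketch of the three states, the asymmetric TTLs, and an expires_at clamp behind an injectable clock. It is not the PR's moka-based cache: there is no capacity bound or LRU eviction, the FailedRecently sentinel is omitted, and `KeyCache`, `ManualClock`, and the millisecond clock are hypothetical.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::time::Duration;

/// Time source abstraction so tests can advance time by hand.
pub trait Clock {
    /// Milliseconds since an arbitrary epoch.
    fn now_ms(&self) -> u64;
}

/// Test clock that only moves when told to.
pub struct ManualClock(pub Cell<u64>);

impl Clock for ManualClock {
    fn now_ms(&self) -> u64 {
        self.0.get()
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyState {
    Undefined, // never seen, or entry expired: the caller must ask upstream
    Exists,
    Deleted,
}

const EXISTS_TTL: Duration = Duration::from_secs(60);
const DELETED_TTL: Duration = Duration::from_secs(15);

pub struct KeyCache<C: Clock> {
    pub clock: C,
    entries: HashMap<String, (KeyState, u64)>, // state + deadline in ms
}

impl<C: Clock> KeyCache<C> {
    pub fn new(clock: C) -> Self {
        KeyCache { clock, entries: HashMap::new() }
    }

    /// Record an upstream verdict. `key_expires_at_ms` is the key's own expiry, if known.
    pub fn put(&mut self, key_id: &str, exists: bool, key_expires_at_ms: Option<u64>) {
        let now = self.clock.now_ms();
        let (state, ttl) = if exists {
            (KeyState::Exists, EXISTS_TTL)
        } else {
            (KeyState::Deleted, DELETED_TTL)
        };
        let mut deadline = now + ttl.as_millis() as u64;
        // Clamp: never serve Exists past the key's own expiry.
        if let (KeyState::Exists, Some(exp)) = (state, key_expires_at_ms) {
            deadline = deadline.min(exp);
        }
        self.entries.insert(key_id.to_string(), (state, deadline));
    }

    /// Expired or unknown entries read as Undefined, never as Deleted.
    pub fn get(&self, key_id: &str) -> KeyState {
        match self.entries.get(key_id) {
            Some(&(state, deadline)) if self.clock.now_ms() < deadline => state,
            _ => KeyState::Undefined,
        }
    }
}
```

Keeping Undefined distinct from Deleted matters for the fail-open path: an unknown key can be let through during an outage, while a key known to be deleted keeps being denied until its entry expires.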

Enforcement scope is configurable via ENFORCE_V2_AUTH_FOR_IPS — a CIDR
list whose source IPs trigger 403 on missing/invalid keys; everyone
else fails open. Empty (default) = observe-only canary mode; `*` =
enforce for all sources (expands to 0.0.0.0/0 + ::/0 internally);
specific CIDRs = canary scope. One knob — no separate boolean. Parsing,
cache, and Network API calls run regardless of scope, so observation-mode
metrics show what would happen under a flip.
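
The scope rules above could be evaluated roughly as follows. This is a standard-library sketch, not the PR's implementation: the comma separator is an assumption, invalid entries are silently skipped here (a real parser would more likely reject them at startup), and `Cidr`, `parse_scope`, and `enforced_for` are hypothetical names.

```rust
use std::net::IpAddr;

/// A CIDR block stored as raw address bits plus address width and prefix length.
#[derive(Debug, Clone, Copy)]
pub struct Cidr {
    bits: u128,
    width: u32,  // 32 for IPv4, 128 for IPv6
    prefix: u32, // number of leading bits that must match
}

fn to_bits(ip: IpAddr) -> (u128, u32) {
    match ip {
        IpAddr::V4(a) => (u32::from(a) as u128, 32),
        IpAddr::V6(a) => (u128::from(a), 128),
    }
}

impl Cidr {
    /// Accepts `addr/len` or a bare address (treated as a single host).
    pub fn parse(s: &str) -> Option<Cidr> {
        let (addr, len) = match s.split_once('/') {
            Some((a, l)) => (a.parse::<IpAddr>().ok()?, Some(l.parse::<u32>().ok()?)),
            None => (s.parse::<IpAddr>().ok()?, None),
        };
        let (bits, width) = to_bits(addr);
        let prefix = len.unwrap_or(width);
        if prefix > width {
            return None;
        }
        Some(Cidr { bits, width, prefix })
    }

    pub fn contains(&self, ip: IpAddr) -> bool {
        let (bits, width) = to_bits(ip);
        if width != self.width {
            return false; // IPv4 blocks never match IPv6 addresses and vice versa
        }
        if self.prefix == 0 {
            return true; // avoids shifting by the full width
        }
        let shift = width - self.prefix;
        (bits >> shift) == (self.bits >> shift)
    }
}

/// Empty input = observe-only; `*` = both address families; otherwise a CIDR list.
pub fn parse_scope(raw: &str) -> Vec<Cidr> {
    let raw = raw.trim();
    if raw == "*" {
        return vec![Cidr::parse("0.0.0.0/0").unwrap(), Cidr::parse("::/0").unwrap()];
    }
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .filter_map(Cidr::parse)
        .collect()
}

/// True when a missing or invalid key from `ip` should produce a 403.
pub fn enforced_for(scope: &[Cidr], ip: IpAddr) -> bool {
    scope.iter().any(|c| c.contains(ip))
}
```

With an empty scope `enforced_for` is always false, which matches the observe-only default: the decision is computed and counted, but never turned into a 403.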

Source-IP bypass via INTERNAL_ALLOWLIST + TRUSTED_IPS lets internal
traffic skip the Bearer path entirely. Real client IP is resolved by
walking X-Original-Forwarded-For rightmost-first, stripping TRUSTED_IPS
hops; the first non-trusted IP from the right is the source the upstream
LB observed at TCP handshake. This is robust against leftmost-XFF
spoofing (a real risk with ingress-nginx default proxy-real-ip-cidr=
0.0.0.0/0 — empirically reproduced before settling on this header).
Falls back to ConnectInfo<SocketAddr> when no XOFF is present (direct
ClusterIP path).
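
The rightmost-first walk can be sketched as below. `resolve_client_ip` and the `is_trusted` callback are hypothetical; falling back to the TCP peer when every hop is trusted, or when a hop fails to parse, is an assumption rather than confirmed behaviour of the PR.

```rust
use std::net::IpAddr;

/// Resolve the client IP from an X-Original-Forwarded-For value.
/// `peer` is the TCP peer address, used when the header is absent or unusable.
pub fn resolve_client_ip(
    xoff: Option<&str>,
    peer: IpAddr,
    is_trusted: impl Fn(IpAddr) -> bool,
) -> IpAddr {
    let Some(header) = xoff else { return peer };

    // Walk hops from the right: those were appended by our own proxies.
    for hop in header.rsplit(',') {
        match hop.trim().parse::<IpAddr>() {
            Ok(ip) if is_trusted(ip) => continue, // known proxy hop, keep walking left
            Ok(ip) => return ip,                  // first non-trusted hop from the right
            Err(_) => break,                      // garbage in the chain: stop trusting it
        }
    }
    peer
}
```

A client can prepend anything it likes to the left of the header, but it cannot control what the trusted proxies append on the right, so the leftmost entries are never consulted once a non-trusted hop has been found.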

DISABLE_V2_AUTH global kill switch short-circuits at the very top of
the middleware: every request is allowed without parsing headers,
touching the cache, calling the Network API, or running the latency
timer. Counted as sqd_v2_auth_total{result="disabled"} so the dashboard
surfaces that the switch is engaged.

Network API client: 250ms timeout, half-open circuit breaker (50
errors / 30s window / one probe). Probe admission returns a RAII Permit
that records failure on Drop unless explicitly resolved, so a cancelled
probe re-opens the breaker instead of leaving it stuck HalfOpen.
Singleflight via DashMap of per-key tokio mutexes coalesces concurrent
misses to one upstream call; FailedRecently sentinel (1s TTL) drains
the queue without each waiter re-issuing a 250ms timeout.
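
The breaker and its RAII permit might be modelled as in the sketch below. It is an illustration under stated assumptions: time is passed in as milliseconds so the logic stays deterministic, the re-open window is measured from the moment the permit was acquired, and `Breaker`, `Permit`, and `try_acquire` are hypothetical names rather than the PR's API.

```rust
use std::sync::Mutex;

const THRESHOLD: u32 = 50; // consecutive errors before opening
const OPEN_MS: u64 = 30_000; // how long the breaker stays open

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Closed { errors: u32 },
    Open { until_ms: u64 },
    HalfOpen, // exactly one probe is in flight
}

pub struct Breaker {
    state: Mutex<State>,
}

/// Handed to a caller that may go upstream. Dropping it unresolved counts as a failure.
pub struct Permit<'a> {
    breaker: &'a Breaker,
    now_ms: u64,
    resolved: bool,
}

impl Breaker {
    pub fn new() -> Self {
        Breaker { state: Mutex::new(State::Closed { errors: 0 }) }
    }

    pub fn try_acquire(&self, now_ms: u64) -> Option<Permit<'_>> {
        let mut st = self.state.lock().unwrap();
        let current = *st;
        match current {
            State::Closed { .. } => {}
            // Window elapsed: admit this caller as the single probe.
            State::Open { until_ms } if now_ms >= until_ms => *st = State::HalfOpen,
            State::Open { .. } | State::HalfOpen => return None,
        }
        Some(Permit { breaker: self, now_ms, resolved: false })
    }

    fn record(&self, ok: bool, now_ms: u64) {
        let mut st = self.state.lock().unwrap();
        let reopen = State::Open { until_ms: now_ms + OPEN_MS };
        let next = match (*st, ok) {
            (open @ State::Open { .. }, _) => open, // late result while open: ignore
            (_, true) => State::Closed { errors: 0 },
            (State::HalfOpen, false) => reopen, // probe failed: full window again
            (State::Closed { errors }, false) if errors + 1 >= THRESHOLD => reopen,
            (State::Closed { errors }, false) => State::Closed { errors: errors + 1 },
        };
        *st = next;
    }
}

impl Permit<'_> {
    pub fn success(mut self) {
        self.resolved = true;
        self.breaker.record(true, self.now_ms);
    }

    pub fn failure(mut self) {
        self.resolved = true;
        self.breaker.record(false, self.now_ms);
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        // A cancelled call (permit dropped without a verdict) is treated as a failure.
        if !self.resolved {
            self.breaker.record(false, self.now_ms);
        }
    }
}
```

Tying the failure record to `Drop` means a probe whose future is cancelled mid-flight still moves the breaker out of HalfOpen, so later callers are not refused indefinitely.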

Metrics: sqd_v2_auth_total{result=ok|missing|invalid|fail_open|disabled},
sqd_v2_auth_latency_seconds, sqd_v2_cache_{hit,miss}_total,
sqd_v2_validate_call_total, sqd_v2_requests_by_key (top-100 sketch
with explicit eviction so Prometheus cardinality stays bounded),
sqd_router_worker_urls_handed_total (Worker:Router ratio canary).
key_id is the only token-derived value in logs/labels; the full token
is never logged. Bypass requests use user_id="internal:<ip>" for
attribution.
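
One way to keep a per-key counter bounded is a Space-Saving style tracker, sketched below. This is a swapped-in technique chosen for illustration, since the PR does not say which sketch it uses; `TopKeys` and `observe` are hypothetical names. The return value plays the role of the eviction signal that would drive `remove_label_values`.

```rust
use std::collections::HashMap;

/// Tracks at most `capacity` keys; the least-counted key is replaced when full.
pub struct TopKeys {
    capacity: usize,
    counts: HashMap<String, u64>,
}

impl TopKeys {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        TopKeys { capacity, counts: HashMap::new() }
    }

    /// Count one request. Returns the key evicted to make room, if any,
    /// so the caller can drop that key's metric series.
    pub fn observe(&mut self, key_id: &str) -> Option<String> {
        if let Some(c) = self.counts.get_mut(key_id) {
            *c += 1;
            return None;
        }
        if self.counts.len() < self.capacity {
            self.counts.insert(key_id.to_string(), 1);
            return None;
        }
        // Full: replace the smallest counter; the newcomer inherits min + 1
        // (the Space-Saving over-estimate). Ties are broken arbitrarily.
        let (victim, min) = self
            .counts
            .iter()
            .min_by_key(|(_, c)| **c)
            .map(|(k, c)| (k.clone(), *c))?;
        self.counts.remove(&victim);
        self.counts.insert(key_id.to_string(), min + 1);
        Some(victim)
    }

    pub fn len(&self) -> usize {
        self.counts.len()
    }

    pub fn count(&self, key_id: &str) -> Option<u64> {
        self.counts.get(key_id).copied()
    }
}
```

The linear scan for the minimum is acceptable at a capacity of 100; a heap or ordered index would only matter for much larger trackers.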

90 tests covering cache states + TTL clamp, breaker half-open + probe
cancellation, fail-open paths + sentinel queue drain, singleflight
under concurrency + cleanup TOCTOU, log redaction, top-keys count
semantics, bounded label cardinality, IP-bypass + walk-rightmost XFF
anti-spoof (T1-T4 derived from empirical prod debugging), kill switch
short-circuit (no decide / no metrics drift), and canary-scope
enforcement (in-scope, out-of-scope, wildcard, narrow without
ConnectInfo, XOFF integration, allowlist precedence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mo4islona mo4islona force-pushed the net-371-router-auth-middleware branch from 1d4bbe1 to f3fa3f9 Compare April 30, 2026 12:44
@mo4islona mo4islona merged commit 9eafc5a into arrowsquid May 3, 2026
2 checks passed
@mo4islona mo4islona deleted the net-371-router-auth-middleware branch May 3, 2026 22:39