
Response status updates #531

Open
MichaelGHSeg wants to merge 39 commits into master from response-status-updates

Conversation

@MichaelGHSeg
Contributor

No description provided.

MichaelGHSeg and others added 23 commits December 3, 2025 15:03
* Changing release plugin

* Deploy working

* Rolling back to snapshot

* [maven-release-plugin] prepare release analytics-parent-3.5.4

* [maven-release-plugin] prepare for next development iteration

* putting back release instructions
- Add Authorization header with Basic auth to all requests
- Reduce retry backoff times for faster recovery (100ms base, 1min cap)
- Respect Retry-After header with 5-minute max cap
- Always include X-Retry-Count header in requests
- Consolidate SegmentService upload methods
- Increase max flush attempts to accommodate shorter backoff times

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove 413 (Payload Too Large) from retryable status codes
  Payload size won't shrink on retry, so retrying is pointless

- Verify 503 Retry-After handling matches spec
  503 checks Retry-After header first, falls back to exponential backoff

- Add test for 413 non-retryable behavior
- Add test for Retry-After delay mechanism (validates cap logic)
  The 300-second cap is enforced at AnalyticsClient.java:556-558

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix import ordering (anyInt moved to correct position)
- Split long lines to respect line length limits
- Remove extra blank line

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This pull request implements comprehensive improvements to the response status handling and retry mechanism for the Segment Analytics client. The changes introduce Retry-After header support, update the retry logic to distinguish between server-directed and backoff-based retries, add an X-Retry-Count header to all requests, and move authentication from URL to Authorization header using Basic auth.

Changes:

  • Implemented Retry-After header parsing and handling for status codes 429, 408, and 503; retries driven by server-specified delays do not count against the client's retry budget
  • Added X-Retry-Count header to all upload requests to communicate the retry attempt number (starting at 0)
  • Moved authentication from URL-based to Authorization header using HTTP Basic auth with the write key (a sketch follows this list)
  • Expanded retryable status codes to include specific 4xx errors (408, 410, 429, 460) and refined 5xx retry logic to exclude 501, 505, and 511
  • Adjusted backoff configuration (100ms base, 1min cap) and increased default maximumFlushAttempts from 3 to 1000
  • Added comprehensive test coverage for new retry scenarios
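For the authentication bullet above, here is a minimal sketch of how such an interceptor can attach the header. It assumes the write key is sent as the Basic-auth username with an empty password (an assumption about the convention, not confirmed by this diff) and uses a hypothetical class name rather than the PR's actual AnalyticsRequestInterceptor:

```java
import java.io.IOException;
import okhttp3.Credentials;
import okhttp3.Interceptor;
import okhttp3.Request;
import okhttp3.Response;

// Hypothetical sketch — not the PR's AnalyticsRequestInterceptor.
final class BasicAuthSketchInterceptor implements Interceptor {
  private final String credentials;

  BasicAuthSketchInterceptor(String writeKey) {
    // Assumption: write key as username, empty password.
    this.credentials = Credentials.basic(writeKey, "");
  }

  @Override
  public Response intercept(Chain chain) throws IOException {
    Request authed =
        chain.request().newBuilder().header("Authorization", credentials).build();
    return chain.proceed(authed);
  }
}
```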

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| analytics-core/src/main/java/com/segment/analytics/http/SegmentService.java | Added X-Retry-Count header parameter to upload method signature |
| analytics/src/main/java/com/segment/analytics/AnalyticsRequestInterceptor.java | Added writeKey parameter and Authorization header construction using Basic auth |
| analytics/src/main/java/com/segment/analytics/Analytics.java | Updated interceptor instantiation with writeKey and increased default maximumFlushAttempts to 1000 |
| analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java | Implemented retry strategy enum, Retry-After header parsing, updated backoff configuration, refined status code retry logic, and separated tracking of total attempts vs backoff attempts |
| analytics/src/test/java/com/segment/analytics/internal/AnalyticsClientTest.java | Updated all mock calls to include retry count parameter and added 7 new test cases covering retryable client errors, non-retryable 5xx errors, Retry-After handling, X-Retry-Count header verification, and 413 Payload Too Large behavior |
| analytics/src/test/java/com/segment/analytics/AnalyticsRequestInterceptorTest.java | Added assertion to verify Authorization header is correctly set with Basic auth credentials |


Comment thread analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java Outdated
Comment thread analytics/src/main/java/com/segment/analytics/Analytics.java Outdated
Comment on lines +542 to 561
```java
private static Long parseRetryAfterSeconds(String headerValue) {
  if (headerValue == null) {
    return null;
  }
  headerValue = headerValue.trim();
  if (headerValue.isEmpty()) {
    return null;
  }
  try {
    long seconds = Long.parseLong(headerValue);
    if (seconds <= 0L) {
      return null;
    }
    if (seconds > MAX_RETRY_AFTER_SECONDS) {
      return MAX_RETRY_AFTER_SECONDS;
    }
    return seconds;
  } catch (NumberFormatException ignored) {
    return null;
  }
}
```

Copilot AI Feb 12, 2026


The parseRetryAfterSeconds method handles several edge cases (null, empty, negative, zero, over cap) but lacks explicit test coverage for these scenarios. While the happy path is tested in retryAfterHeaderRespectsShortDelay, consider adding tests for edge cases such as: invalid numeric format, negative values, zero, values exceeding MAX_RETRY_AFTER_SECONDS, empty strings, and whitespace-only strings to ensure the method is robust.
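A sketch of what those edge-case tests might look like, assuming parseRetryAfterSeconds were widened from private to package-private for testability (the PR keeps it private) and that the cap constant resolves to the 300 seconds referenced elsewhere in this PR:

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;
import org.junit.Test;

public class ParseRetryAfterSketchTest {
  @Test
  public void rejectsNullEmptyWhitespaceZeroNegativeAndGarbage() {
    assertNull(AnalyticsClient.parseRetryAfterSeconds(null));
    assertNull(AnalyticsClient.parseRetryAfterSeconds(""));
    assertNull(AnalyticsClient.parseRetryAfterSeconds("   "));
    assertNull(AnalyticsClient.parseRetryAfterSeconds("0"));
    assertNull(AnalyticsClient.parseRetryAfterSeconds("-5"));
    assertNull(AnalyticsClient.parseRetryAfterSeconds("not-a-number"));
  }

  @Test
  public void capsValuesAboveMax() {
    // Assumes MAX_RETRY_AFTER_SECONDS == 300, per this PR's description.
    assertEquals(Long.valueOf(300L), AnalyticsClient.parseRetryAfterSeconds("999999"));
  }
}
```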

Comment on lines +616 to +617
```java
if (status == 408 || status == 410 || status == 429 || status == 460) {
  return true;
```

Copilot AI Feb 12, 2026


Missing test coverage for HTTP status codes 410 and 460, which are explicitly marked as retryable in the isStatusRetryWithBackoff method. Add tests to verify that these status codes are indeed retried with backoff as intended.

Comment on lines +538 to +539
```java
private static boolean isStatusRetryAfterEligible(int status) {
  return status == 429 || status == 408 || status == 503;
```

Copilot AI Feb 12, 2026


Missing test coverage for HTTP status code 503 (Service Unavailable), which is both Retry-After eligible and retryable with backoff. Add tests to verify: 1) 503 with a valid Retry-After header uses the RETRY_AFTER strategy, and 2) 503 without a Retry-After header falls back to the BACKOFF strategy.
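A sketch of those two missing 503 cases. The helper uploadResultFor is hypothetical — it stands in for however the real test file drives a mocked upload — and the strategy names follow this comment's wording at review time (later commits rename and narrow them):

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class RetryAfter503SketchTest {
  // Hypothetical helper: performs one upload attempt against a mocked
  // SegmentService that returns the given status and Retry-After header.
  private AnalyticsClient.UploadResult uploadResultFor(int status, String retryAfter) {
    throw new UnsupportedOperationException("sketch only");
  }

  @Test
  public void serviceUnavailableWithRetryAfterUsesRetryAfterStrategy() {
    assertEquals(
        AnalyticsClient.RetryStrategy.RETRY_AFTER, uploadResultFor(503, "30").strategy);
  }

  @Test
  public void serviceUnavailableWithoutRetryAfterFallsBackToBackoff() {
    assertEquals(
        AnalyticsClient.RetryStrategy.BACKOFF, uploadResultFor(503, null).strategy);
  }
}
```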

Comment on lines +621 to +622
```java
if (status >= 500 && status < 600) {
  return status != 501 && status != 505 && status != 511;
```

Copilot AI Feb 12, 2026


Missing test coverage for HTTP status codes 505 (HTTP Version Not Supported) and 511 (Network Authentication Required), which are explicitly excluded from retry logic in the isStatusRetryWithBackoff method. While 501 is tested, add tests for 505 and 511 to ensure these status codes are not retried as intended.

Comment thread analytics/src/main/java/com/segment/analytics/Analytics.java Outdated
```java
private static boolean isStatusRetryWithBackoff(int status) {
  // Explicitly retry these client errors
  if (status == 408 || status == 410 || status == 429 || status == 460) {
```

Copilot AI Feb 12, 2026


HTTP status 410 (Gone) is being treated as retryable, but this status code typically indicates that a resource has been permanently removed and will never be available again. Retrying such requests is unlikely to succeed and wastes resources. Consider removing 410 from the list of retryable client errors unless there's a specific reason the Segment API uses 410 in a non-standard way to indicate a temporary condition.

Suggested change:

```diff
- if (status == 408 || status == 410 || status == 429 || status == 460) {
+ if (status == 408 || status == 429 || status == 460) {
```

Comment on lines +542 to 561
```java
private static Long parseRetryAfterSeconds(String headerValue) {
  if (headerValue == null) {
    return null;
  }
  headerValue = headerValue.trim();
  if (headerValue.isEmpty()) {
    return null;
  }
  try {
    long seconds = Long.parseLong(headerValue);
    if (seconds <= 0L) {
      return null;
    }
    if (seconds > MAX_RETRY_AFTER_SECONDS) {
      return MAX_RETRY_AFTER_SECONDS;
    }
    return seconds;
  } catch (NumberFormatException ignored) {
    return null;
  }
}
```

Copilot AI Feb 12, 2026


The Retry-After header parsing only handles numeric (delay-seconds) format but not HTTP-date format. According to RFC 7231, the Retry-After header can be either a delay in seconds or an HTTP-date. While delay-seconds is more common, some servers may use HTTP-date format. Consider adding support for HTTP-date parsing to be fully RFC-compliant, or document that only delay-seconds format is supported.
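If HTTP-date support were added, a sketch along these lines would cover it. The method name and the injected Clock are hypothetical; MAX_RETRY_AFTER_SECONDS is the PR's existing 300-second cap:

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical RFC 7231 HTTP-date fallback for Retry-After parsing.
private static Long parseRetryAfterHttpDate(String headerValue, Clock clock) {
  try {
    ZonedDateTime when =
        ZonedDateTime.parse(headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
    long seconds = Duration.between(Instant.now(clock), when.toInstant()).getSeconds();
    if (seconds <= 0L) {
      return null; // date in the past: no delay to honor
    }
    return Math.min(seconds, MAX_RETRY_AFTER_SECONDS);
  } catch (DateTimeParseException ignored) {
    return null;
  }
}
```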

Comment on lines +508 to +514
```java
if (isStatusRetryWithBackoff(status)) {
  client.log.print(
      DEBUG,
      "Could not upload batch %s due to retryable status %s. Retrying with backoff.",
      batch.sequence(),
      status);
  return new UploadResult(RetryStrategy.BACKOFF);
```

Copilot AI Feb 12, 2026


Missing test coverage for the fallback behavior when a status code is Retry-After eligible (429, 408, 503) but the Retry-After header is missing or invalid. In this case, the code should fall back to the BACKOFF strategy. Add a test to verify this fallback behavior works correctly, ensuring that requests with invalid Retry-After headers still get retried using exponential backoff instead of failing immediately.

MichaelGHSeg and others added 4 commits February 12, 2026 18:08
…csClient.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Remove 408/503 from Retry-After eligibility (only 429 uses Retry-After)
- Add 429 pipeline blocking: shared rate-limit state on AnalyticsClient
  (rateLimited, rateLimitWaitUntil, rateLimitStartTime), Looper checks
  before submitting BatchUploadTasks
- Add maxTotalBackoffDuration / maxRateLimitDuration config with 12h defaults,
  exposed via Analytics.Builder
- Restructure BatchUploadTask.run() retry loop: RATE_LIMITED strategy sets
  shared state and sleeps, BACKOFF strategy tracks firstFailureTime for
  duration guard, success clears rate-limit state
- Add tests: 408/503 use BACKOFF, 429 sets/clears rate-limit state,
  maxTotalBackoffDuration drops batch, maxRateLimitDuration drops batch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the Looper defers batch submission due to rate limiting and the
trigger was a batch-size overflow, the overflow message was being lost
because it had already been consumed from the queue but not added to
the batch. Now re-offers the overflow message back to the queue so it
can be retried on the next flush cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MichaelGHSeg marked this pull request as ready for review on February 25, 2026 at 19:13
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.



Comment thread analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java Outdated
Comment thread analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java Outdated
Comment thread analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java Outdated
Comment on lines +639 to +645
```java
// Check maxRateLimitDuration
if (client.rateLimitStartTime > 0
    && System.currentTimeMillis() - client.rateLimitStartTime
        > client.maxRateLimitDurationMs) {
  client.clearRateLimitState();
  break;
}
```

Copilot AI Feb 25, 2026


The maxRateLimitDuration check at lines 640-644 happens AFTER setRateLimitState is called (line 637). This means that if maxRateLimitDuration is very short (e.g., 1ms as tested), and the first 429 response is received, the rate limit state is set, then immediately the check might pass and the batch is dropped. However, the batch hasn't even attempted to wait for the Retry-After period yet (that happens at line 650).

Consider moving the maxRateLimitDuration check to AFTER the sleep, or restructure the logic so that the batch gets at least one chance to retry after the Retry-After period expires.

Comment on lines +428 to +429
```java
// Adjusted upward to accommodate shorter max retry backoff.
maximumFlushAttempts = 1000;
```

Copilot AI Feb 25, 2026


The default maximumFlushAttempts has been increased from 3 to 1000. While the comment mentions this is to "accommodate shorter max retry backoff," this is a dramatic increase (over 300x) that significantly changes the retry behavior. Combined with the new maxTotalBackoffDurationMs limit (12 hours by default), this could result in many more retry attempts within that window.

Consider whether 1000 is the appropriate value, or if a more moderate increase would be sufficient. Also, ensure this change is documented in release notes as it significantly affects operational behavior.

Suggested change:

```diff
- // Adjusted upward to accommodate shorter max retry backoff.
- maximumFlushAttempts = 1000;
+ // Increased from 3 to accommodate shorter max retry backoff while keeping retries bounded.
+ maximumFlushAttempts = 10;
```

Comment thread analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java Outdated
Comment thread analytics/src/main/java/com/segment/analytics/internal/AnalyticsClient.java Outdated
Comment on lines +365 to +372
public Builder maxTotalBackoffDuration(long duration, TimeUnit unit) {
long seconds = unit.toSeconds(duration);
if (seconds < 1) {
throw new IllegalArgumentException("maxTotalBackoffDuration must be at least 1 second.");
}
this.maxTotalBackoffDurationMs = unit.toMillis(duration);
return this;
}

Copilot AI Feb 25, 2026


The validation converts the duration to seconds but stores the value in milliseconds. Because unit.toSeconds(duration) rounds down, a duration of 999 milliseconds is rejected (it rounds to 0 and fails the < 1 check) even though the field stores milliseconds, while exactly 1000 milliseconds passes. The effective minimum is therefore 1 second.

Consider validating in milliseconds instead (if (unit.toMillis(duration) < 1000)) or documenting that the minimum is 1 second (not 1 millisecond).

Comment on lines +378 to +385
public Builder maxRateLimitDuration(long duration, TimeUnit unit) {
long seconds = unit.toSeconds(duration);
if (seconds < 1) {
throw new IllegalArgumentException("maxRateLimitDuration must be at least 1 second.");
}
this.maxRateLimitDurationMs = unit.toMillis(duration);
return this;
}

Copilot AI Feb 25, 2026


The validation converts the duration to seconds but stores the value in milliseconds. Because unit.toSeconds(duration) rounds down, a duration of 999 milliseconds is rejected (it rounds to 0 and fails the < 1 check) even though the field stores milliseconds, while exactly 1000 milliseconds passes. The effective minimum is therefore 1 second.

Consider validating in milliseconds instead (if (unit.toMillis(duration) < 1000)) or documenting that the minimum is 1 second (not 1 millisecond).

Comment on lines +407 to +419
```java
if (isRateLimited() && message != StopMessage.STOP) {
  log.print(DEBUG, "Rate-limited. Deferring batch submission.");
  // Don't clear messages — they'll be picked up on the next flush trigger
  if (batchSizeLimitReached) {
    // Preserve overflow message while deferring submission due to rate limiting.
    // This message was consumed from the queue but not added to the current batch.
    if (!messageQueue.offer(message)) {
      log.print(
          ERROR,
          "Failed to preserve overflow message while rate-limited; message may be dropped.");
    }
    batchSizeLimitReached = false;
  }
}
```

Copilot AI Feb 25, 2026


When rate-limited and the batch is deferred in the Looper (lines 407-419), the messages remain in the LinkedList but are never submitted if rate-limiting persists indefinitely. While the Looper thread will eventually process these on the next flush trigger or shutdown, if the rate-limit state lasts a very long time, these messages accumulate in memory.

Consider adding monitoring or logging to track how long messages remain in the deferred state, especially if rate-limiting persists for extended periods.

MichaelGHSeg and others added 4 commits February 25, 2026 16:32
- Add synchronized to rate-limit state methods to fix check-then-act
  race condition
- Add comments explaining defensive 429 and non-standard 460 in
  isStatusRetryWithBackoff
- Fix outdated test comments and line references
- Widen timing assertion upper bound (3s → 5s) to reduce CI flakiness

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
maxRetries was parsed from CLIConfig but never passed to the
Analytics.builder().retries() call, so the SDK always used its
default of 1000 retries regardless of test configuration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```java
// Sleep for Retry-After then retry this batch
if (result.retryAfterSeconds > 0) {
  try {
    TimeUnit.SECONDS.sleep(result.retryAfterSeconds);
```


looks like a bad batch that keeps retrying will block other batches from sending?

MichaelGHSeg and others added 6 commits February 27, 2026 15:35
Align retry configuration with cross-library defaults:
- Base backoff: 100ms -> 500ms
- Max retries: 1000 -> 10

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Changed SegmentService.upload parameter from int to Integer (boxed) so
Retrofit omits the header when null. First attempt passes null; retries
pass attempt-1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
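A sketch of the described signature, assuming Retrofit 2 semantics, under which a @Header whose argument is null is omitted from the request entirely. The endpoint path shown is an assumption, not copied from the diff:

```java
import retrofit2.Call;
import retrofit2.http.Body;
import retrofit2.http.Header;
import retrofit2.http.POST;

// Sketch: the boxed Integer lets the first attempt pass null, so the
// X-Retry-Count header is omitted entirely; retries pass attempt - 1.
public interface SegmentService {
  @POST("v1/import")
  Call<UploadResponse> upload(
      @Header("X-Retry-Count") Integer retryCount, @Body Batch batch);
}
```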
The network executor was given only a 1-second termination timeout,
which is too short for batches undergoing retries with exponential
backoff. In-flight uploads would be abandoned silently, causing failure
callbacks to never fire and the CLI to report success=true when retries
were exhausted.

Introduces NETWORK_TERMINATION_TIMEOUT_S (30s) so shutdown waits long
enough for any in-flight batch upload — including its full retry budget
— to complete and invoke callbacks before the process exits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Resolve java and mvn from PATH or devbox nix profile fallback
- Pass resolved $JAVA binary to run-tests.sh so it works outside devbox shell
- Add devbox setup as recommended path in README, document auto-fallback behavior
@MichaelGHSeg
Contributor Author

Deep Code Review — response-status-updates

Three parallel review passes: correctness/bugs, test coverage, and design/edge cases.


🔴 Critical

1. continue at end of RATE_LIMITED block makes the no-header path bypass maxTotalBackoffDuration check

Both the retryAfterSeconds > 0 and else (no-header) branches end up at the continue on the last line of the RATE_LIMITED block. This means a 429 without Retry-After never falls through to the BACKOFF block's wall-clock guard (firstFailureTime / maxTotalBackoffDurationMs). These batches are only bounded by maxBackoffAttempts — the time-based safety valve has no effect for the most common 429 case.

2. client.rateLimitStartTime is read unsynchronized in the maxRateLimitDuration check

AnalyticsClient.java reads client.rateLimitStartTime directly (outside synchronized) in BatchUploadTask.run() after calling setRateLimitState(). Another thread can call clearRateLimitState() between those two lines, making rateLimitStartTime read as 0 and silently skipping the duration guard — allowing rate-limiting to run indefinitely past maxRateLimitDurationMs. This check should move into a single synchronized method on the client.
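A minimal sketch of that consolidation, using the field names cited in this review (the method name is hypothetical):

```java
// Hypothetical: read-and-compare under the same lock that guards
// setRateLimitState()/clearRateLimitState(), so a concurrent clear
// cannot zero rateLimitStartTime mid-check.
synchronized boolean rateLimitDurationExceeded() {
  return rateLimitStartTime > 0
      && System.currentTimeMillis() - rateLimitStartTime > maxRateLimitDurationMs;
}
```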

3. maxRetries=0 + valid Retry-After loops until maxRateLimitDuration

With maxRetries=0 (maxBackoffAttempts=1), the defensive if (backoffAttempts >= maxBackoffAttempts) break inside the valid-Retry-After branch never fires because backoffAttempts is never incremented there. A server that consistently returns 429 + Retry-After will cause this task to loop until maxRateLimitDuration expires (default 12 hours). There is no test for this case.


🟠 Important

4. clearRateLimitState() in BACKOFF InterruptedException handler clears another task's rate-limit state

The InterruptedException catch in the BACKOFF path calls client.clearRateLimitState(). BACKOFF is entered on 5xx/transient errors — unrelated to rate limiting. If Task B (on BACKOFF) is interrupted, it wipes the shared rateLimitWaitUntil and rateLimitStartTime set by Task A (currently sleeping its Retry-After delay), prematurely unblocking the Looper. The clearRateLimitState() call should be removed from the BACKOFF interrupt handler.

5. Shared rate-limit state corrupted by concurrent tasks with a custom networkExecutor

The default executor is single-threaded, so this is safe by default. But Analytics.Builder.networkExecutor() is a public API. With a multi-thread executor: Task A hits 429 → setRateLimitState(60); Task B succeeds → clearRateLimitState() via the RetryStrategy.NONE path. Task A's rate-limit window is wiped and the Looper re-submits into a still-rate-limited endpoint. Fix: only clear rate-limit state from within the same task that set it, or use a timestamp-based clearIfExpired that won't reset a still-live window.
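A sketch of the timestamp-guarded variant suggested above (field names from this review; method name hypothetical):

```java
// Hypothetical clearIfExpired: a success on one task can no longer wipe a
// still-live rate-limit window set by another task.
synchronized void clearRateLimitStateIfExpired() {
  if (System.currentTimeMillis() >= rateLimitWaitUntil) {
    rateLimited = false;
    rateLimitWaitUntil = 0L;
    rateLimitStartTime = 0L;
  }
}
```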

6. RATE_LIMITED(retryAfterSeconds=0) sets state that is immediately auto-cleared — Looper deferral is a no-op

A 429 without Retry-After sets rateLimitWaitUntil = now + 0 = now. isRateLimited() immediately evaluates to false and auto-clears state, so the Looper is never actually blocked. These 429s are functionally BACKOFF, but the code goes through the RATE_LIMITED path, generating misleading log output ("Using backoff" but still setting global rate-limit state). Recommendation: return RetryStrategy.BACKOFF directly for a 429 without Retry-After, rather than routing through RATE_LIMITED(0).

7. NETWORK_TERMINATION_TIMEOUT_S=30 can silently abandon tasks sleeping on Retry-After

shutdownAndWait calls executor.shutdown() (not shutdownNow()) and waits 30 seconds. A task sleeping TimeUnit.SECONDS.sleep(120) for a Retry-After: 120 is not interrupted — it runs past the 30-second timeout, the executor is reported as "not terminated," and the task completes unsupervised after shutdown considers itself done. Callbacks may fire after the application has already exited. Recommendation: either call shutdownNow() on the network executor during shutdown to interrupt sleeping tasks, or document the accepted tradeoff. At minimum, add a comment explaining the 30s value relative to the 300s Retry-After cap.
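The standard two-phase pattern the recommendation points at, as a sketch (executor and constant names follow this PR's description; this is not the actual shutdownAndWait):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

void shutdownNetworkExecutor(ExecutorService networkExecutor) {
  networkExecutor.shutdown();
  try {
    if (!networkExecutor.awaitTermination(NETWORK_TERMINATION_TIMEOUT_S, TimeUnit.SECONDS)) {
      // Interrupt tasks still sleeping on a long Retry-After so their
      // failure callbacks fire before the process exits.
      networkExecutor.shutdownNow();
      networkExecutor.awaitTermination(1, TimeUnit.SECONDS);
    }
  } catch (InterruptedException e) {
    networkExecutor.shutdownNow();
    Thread.currentThread().interrupt();
  }
}
```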

8. Deferred messages can stall under low-traffic shutdown

When isRateLimited() returns true, the Looper holds messages in its local list with the comment "they'll be picked up on the next flush trigger." But if no new messages arrive and the flush scheduler is already stopped (shutdown in progress), those held messages never get a flush trigger. No test covers this Looper-stall scenario.

9. parseRetryAfterSeconds silently drops HTTP-date format with no warning

RFC 7231 allows Retry-After to be either a delay-seconds integer or an HTTP-date string. The NumberFormatException is silently swallowed and the code falls through to RATE_LIMITED(0) (counted backoff, not a timed sleep) with no log warning. The behavior difference is real and observable. No test covers this path.


🟡 Minor / Nitpick

10. Error message "X retries exhausted" uses totalAttempts, not maxRetries

When the valid-Retry-After path hits backoffAttempts >= maxBackoffAttempts and breaks, totalAttempts may be far larger than maxRetries. Callers inspecting the exception message see a confusing number with no indication that rate-limiting was involved. Consider separating the message for rate-limit exhaustion vs. backoff exhaustion.

11. backo.sleep(backoffAttempts - 1) — BACKOFF and no-header RATE_LIMITED share backoffAttempts counter

If a batch alternates between no-header 429s (RATE_LIMITED path) and 5xx (BACKOFF path), both paths increment the same backoffAttempts counter and advance the Backo slot index together. The delay schedule is not reset between strategy changes, which is likely unintentional.

12. 410 Gone marked retryable without justification

HTTP 410 means permanently removed. Retrying will never succeed against a standards-compliant server. The Node SDK includes it too, but there is no comment explaining why. Either add a comment explaining the rationale (e.g. "Segment API uses 410 transiently for...") or remove it. As-is, every reviewer will question it.

13. isStatusRetryWithBackoff includes a dead 429 case

429 is handled by isStatusRetryAfterEligible first and can never reach isStatusRetryWithBackoff. The defensive comment acknowledges this but the behavior difference (BACKOFF vs RATE_LIMITED, which skips global state) is not stated. Either remove the dead case or add a comment explaining what would break differently if it fired.

14. 60s Backo cap makes the 12-hour wall-clock budget mostly decorative

With base=500ms, cap=60s, maxRetries=10, the total backoff sleep is at most ~10 minutes. The maxTotalBackoffDurationMs budget of 12 hours never fires in normal operation. The two knobs are not co-documented, which makes it hard for callers to reason about actual retry duration. Document that maxTotalBackoffDurationMs is a safety ceiling that only activates at very high retry counts.

15. e2e-cli/Main.kt maxRetries=1000 is a magic number

The intent (let the test harness timeout govern, not retry exhaustion) is reasonable, but should be a named constant with a comment.


Test Coverage Gaps

| Status | Gap |
| --- | --- |
| ❌ | Missing: 410 and 460 retryable — no tests |
| ❌ | Missing: 505 and 511 non-retryable 5xx — only 501 is tested |
| ❌ | Missing: 400, 401, 403 non-retryable 4xx — only 404 and 413 tested |
| ❌ | Missing: parseRetryAfterSeconds with HTTP-date input |
| ❌ | Missing: maxRetries=0 + valid Retry-After (infinite-loop risk) |
| ❌ | Missing: Looper deferral expiry/recovery (rate limit clears, deferred batch eventually submitted) |
| ⚠️ | Weak: rateLimitStateSetOn429 never asserts state is set, only that it's cleared afterward (duplicate of rateLimitStateClearedOnSuccess) |
| ⚠️ | Slow: retryAfterHeaderRespectsShortDelay (2s), retryAfterDoesNotCountAgainstMaxRetries (~2s), maxRateLimitDurationDropsBatch (~1s) — none annotated as slow tests |
| ⚠️ | Missing: Builder.maxTotalBackoffDuration/maxRateLimitDuration validation (< 1s throws) |
| ⚠️ | Missing: isRateLimited() auto-expiry branch (never directly tested) |

Overall: the structure is sound and the dual-budget retry semantics are well-designed. The critical issues (unsynchronized field access, continue-bypass of time guard, maxRetries=0 loop) should be fixed before merge. The RATE_LIMITED(0) semantic inconsistency and the clearRateLimitState() in the wrong interrupt handler are the next priority.
