Skip to content

feat: add automatic S3 request retry with exponential backoff#1701

Draft
allanrogerr wants to merge 11 commits intominio:masterfrom
allanrogerr:feature/retry-mechanism
Draft

feat: add automatic S3 request retry with exponential backoff#1701
allanrogerr wants to merge 11 commits intominio:masterfrom
allanrogerr:feature/retry-mechanism

Conversation

@allanrogerr
Copy link
Copy Markdown

@allanrogerr allanrogerr commented May 4, 2026

Description

Adds SDK-level automatic retry on transient HTTP failures via a single OkHttp interceptor.

Http.RetryInterceptor (~45 lines) is installed on the default client built by Http.newDefaultClient(). It retries on retryable HTTP status codes with exponential backoff and full jitter, then returns the final response (whether eventually-successful or terminal). IOException retry is delegated to OkHttp's default retryOnConnectionFailure(true) — no overlap, no SDK-level orchestration.

Retryable HTTP status codes (8): 408, 429, 499, 500, 502, 503, 504, 520.
Backoff: rand(0, min(30 s, 200 ms × 2ⁿ)) before the nth retry, with n capped at 30 to avoid bit-shift overflow and the random bound guarded with Math.max(1, …) to avoid the zero-bound edge case.
Max retries: 5 additional attempts after the initial request (so up to 6 total).
Region/HEAD redirect: untouched — executeHeadAsync region rediscovery still works.

Motivation and Context

The SDK had no retry capability for transient HTTP failures. Throttling (503), gateway errors (502/504), request timeouts (408), and Cloudflare 520s propagated directly to callers, requiring every consumer to implement its own retry loop.

This PR went through several iterations. The first design was a CompletableFuture-based wrapper with a dedicated scheduler thread, an S3-error-code classifier, an IOException-retry helper, and a public setMaxRetries API. Per @balamurugana's review, that was stripped back to the simpler interceptor approach: status-code-only, hard-coded params, no public API surface, no separate classifier class. The current PR is +46 / −668 lines vs the prior head — a net reduction of 622 lines.

Does not address #1700: error code BucketNotEmpty (HTTP 409) is not natively retriable in minio-go and stays out of the retryable set here.

How to test this PR

  1. Unit + lint (Java 17):

    ./gradlew :api:check
    

    Runs the 2 integration smoke tests in api/src/test/java/io/minio/RetryTest.java (retry on 503→200, no retry on 404), plus spotbugs/spotless.

  2. Functional tests against a local server:

    MINT_MODE=quick ./gradlew :functional:runFunctionalTest \
      -Pendpoint=http://localhost:9000 \
      -PaccessKey=minioadmin -PsecretKey=minioadmin -Pregion=us-east-1
    
  3. Mint suite via Docker:

    docker run --rm --network host \
      -e SERVER_ENDPOINT="127.0.0.1:9000" \
      -e ACCESS_KEY="minioadmin" -e SECRET_KEY="minioadmin" \
      -e ENABLE_HTTPS=0 -e MINT_MODE="full" \
      minio/mint:local minio-java
    

Validation performed

Check Result
:api:check (unit + spotbugs + spotless, Java 17) ✅ green
:functional:runFunctionalTest against local MinIO (quick mode) ✅ passes through bucket / object / encryption / lifecycle / policy / CORS coverage
Mint suite (erasure mode, JARs from this branch overlayed into mint container) All tests ran successfully, PASS in 42.8 s

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Optimization (provides speedup with no functional changes)
  • Cleanup/Maintenance (no functional changes)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • Fixes a regression (If yes, please add commit-id or PR # here)
  • Unit tests added/updated
  • Internal documentation updated
  • Create a documentation update request here
  • No internode changes / version interoperability introduced

@allanrogerr allanrogerr added the test-framework Tests and test framework changes label May 4, 2026
@allanrogerr allanrogerr self-assigned this May 4, 2026
Implements SDK-level automatic retry matching the minio-go retry-mechanism
spec. All requests through BaseS3Client.executeAsync() now retry on
retryable S3 error codes (InternalError, SlowDown, RequestTimeout, etc.),
retryable HTTP status codes (408, 429, 499, 500, 502, 503, 504, 520), and
transient IO errors (connection reset, EOF, server closed idle connection).

Non-seekable request bodies (raw okhttp3.RequestBody) get a single attempt;
seekable bodies (byte[], ByteBuffer, RandomAccessFile) retry up to maxRetries
(default 10). Backoff is full-jitter exponential: rand(0, min(1s, 200ms*2^n))
before the nth retry, matching minio-go's DefaultRetryUnit/DefaultRetryCap.
The "RetryHead" S3 code is explicitly excluded so executeHeadAsync region
redirect logic is unaffected. maxRetries is configurable via builder and
setMaxRetries() post-construction.

Closes minio#1700.
@allanrogerr allanrogerr force-pushed the feature/retry-mechanism branch from 49fda52 to 9524c1f Compare May 4, 2026 11:50
@allanrogerr allanrogerr marked this pull request as ready for review May 4, 2026 13:37
@allanrogerr allanrogerr requested review from Copilot and rraulinio May 4, 2026 13:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces SDK-level automatic retry for S3 requests with full-jitter exponential backoff, aligning the Java SDK’s behavior with minio-go’s retry policy and making transient S3/HTTP/IO failures less likely to leak to callers.

Changes:

  • Added a new package-private Retry helper for retryability classification (S3 codes, HTTP status codes, IO exceptions) and jittered exponential backoff.
  • Wired a retry loop into BaseS3Client.executeAsync() with delayed retries via a daemon ScheduledExecutorService, plus a seekability-based single-attempt gate for non-replayable bodies.
  • Exposed configuration via maxRetries(...) on builders and setMaxRetries(...), and added an InvalidResponseException.responseCode() accessor plus a comprehensive new RetryTest suite.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
api/src/main/java/io/minio/BaseS3Client.java Adds execute-with-retry loop, backoff scheduling, and max-retry configuration.
api/src/main/java/io/minio/Retry.java Defines retryable error sets and backoff computation.
api/src/main/java/io/minio/MinioAsyncClient.java Adds maxRetries(...) builder option and applies it to the built client.
api/src/main/java/io/minio/MinioClient.java Exposes setMaxRetries(...) and maxRetries(...) on the synchronous client/builder.
api/src/main/java/io/minio/Http.java Adds Http.S3Request.body() accessor used for retry seekability gating.
api/src/main/java/io/minio/errors/InvalidResponseException.java Stores/returns the triggering HTTP response code for retry classification.
api/src/test/java/io/minio/RetryTest.java Adds unit + MockWebServer integration tests covering retry classification and behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/src/main/java/io/minio/BaseS3Client.java Outdated
Comment thread api/src/main/java/io/minio/BaseS3Client.java Outdated
Comment thread api/src/test/java/io/minio/RetryTest.java Outdated
Comment thread api/src/main/java/io/minio/Retry.java Outdated
Comment thread api/src/main/java/io/minio/MinioClient.java Outdated
Comment thread api/src/main/java/io/minio/MinioAsyncClient.java Outdated
- Http.Body now captures the initial file pointer (fileOffset) when
  created for RandomAccessFile bodies; toRequestBody() seeks back to
  that offset before each attempt, fixing both first-attempt and retry
  payload corruption when checksum computation has advanced the pointer
- BaseS3Client.createBody() explicitly restores the file pointer after
  computing checksums so Http.Body captures the correct starting offset
- Retry scheduler now dispatches actual retry work (credential fetch +
  request build) to ForkJoinPool.commonPool(), freeing the single
  scheduler thread from head-of-line blocking by slow providers
- Replace package-private {@link Retry#MAX_RETRY} Javadoc references in
  MinioClient and MinioAsyncClient with {@code 10} to satisfy doclint
- Fix RETRYABLE_HTTP_CODES closing paren indentation to match codebase
  style (ImmutableSet.of() in Http.java and Signer.java)
- Add testRandomAccessFileBodyRetried integration test: verifies that
  both the failed and retried PUT requests carry identical body content
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/src/main/java/io/minio/BaseS3Client.java Outdated
Comment thread api/src/main/java/io/minio/Http.java
Comment thread api/src/test/java/io/minio/RetryTest.java Outdated
Comment thread api/src/test/java/io/minio/RetryTest.java Outdated
Comment thread api/src/test/java/io/minio/RetryTest.java Outdated
…he.remove

The no-retry tests for 404/403 were asserting only assertNotNull, which
would pass for any exception including NPEs.  Stronger assertions now
verify ErrorResponseException type and S3 error code/HTTP status, and
the retry-exhausted test verifies InvalidResponseException with status
500.

While adding those assertions, the tests revealed a pre-existing NPE:
regionCache.remove(s3request.bucket()) was called unconditionally when
the S3 error code was NoSuchBucket/RetryHead, but bucket is null for
operations like listBuckets().  Added a null guard so the cache
invalidation is skipped when there is no bucket in the request.
@allanrogerr allanrogerr requested a review from Copilot May 5, 2026 19:12
@allanrogerr allanrogerr review requested due to automatic review settings May 5, 2026 19:13
…gs JUA

SpotBugs JUA flags Assert.assertTrue(e instanceof T) because the message
hides what type was actually thrown on failure.  Using a typed catch block
is stricter: any unexpected exception type propagates as a test error with
the full stack trace, which is more informative than a failed boolean
assertion.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/src/main/java/io/minio/BaseS3Client.java Outdated
Comment thread api/src/main/java/io/minio/BaseS3Client.java Outdated
Comment thread api/src/test/java/io/minio/RetryTest.java Outdated
Comment thread api/src/test/java/io/minio/RetryTest.java Outdated
…strengthen test assertions

Replace the shared static Random instance with ThreadLocalRandom.current()
at the backoff call site and for the fan-out name generator in
MinioAsyncClient — eliminating shared-state CAS contention under concurrent
retry load.

Disable OkHttp's retryOnConnectionFailure in doExecuteAsync so the SDK's
executeWithRetry is the sole retry policy; previously OkHttp could silently
add an extra attempt on stale-connection IOExceptions, causing more total
attempts than maxRetries.

Strengthen the two remaining assertNotNull catch blocks in RetryTest
(testMaxRetriesOneDisablesRetry, testSetMaxRetriesPostConstruction) to typed
InvalidResponseException catches with responseCode()==500 assertions,
matching the pattern applied to the other tests in the previous commit.
@allanrogerr allanrogerr requested a review from Copilot May 5, 2026 20:27
@allanrogerr allanrogerr review requested due to automatic review settings May 5, 2026 20:32
Reorder ThreadLocalRandom import to match alphabetical sort order and
collapse the two-line OkHttpClient builder call to one line, satisfying
the project's spotlessJavaCheck formatter requirements.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/src/main/java/io/minio/BaseS3Client.java Outdated
If ForkJoinPool.commonPool() rejects the retry task (e.g. during JVM
shutdown), the runAsync future completes exceptionally but retryFuture
was never completed, causing the overall request future to hang. Wire
an .exceptionally() handler on the runAsync future so any dispatch
failure propagates into retryFuture immediately.
Comment thread api/src/main/java/io/minio/errors/InvalidResponseException.java
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Per @balamurugana's review on PR minio#1701, replace the
CompletableFuture/scheduler-based retry orchestration with an OkHttp
interceptor (Http.RetryInterceptor) that retries on retryable HTTP
status codes (408/429/499/500/502/503/504/520) and IOExceptions with
exponential backoff and full jitter.

- Drop S3-error-code retry classification; well-behaved S3/MinIO
  servers return retryable conditions with non-2xx HTTP status, so
  RETRYABLE_S3_CODES is redundant.
- Drop RETRY_SCHEDULER and the executeWithRetry recursive chain.
- Non-replayable bodies (raw okhttp3.RequestBody) are sent once via
  Http.RequestBody.isReplayable(); byte[]/ByteBuffer/RandomAccessFile
  bodies replay safely because writeTo rewinds the source.
- Wire interceptor at construction (BaseS3Client.wrapWithRetry) so
  setMaxRetries() takes effect on subsequent requests.
Replace `throws Exception` with the actually-thrown checked exceptions
(`IOException, MinioException`, plus `InterruptedException` for tests
calling `MockWebServer.takeRequest()`). Drops `throws` entirely on
`testSetMaxRetriesValidation` since it throws only an unchecked
`IllegalArgumentException`.

Fixes spotbugsTest THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION
findings on Java 8/11/17/21/25.
@allanrogerr allanrogerr marked this pull request as draft May 7, 2026 17:10
Per @balamurugana's review on PR minio#1701, replace the
CompletableFuture/scheduler retry orchestration with a self-contained
OkHttp interceptor that retries on retryable HTTP status codes with
exponential backoff and full jitter.

- Http.RetryInterceptor: hard-coded MAX_RETRIES=5, BASE_DELAY_MS=200,
  MAX_DELAY_MS=30s, RETRYABLE_STATUS_CODES = 408/429/499/500/502/503/
  504/520. Bit-shift exponent capped to avoid overflow at high attempt
  counts; jitter bound guarded with Math.max(1, cap).
- Wired into Http.newDefaultClient() with a single addInterceptor call.
- Drop public maxRetries API on MinioClient/MinioAsyncClient/BaseS3Client
  and Builder.maxRetries; the interceptor's params are constants.
- Drop Retry.java entirely — RETRYABLE_HTTP_CODES inlined; the S3-error-
  code retry set, isRetryable(Throwable), isRetryableIOException, and
  computeBackoffMs helpers are no longer needed.
- IOException retry now relies on OkHttp's default
  retryOnConnectionFailure(true); only the interceptor handles status
  codes. No double-retry, no overlap.
- Drop Http.RequestBody.isReplayable() and the body-replayability gate;
  the raw okhttp3 RequestBody body path is unused in production code,
  and byte[]/ByteBuffer/RandomAccessFile bodies replay safely via
  writeTo().
- Drop InvalidResponseException.responseCode() accessor (added solely
  for the deleted retry classifier).
- RetryTest: trim from 539 lines to two integration smoke tests
  (retry on 503->200, no retry on 404).

Validated locally:
- :api:check (unit + spotbugs + spotless) green on Java 17
- :functional:runFunctionalTest passes through CORS test (modulo a
  pre-existing CORSConfiguration.equals() bug unrelated to this PR)
- mint suite (overlayed JARs in mint container, erasure mode):
  "All tests ran successfully", PASS in 42.8s
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment on lines +689 to +693
/**
* OkHttp interceptor that retries requests on retryable HTTP status codes with exponential
* backoff and full jitter.
*/
public static class RetryInterceptor implements Interceptor {
Comment on lines +718 to +727
// Cap exponent to avoid bit-shift overflow at high attempt counts.
int exp = Math.min(tryCount, 30);
long backoffCap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * (1L << exp));
long jittered = ThreadLocalRandom.current().nextLong(0, Math.max(1, backoffCap));
try {
Thread.sleep(jittered);
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
throw new IOException("retry interrupted", ie);
}
/** Execute HTTP request asynchronously for given parameters. */
/**
* Execute HTTP request asynchronously for given parameters. Retries on retryable HTTP status
* codes are handled by {@link Http.RetryInterceptor} installed on the default HTTP client.
Comment on lines +50 to +62
@Test
public void testRetryOnRetryableStatusThenSuccess() throws IOException, MinioException {
try (MockWebServer server = new MockWebServer()) {
server.enqueue(retryableServerError());
server.enqueue(successResponse());
server.start();

MinioClient client = MinioClient.builder().endpoint(server.url("").toString()).build();
client.listBuckets();

Assert.assertEquals(2, server.getRequestCount());
}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-framework Tests and test framework changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants