feat: add automatic S3 request retry with exponential backoff#1701
feat: add automatic S3 request retry with exponential backoff#1701allanrogerr wants to merge 11 commits intominio:masterfrom
Conversation
Implements SDK-level automatic retry matching the minio-go retry-mechanism spec. All requests through BaseS3Client.executeAsync() now retry on retryable S3 error codes (InternalError, SlowDown, RequestTimeout, etc.), retryable HTTP status codes (408, 429, 499, 500, 502, 503, 504, 520), and transient IO errors (connection reset, EOF, server closed idle connection). Non-seekable request bodies (raw okhttp3.RequestBody) get a single attempt; seekable bodies (byte[], ByteBuffer, RandomAccessFile) retry up to maxRetries (default 10). Backoff is full-jitter exponential: rand(0, min(1s, 200ms*2^n)) before the nth retry, matching minio-go's DefaultRetryUnit/DefaultRetryCap. The "RetryHead" S3 code is explicitly excluded so executeHeadAsync region redirect logic is unaffected. maxRetries is configurable via builder and setMaxRetries() post-construction. Closes minio#1700.
49fda52 to
9524c1f
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces SDK-level automatic retry for S3 requests with full-jitter exponential backoff, aligning the Java SDK’s behavior with minio-go’s retry policy and making transient S3/HTTP/IO failures less likely to leak to callers.
Changes:
- Added a new package-private
Retryhelper for retryability classification (S3 codes, HTTP status codes, IO exceptions) and jittered exponential backoff. - Wired a retry loop into
BaseS3Client.executeAsync()with delayed retries via a daemonScheduledExecutorService, plus a seekability-based single-attempt gate for non-replayable bodies. - Exposed configuration via
maxRetries(...)on builders andsetMaxRetries(...), and added anInvalidResponseException.responseCode()accessor plus a comprehensive newRetryTestsuite.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
api/src/main/java/io/minio/BaseS3Client.java |
Adds execute-with-retry loop, backoff scheduling, and max-retry configuration. |
api/src/main/java/io/minio/Retry.java |
Defines retryable error sets and backoff computation. |
api/src/main/java/io/minio/MinioAsyncClient.java |
Adds maxRetries(...) builder option and applies it to the built client. |
api/src/main/java/io/minio/MinioClient.java |
Exposes setMaxRetries(...) and maxRetries(...) on the synchronous client/builder. |
api/src/main/java/io/minio/Http.java |
Adds Http.S3Request.body() accessor used for retry seekability gating. |
api/src/main/java/io/minio/errors/InvalidResponseException.java |
Stores/returns the triggering HTTP response code for retry classification. |
api/src/test/java/io/minio/RetryTest.java |
Adds unit + MockWebServer integration tests covering retry classification and behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Http.Body now captures the initial file pointer (fileOffset) when
created for RandomAccessFile bodies; toRequestBody() seeks back to
that offset before each attempt, fixing both first-attempt and retry
payload corruption when checksum computation has advanced the pointer
- BaseS3Client.createBody() explicitly restores the file pointer after
computing checksums so Http.Body captures the correct starting offset
- Retry scheduler now dispatches actual retry work (credential fetch +
request build) to ForkJoinPool.commonPool(), freeing the single
scheduler thread from head-of-line blocking by slow providers
- Replace package-private {@link Retry#MAX_RETRY} Javadoc references in
MinioClient and MinioAsyncClient with {@code 10} to satisfy doclint
- Fix RETRYABLE_HTTP_CODES closing paren indentation to match codebase
style (ImmutableSet.of() in Http.java and Signer.java)
- Add testRandomAccessFileBodyRetried integration test: verifies that
both the failed and retried PUT requests carry identical body content
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…he.remove The no-retry tests for 404/403 were asserting only assertNotNull, which would pass for any exception including NPEs. Stronger assertions now verify ErrorResponseException type and S3 error code/HTTP status, and the retry-exhausted test verifies InvalidResponseException with status 500. While adding those assertions, the tests revealed a pre-existing NPE: regionCache.remove(s3request.bucket()) was called unconditionally when the S3 error code was NoSuchBucket/RetryHead, but bucket is null for operations like listBuckets(). Added a null guard so the cache invalidation is skipped when there is no bucket in the request.
…gs JUA SpotBugs JUA flags Assert.assertTrue(e instanceof T) because the message hides what type was actually thrown on failure. Using a typed catch block is stricter: any unexpected exception type propagates as a test error with the full stack trace, which is more informative than a failed boolean assertion.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…strengthen test assertions Replace the shared static Random instance with ThreadLocalRandom.current() at the backoff call site and for the fan-out name generator in MinioAsyncClient — eliminating shared-state CAS contention under concurrent retry load. Disable OkHttp's retryOnConnectionFailure in doExecuteAsync so the SDK's executeWithRetry is the sole retry policy; previously OkHttp could silently add an extra attempt on stale-connection IOExceptions, causing more total attempts than maxRetries. Strengthen the two remaining assertNotNull catch blocks in RetryTest (testMaxRetriesOneDisablesRetry, testSetMaxRetriesPostConstruction) to typed InvalidResponseException catches with responseCode()==500 assertions, matching the pattern applied to the other tests in the previous commit.
Reorder ThreadLocalRandom import to match alphabetical sort order and collapse the two-line OkHttpClient builder call to one line, satisfying the project's spotlessJavaCheck formatter requirements.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
If ForkJoinPool.commonPool() rejects the retry task (e.g. during JVM shutdown), the runAsync future completes exceptionally but retryFuture was never completed, causing the overall request future to hang. Wire an .exceptionally() handler on the runAsync future so any dispatch failure propagates into retryFuture immediately.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Per @balamurugana's review on PR minio#1701, replace the CompletableFuture/scheduler-based retry orchestration with an OkHttp interceptor (Http.RetryInterceptor) that retries on retryable HTTP status codes (408/429/499/500/502/503/504/520) and IOExceptions with exponential backoff and full jitter. - Drop S3-error-code retry classification; well-behaved S3/MinIO servers return retryable conditions with non-2xx HTTP status, so RETRYABLE_S3_CODES is redundant. - Drop RETRY_SCHEDULER and the executeWithRetry recursive chain. - Non-replayable bodies (raw okhttp3.RequestBody) are sent once via Http.RequestBody.isReplayable(); byte[]/ByteBuffer/RandomAccessFile bodies replay safely because writeTo rewinds the source. - Wire interceptor at construction (BaseS3Client.wrapWithRetry) so setMaxRetries() takes effect on subsequent requests.
Replace `throws Exception` with the actually-thrown checked exceptions (`IOException, MinioException`, plus `InterruptedException` for tests calling `MockWebServer.takeRequest()`). Drops `throws` entirely on `testSetMaxRetriesValidation` since it throws only an unchecked `IllegalArgumentException`. Fixes spotbugsTest THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION findings on Java 8/11/17/21/25.
Per @balamurugana's review on PR minio#1701, replace the CompletableFuture/scheduler retry orchestration with a self-contained OkHttp interceptor that retries on retryable HTTP status codes with exponential backoff and full jitter. - Http.RetryInterceptor: hard-coded MAX_RETRIES=5, BASE_DELAY_MS=200, MAX_DELAY_MS=30s, RETRYABLE_STATUS_CODES = 408/429/499/500/502/503/ 504/520. Bit-shift exponent capped to avoid overflow at high attempt counts; jitter bound guarded with Math.max(1, cap). - Wired into Http.newDefaultClient() with a single addInterceptor call. - Drop public maxRetries API on MinioClient/MinioAsyncClient/BaseS3Client and Builder.maxRetries; the interceptor's params are constants. - Drop Retry.java entirely — RETRYABLE_HTTP_CODES inlined; the S3-error- code retry set, isRetryable(Throwable), isRetryableIOException, and computeBackoffMs helpers are no longer needed. - IOException retry now relies on OkHttp's default retryOnConnectionFailure(true); only the interceptor handles status codes. No double-retry, no overlap. - Drop Http.RequestBody.isReplayable() and the body-replayability gate; the raw okhttp3 RequestBody body path is unused in production code, and byte[]/ByteBuffer/RandomAccessFile bodies replay safely via writeTo(). - Drop InvalidResponseException.responseCode() accessor (added solely for the deleted retry classifier). - RetryTest: trim from 539 lines to two integration smoke tests (retry on 503->200, no retry on 404). Validated locally: - :api:check (unit + spotbugs + spotless) green on Java 17 - :functional:runFunctionalTest passes through CORS test (modulo a pre-existing CORSConfiguration.equals() bug unrelated to this PR) - mint suite (overlayed JARs in mint container, erasure mode): "All tests ran successfully", PASS in 42.8s
| /** | ||
| * OkHttp interceptor that retries requests on retryable HTTP status codes with exponential | ||
| * backoff and full jitter. | ||
| */ | ||
| public static class RetryInterceptor implements Interceptor { |
| // Cap exponent to avoid bit-shift overflow at high attempt counts. | ||
| int exp = Math.min(tryCount, 30); | ||
| long backoffCap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * (1L << exp)); | ||
| long jittered = ThreadLocalRandom.current().nextLong(0, Math.max(1, backoffCap)); | ||
| try { | ||
| Thread.sleep(jittered); | ||
| } catch (InterruptedException ie) { | ||
| Thread.currentThread().interrupt(); | ||
| throw new IOException("retry interrupted", ie); | ||
| } |
| /** Execute HTTP request asynchronously for given parameters. */ | ||
| /** | ||
| * Execute HTTP request asynchronously for given parameters. Retries on retryable HTTP status | ||
| * codes are handled by {@link Http.RetryInterceptor} installed on the default HTTP client. |
| @Test | ||
| public void testRetryOnRetryableStatusThenSuccess() throws IOException, MinioException { | ||
| try (MockWebServer server = new MockWebServer()) { | ||
| server.enqueue(retryableServerError()); | ||
| server.enqueue(successResponse()); | ||
| server.start(); | ||
|
|
||
| MinioClient client = MinioClient.builder().endpoint(server.url("").toString()).build(); | ||
| client.listBuckets(); | ||
|
|
||
| Assert.assertEquals(2, server.getRequestCount()); | ||
| } | ||
| } |
Description
Adds SDK-level automatic retry on transient HTTP failures via a single OkHttp interceptor.
Http.RetryInterceptor(~45 lines) is installed on the default client built byHttp.newDefaultClient(). It retries on retryable HTTP status codes with exponential backoff and full jitter, then returns the final response (whether eventually-successful or terminal). IOException retry is delegated to OkHttp's defaultretryOnConnectionFailure(true)— no overlap, no SDK-level orchestration.Retryable HTTP status codes (8): 408, 429, 499, 500, 502, 503, 504, 520.
Backoff:
rand(0, min(30 s, 200 ms × 2ⁿ))before the nth retry, withncapped at 30 to avoid bit-shift overflow and the random bound guarded withMath.max(1, …)to avoid the zero-bound edge case.Max retries: 5 additional attempts after the initial request (so up to 6 total).
Region/HEAD redirect: untouched —
executeHeadAsyncregion rediscovery still works.Motivation and Context
The SDK had no retry capability for transient HTTP failures. Throttling (503), gateway errors (502/504), request timeouts (408), and Cloudflare 520s propagated directly to callers, requiring every consumer to implement its own retry loop.
This PR went through several iterations. The first design was a
CompletableFuture-based wrapper with a dedicated scheduler thread, an S3-error-code classifier, an IOException-retry helper, and a publicsetMaxRetriesAPI. Per @balamurugana's review, that was stripped back to the simpler interceptor approach: status-code-only, hard-coded params, no public API surface, no separate classifier class. The current PR is +46 / −668 lines vs the prior head — a net reduction of 622 lines.Does not address #1700: error code
BucketNotEmpty(HTTP 409) is not natively retriable in minio-go and stays out of the retryable set here.How to test this PR
Unit + lint (Java 17):
Runs the 2 integration smoke tests in
api/src/test/java/io/minio/RetryTest.java(retry on 503→200, no retry on 404), plus spotbugs/spotless.Functional tests against a local server:
Mint suite via Docker:
Validation performed
:api:check(unit + spotbugs + spotless, Java 17):functional:runFunctionalTestagainst local MinIO (quick mode)Types of changes
Checklist:
commit-idorPR #here)