Document the SLURM workflow for running Iris core tests on scheduler-managed GPU clusters #533
Merged
Add a getting-started SLURM guide and make the batch wrapper more portable by removing user-specific paths and using generic scratch and log defaults.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves the ability to run Iris core tests and examples on SLURM-managed GPU clusters by fixing the distributed pytest runner and adding documented, reusable SLURM batch wrappers.
Changes:
- Refactors `tests/run_tests_distributed.py` into a launcher/worker that can self-invoke `torchrun`.
- Adds SLURM batch scripts to run core tests and arbitrary example scripts inside a prebuilt Docker image, staging to node-local scratch and copying artifacts back.
- Updates docs to include a new SLURM guide and cross-links from installation/index pages.
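
The launcher/worker split from the first bullet can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from the PR: `build_torchrun_command`, `main`, the default script name, and the `--standalone` rendezvous choice are hypothetical, and the sketch only constructs the command rather than executing it.

```python
import argparse
import os
import sys


def running_under_torchrun(env=None) -> bool:
    """torchrun exports RANK and WORLD_SIZE to every worker it spawns."""
    env = os.environ if env is None else env
    return "RANK" in env and "WORLD_SIZE" in env


def build_torchrun_command(script, num_ranks, pytest_args):
    """Hypothetical helper: re-invoke `script` through torch.distributed.run."""
    return [
        sys.executable, "-m", "torch.distributed.run",
        "--standalone",                    # assumed: single-node rendezvous
        f"--nproc_per_node={num_ranks}",
        script,
        f"--num_ranks={num_ranks}",
        *pytest_args,
    ]


def main(argv, env=None, script="run_tests_distributed.py"):
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_ranks", type=int, default=1)
    # Anything argparse does not recognize is forwarded to pytest untouched.
    args, pytest_args = parser.parse_known_args(argv)

    if running_under_torchrun(env):
        # Worker mode: a real runner would set up the process group and call pytest.
        return ("worker", pytest_args)
    # Launcher mode: a real runner would hand this list to subprocess.run().
    return ("launcher", build_torchrun_command(script, args.num_ranks, pytest_args))
```

Returning the command instead of running it keeps the sketch free of a PyTorch dependency; the same entry point serves both a direct invocation and the re-invocation under `torchrun`.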
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/run_tests_distributed.py` | Makes the distributed test runner usable both directly and under torchrun, and fixes rendezvous launching behavior. |
| `scripts/run_example_slurm.sh` | Adds a generic SLURM wrapper to run example scripts in Docker and persist logs/results. |
| `scripts/run_core_tests_slurm.sh` | Adds a SLURM wrapper to run the core test suite in Docker and persist logs. |
| `scripts/run_core_tests.sh` | Adds GPU visibility detection to skip rank configs that exceed allocated/visible GPUs; uses python3/python discovery. |
| `docs/index.md` | Links to the new SLURM guide from the docs landing page. |
| `docs/getting-started/slurm.md` | Adds a new guide describing recommended SLURM workflows and the provided scripts. |
| `docs/getting-started/installation.md` | Adds cross-links to the SLURM guide for scheduler-managed environments. |
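
The GPU visibility check described for `scripts/run_core_tests.sh` could look something like the following. This is a sketch in Python rather than the shell code from the PR; the function names and the order in which the visibility variables are consulted are assumptions.

```python
import os


def visible_gpu_count(env=None):
    """Return the number of GPUs exposed via visibility variables, or None if unknown."""
    env = os.environ if env is None else env
    # Assumed lookup order; the PR's script may consult other sources.
    for var in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        value = env.get(var)
        if value is not None:
            return len([d for d in value.split(",") if d.strip()])
    return None


def runnable_rank_configs(rank_configs, env=None):
    """Keep only the rank counts that fit on the visible GPUs."""
    limit = visible_gpu_count(env)
    if limit is None:
        return list(rank_configs)  # nothing to compare against, so skip nothing
    return [n for n in rank_configs if n <= limit]
```

With four devices visible, a configuration list of 1, 2, 4, and 8 ranks would be reduced to 1, 2, and 4, so an 8-rank run is skipped instead of failing at launch.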
Comment on lines +95 to +108

    copy_artifacts_and_cleanup() {
      local exit_code=$1

      mkdir -p "$PERSIST_LOG_ROOT"
      if [ -d "$WORKSPACE_DIR/logs" ]; then
        rsync -a "$WORKSPACE_DIR/logs/" "$PERSIST_LOG_ROOT/logs/" || exit_code=$?
      fi
      if [ -d "$WORKSPACE_DIR/results" ]; then
        rsync -a "$WORKSPACE_DIR/results/" "$PERSIST_LOG_ROOT/results/" || exit_code=$?
      fi

      rm -rf "$WORK_ROOT"
      exit "$exit_code"
    }

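
The copy-back-then-clean-up pattern in the excerpt above can be mirrored in Python as a rough sketch. It is not part of the PR, and it makes one deliberate assumption that differs from the shell version: a copy failure only changes the exit code when the job itself succeeded, so an earlier test failure code is not overwritten.

```python
import shutil
from pathlib import Path


def copy_artifacts_and_cleanup(exit_code, workspace_dir, persist_root, work_root):
    """Copy logs/ and results/ back, remove scratch, and return the final exit code."""
    persist_root = Path(persist_root)
    persist_root.mkdir(parents=True, exist_ok=True)
    for name in ("logs", "results"):
        src = Path(workspace_dir) / name
        if not src.is_dir():
            continue
        try:
            shutil.copytree(src, persist_root / name, dirs_exist_ok=True)
        except (OSError, shutil.Error):
            # Assumption: keep an earlier failure code and only flag the copy
            # failure when the job itself succeeded.
            if exit_code == 0:
                exit_code = 1
    # Scratch is removed even when a copy failed, as in the shell function.
    shutil.rmtree(work_root, ignore_errors=True)
    return exit_code
```

A caller would pass the job's exit status in and use the returned value as the process exit code, for example via `sys.exit(...)`.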

    def _running_under_torchrun() -> bool:
        return "RANK" in os.environ and "WORLD_SIZE" in os.environ

Comment on lines +85 to +98

    docker run --rm \
      --name "$CONTAINER_NAME" \
      --label "$CONTAINER_LABEL" \
      --network=host \
      --ipc=host \
      --device=/dev/kfd \
      --device=/dev/dri \
      --group-add video \
      --cap-add=SYS_PTRACE \
      --security-opt seccomp=unconfined \
      --shm-size=16G \
      --ulimit memlock=-1 \
      --ulimit stack=67108864 \
      "${GPU_ENV_ARGS[@]}" \

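
The `"${GPU_ENV_ARGS[@]}"` expansion in the excerpt above forwards a variable number of arguments to `docker run`. A possible way to picture that, sketched in Python under assumptions: the list of forwarded variables (`GPU_ENV_VARS`), the helper names, and the image name used in the usage note are hypothetical, since the PR excerpt does not show how the array is built. Only a subset of the flags from the excerpt is repeated.

```python
import os

# Assumed pass-through list; the variables actually forwarded by the PR are not shown.
GPU_ENV_VARS = ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES")


def gpu_env_args(env=None):
    """Build `-e NAME=value` pairs for each GPU variable that is set."""
    env = os.environ if env is None else env
    args = []
    for var in GPU_ENV_VARS:
        if var in env:
            args += ["-e", f"{var}={env[var]}"]
    return args


def docker_run_argv(image, command, name, label, env=None):
    """Assemble a docker run argument list without invoking Docker."""
    return [
        "docker", "run", "--rm",
        "--name", name,
        "--label", label,
        "--network=host",
        "--ipc=host",
        "--device=/dev/kfd",
        "--device=/dev/dri",
        "--group-add", "video",
        "--shm-size=16G",
        *gpu_env_args(env),   # expands to nothing when no GPU variable is set
        image,
        *command,
    ]
```

Calling `docker_run_argv("iris-test-image", ["pytest"], "job-1", "owner=me", env={})` yields an argument list with no `-e` entries, which is the same effect an empty shell array has in the original command.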
mawad-amd approved these changes on May 12, 2026
Motivation
Refactor `run_tests_distributed.py` into a dual-mode launcher/worker so that the SLURM job can actually launch the tests correctly: it parses `--num_ranks` and spawns `torch.distributed.run`.
Test Plan
I ran the unit tests and an example Iris program via the SLURM scripts on an AMD cluster.
Test Result
PASSED
Submission Checklist