
Document the SLURM workflow for running Iris core tests on scheduler-…#533

Merged: artulab merged 1 commit into main from artulab/slurm on May 12, 2026

Conversation

@artulab
Collaborator

@artulab artulab commented May 12, 2026

Motivation

  • Turn run_tests_distributed.py into a dual-mode launcher/worker so that the SLURM job launches tests correctly: the launcher parses --num_ranks and spawns torch.distributed.run.
  • Document the SLURM workflow for running Iris core tests and example Iris programs.
  • Add SLURM scripts that are general enough for any single-node job.
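
The dual-mode idea in the first bullet can be sketched roughly as follows. This is only an illustration, not the code merged in this PR: the function names, the default script path argument, and the choice of `--standalone` for single-node rendezvous are assumptions. The sketch returns the command instead of executing it, so it can be read and tested without PyTorch installed; a real launcher would hand the list to `subprocess.run`.

```python
import argparse
import os
import sys


def running_under_torchrun(env=None):
    """Return True when the per-rank variables set by torchrun are present."""
    env = os.environ if env is None else env
    return "RANK" in env and "WORLD_SIZE" in env


def build_launch_command(num_ranks, script, passthrough):
    """Build the command a launcher could use to re-invoke itself under torchrun.

    Assumption: a single-node job, so --standalone rendezvous is sufficient.
    """
    return [
        sys.executable, "-m", "torch.distributed.run",
        "--standalone",
        f"--nproc_per_node={num_ranks}",
        script,
        *passthrough,
    ]


def main(argv=None, env=None, script="tests/run_tests_distributed.py"):
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_ranks", type=int, default=1)
    # Anything the parser does not recognize is forwarded to the worker.
    args, passthrough = parser.parse_known_args(argv)

    if running_under_torchrun(env):
        # Worker mode: this process is one rank and would run the tests here.
        return "worker", passthrough

    # Launcher mode: describe how to spawn one worker process per rank.
    return "launcher", build_launch_command(args.num_ranks, script, passthrough)
```

Checking the environment first means the same file can be submitted directly from a batch script or started by torchrun, without a separate wrapper.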

Test Plan

I ran the unit tests and an example Iris program via the SLURM scripts on an AMD cluster.

Test Result

PASSED

Submission Checklist

…managed GPU clusters. Add a getting-started SLURM guide and make the batch wrapper more portable by removing user-specific paths and using generic scratch and log defaults.
Copilot AI review requested due to automatic review settings May 12, 2026 02:08
@github-actions (bot) added the labels in-progress (We are working on it) and iris (Iris project issue) on May 12, 2026
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves the ability to run Iris core tests and examples on SLURM-managed GPU clusters by fixing the distributed pytest runner and adding documented, reusable SLURM batch wrappers.

Changes:

  • Refactors tests/run_tests_distributed.py into a launcher/worker that can self-invoke torchrun.
  • Adds SLURM batch scripts to run core tests and arbitrary example scripts inside a prebuilt Docker image, staging to node-local scratch and copying artifacts back.
  • Updates docs to include a new SLURM guide and cross-links from installation/index pages.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/run_tests_distributed.py | Makes the distributed test runner usable both directly and under torchrun, and fixes rendezvous launching behavior. |
| scripts/run_example_slurm.sh | Adds a generic SLURM wrapper to run example scripts in Docker and persist logs/results. |
| scripts/run_core_tests_slurm.sh | Adds a SLURM wrapper to run the core test suite in Docker and persist logs. |
| scripts/run_core_tests.sh | Adds GPU visibility detection to skip rank configs that exceed allocated/visible GPUs; uses python3/python discovery. |
| docs/index.md | Links to the new SLURM guide from the docs landing page. |
| docs/getting-started/slurm.md | Adds a new guide describing recommended SLURM workflows and the provided scripts. |
| docs/getting-started/installation.md | Adds cross-links to the SLURM guide for scheduler-managed environments. |
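
The GPU visibility check described for scripts/run_core_tests.sh could look roughly like the sketch below. The real script is a shell script and its detection logic is not shown on this page, so everything here is an assumption made for illustration: the helper names, the set of environment variables consulted, and the choice to treat an empty variable as unset.

```python
import os

# Assumption: visibility is inferred from the usual ROCm/CUDA device masks.
VISIBILITY_VARS = ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")


def visible_gpu_count(env=None):
    """Return the number of GPUs listed in the first non-empty mask, or None."""
    env = os.environ if env is None else env
    for name in VISIBILITY_VARS:
        value = env.get(name, "").strip()
        if value:  # an empty value is treated as unset in this sketch
            return len([dev for dev in value.split(",") if dev.strip()])
    return None  # no mask found: the caller cannot tell how many GPUs exist


def runnable_rank_configs(rank_configs, gpu_count):
    """Keep only the rank counts that fit on the visible GPUs."""
    if gpu_count is None:
        return list(rank_configs)  # unknown GPU count: skip nothing
    return [ranks for ranks in rank_configs if ranks <= gpu_count]


if __name__ == "__main__":
    count = visible_gpu_count()
    print("visible GPUs:", count)
    print("rank configs to run:", runnable_rank_configs([1, 2, 4, 8], count))
```

Skipping oversized rank configurations keeps a job that was allocated fewer GPUs than the node holds from failing on tests it could never run.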

Comment on lines +95 to +108
copy_artifacts_and_cleanup() {
    local exit_code=$1

    mkdir -p "$PERSIST_LOG_ROOT"
    if [ -d "$WORKSPACE_DIR/logs" ]; then
        rsync -a "$WORKSPACE_DIR/logs/" "$PERSIST_LOG_ROOT/logs/" || exit_code=$?
    fi
    if [ -d "$WORKSPACE_DIR/results" ]; then
        rsync -a "$WORKSPACE_DIR/results/" "$PERSIST_LOG_ROOT/results/" || exit_code=$?
    fi

    rm -rf "$WORK_ROOT"
    exit "$exit_code"
}
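
For comparison, here is a Python sketch of the same copy-back-then-clean-up pattern. It is deliberately a variant rather than a translation: it keeps the scratch directory when a copy fails so the artifacts are not lost, and it keeps the job's own exit code unless that code was 0. Both choices are assumptions of this sketch, not behavior of the script in the PR, and `shutil.copytree(..., dirs_exist_ok=True)` requires Python 3.8 or newer.

```python
import shutil
import tempfile
from pathlib import Path


def copy_artifacts_and_cleanup(workspace, persist_root, work_root, exit_code):
    """Copy logs/ and results/ to persistent storage, then remove scratch."""
    copy_failed = False
    persist_root.mkdir(parents=True, exist_ok=True)
    for name in ("logs", "results"):
        src = workspace / name
        if src.is_dir():
            try:
                shutil.copytree(src, persist_root / name, dirs_exist_ok=True)
            except OSError:  # shutil.Error is a subclass of OSError
                copy_failed = True
    if copy_failed:
        # Variant: leave scratch in place so nothing is lost, and report failure.
        return exit_code or 1
    shutil.rmtree(work_root, ignore_errors=True)
    return exit_code


def demo():
    """Run the helper on a throwaway directory tree (hypothetical layout)."""
    with tempfile.TemporaryDirectory() as tmp:
        work_root = Path(tmp) / "scratch"
        workspace = work_root / "iris"
        (workspace / "logs").mkdir(parents=True)
        (workspace / "logs" / "run.log").write_text("ok")
        persist = Path(tmp) / "persist"
        code = copy_artifacts_and_cleanup(workspace, persist, work_root, 3)
        return code, (persist / "logs" / "run.log").exists(), work_root.exists()
```

In `demo()`, the job's exit code of 3 is passed through unchanged, the log file ends up under the persistent root, and the scratch tree is removed.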


def _running_under_torchrun() -> bool:
    return "RANK" in os.environ and "WORLD_SIZE" in os.environ
Comment on lines +85 to +98
docker run --rm \
    --name "$CONTAINER_NAME" \
    --label "$CONTAINER_LABEL" \
    --network=host \
    --ipc=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --shm-size=16G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    "${GPU_ENV_ARGS[@]}" \

docker run --rm \
--name "$CONTAINER_NAME" \
--label "$CONTAINER_LABEL" \
@artulab artulab merged commit abd3f2a into main May 12, 2026
47 of 50 checks passed
@artulab artulab deleted the artulab/slurm branch May 12, 2026 16:36

Labels

in-progress (We are working on it), iris (Iris project issue)


3 participants