Training Gym SDK¶
> **Important**
> Modal's multi-node cluster training product is in early preview and not generally accessible. Please contact us for access.
Distributed training on Modal without hand-rolling a
launcher each time. Pick a training framework (slime, verl, Megatron,
MS-SWIFT, Lightning, HF Accelerate, or raw torchrun), plug in a model +
dataset config, and `modal run` it — training-gym handles the image, the
cluster topology, the Ray/NCCL bring-up, volume mounts, and checkpointing.
Packaged as modal-training-gym — pip-install once, then import
framework-specific launchers from your own scripts or notebooks. Every
tutorial is a runnable .py file and a matching .ipynb with the same
steps narrated cell-by-cell — the notebook is the place to read the
walkthrough; this README is the map.
Install¶
In a notebook or script:
! pip install -q git+https://github.com/modal-projects/training-gym.git@joy/initial-setup
Every generated tutorial notebook has this line as its first code cell.
Quickstart¶
1. Validate your Modal setup. Before launching anything expensive, run a 2 × 8 × H100 NCCL all-reduce to confirm multi-node training works in your workspace:
uv run modal run --detach tutorials/misc/nccl_benchmark/nccl_benchmark.py::run_benchmark
2. Run a tutorial. Qwen3-4B GRPO on GSM8K using SLIME:
uv run modal run tutorials/rl/slime_gsm8k/slime_gsm8k.py::app.download_model
uv run modal run tutorials/rl/slime_gsm8k/slime_gsm8k.py::app.prepare_dataset
uv run modal run --detach tutorials/rl/slime_gsm8k/slime_gsm8k.py::app.train
Or open the matching .ipynb in Jupyter / Modal Notebooks and run
cell-by-cell — each notebook is a self-contained walkthrough. See
tutorials/README.md for the full catalog.
Pick your framework¶
Each framework package exposes build_<name>_app(modal=..., config=...) —
a factory that returns a modal.App with download_model,
prepare_dataset, and train functions. Shared container objects
(DatasetConfig, Model, WandbConfig) plug into the framework config;
each framework translates them into its own CLI vocabulary.
| Framework | Good for | Abstraction | Example |
|---|---|---|---|
| `torchrun` | Any torchrun-compatible script; BYO training loop | Thin — cluster + launch only | `starcoder_llama2_7b` |
| `hf_accelerate` | Accelerate-based SFT, FSDP | Thin | `starcoder_llama2_7b` |
| `lightning` | PyTorch Lightning Fabric scripts | Thin | `lightning_fabric_demo` |
| `ms_swift` | LoRA / full SFT via ModelScope SWIFT (HF or Megatron backend) | Opinionated | `ms_swift_glm_4_7_gsm8k`, `ms_swift_custom_hf` |
| `megatron` | Full-parameter training on Megatron-LM (TP / PP / EP) | Opinionated | `megatron_glm_4_7_longmit128k` |
| `slime` | GRPO / RL post-training — Ray + Megatron + SGLang | Opinionated | `slime_gsm8k`, `slime_haiku` |
| `verl` | GRPO / RL post-training — Ray + Megatron + vLLM | Opinionated | `verl_qwen3_32b_gsm8k` |
"Thin" launchers give you a cluster and a torchrun — bring your own
training script. "Opinionated" launchers wrap a specific upstream framework
and expect you to configure it via that framework's CLI/YAML vocabulary.
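The distinction can be sketched in a few lines. The command and flag names below are illustrative stand-ins, not what the package actually emits:

```python
# "Thin": the launcher only supplies cluster topology plus a torchrun
# invocation; the training loop is your own script.
def thin_launch(script: str, nproc_per_node: int) -> list[str]:
    return ["torchrun", f"--nproc-per-node={nproc_per_node}", script]

# "Opinionated": the launcher speaks the upstream framework's own CLI,
# translating shared config fields into that vocabulary. The module and
# flag names here are made up for the sketch.
def opinionated_launch(model: str, dataset: str) -> list[str]:
    return ["python", "-m", "some_framework.train", "--model", model, "--data", dataset]

print(thin_launch("my_sft.py", 8))
# ['torchrun', '--nproc-per-node=8', 'my_sft.py']
```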
Source in modal_training_gym/frameworks/. Runnable examples in
tutorials/README.md.
License¶
MIT.
Developer Guide¶
Layout¶
modal_training_gym/ ← installable package
├── common/ ← cross-framework classes (datasets, models, wandb, Ray cluster helpers)
└── frameworks/ ← one package per training framework (see tutorials/ for the full list)
tutorials/ ← runnable examples — one folder per tutorial
├── tutorial_generator/ ← source files; each produces a .py + .ipynb
└── generate_tutorial.py ← AST-walks the sources, regenerates .py + .ipynb
dashboards/ ← Grafana-style dashboards for monitoring runs
skills/ ← agent skills for navigating this repo
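As a rough illustration of what `generate_tutorial.py` does — the real generator is more involved; this sketch only shows the core idea of AST-walking a source file and turning bare top-level string literals into markdown cells:

```python
import ast

def split_into_cells(source: str) -> list[tuple[str, str]]:
    """Split a tutorial-style .py into (kind, text) cells: bare top-level
    string literals become markdown cells, other statements become code
    cells. A simplified sketch, not the real generator."""
    lines = source.splitlines()
    cells = []
    for node in ast.parse(source).body:
        is_narration = (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        )
        if is_narration:
            cells.append(("markdown", node.value.value.strip()))
        else:
            # Recover the statement's exact source text from its line span.
            cells.append(("code", "\n".join(lines[node.lineno - 1 : node.end_lineno])))
    return cells

demo = '"""# Intro\n\nSome narration."""\nx = 1 + 1\nprint(x)\n'
for kind, text in split_into_cells(demo):
    print(kind, repr(text))
```

A real generator would also merge adjacent code statements into a single cell and attach notebook metadata; this version keeps one statement per cell for clarity.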
Dev setup¶
# editable install + pinned dev deps (pre-commit, ipykernel if you want)
uv sync
# register this venv as a Jupyter kernel (one-time, for notebook work)
uv run python -m ipykernel install --user --name=modal-training-gym
# install the pre-commit hook locally
uv run pre-commit install
Project Python is pinned to 3.12 (see .python-version / pyproject.toml);
every @app.function(serialized=True) requires the local ↔ remote Python
versions to match, and the framework images we use (slime nightly, NeMo
25.11, verl vllm011.latest) all ship py312.
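To fail fast on a mismatch, a guard like the following can run before any launch. The `(3, 12)` pin mirrors `.python-version`; the helper itself is only an illustration, not part of the package:

```python
import sys

REQUIRED = (3, 12)  # mirrors .python-version / pyproject.toml

def python_pin_ok(version_info=sys.version_info, required=REQUIRED) -> bool:
    """True if the interpreter's major.minor matches the project pin.

    Illustrative guard: serialized Modal functions need the local and
    remote Python versions to match, so check before launching.
    """
    return tuple(version_info[:2]) == required

# Example: a 3.11 interpreter would fail the check.
print(python_pin_ok((3, 12, 0)), python_pin_ok((3, 11, 9)))  # True False
```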
Authoring a new tutorial¶
See tutorials/README.md
for the generator-source format and the per-tutorial TUTORIAL_METADATA
schema.
Contributing a new framework¶
- Create `modal_training_gym/frameworks/<name>/` with `config.py`, `launcher.py`, `__init__.py`. `build_<name>_app(*, modal, config, name=None) -> modal.App` is the public entrypoint — same shape as the existing frameworks. See `modal_training_gym/frameworks/verl/launcher.py` for a complete reference.
- Add a new `tutorials/tutorial_generator/<tutorial>.py` demonstrating it, and run the generator.
- Container configs (`dataset`, `model`, `wandb`) are interpreted explicitly in each framework — don't try to share CLI vocabularies. The common config classes stay pure data.