Distributed Memory Scaling (FSDP / DeepSpeed ZeRO)

Community

Config-driven, portable distributed training.

Authorsovr610
Version1.0.0
Installs0

System Documentation

What problem does it solve?

The skill provides a unified, config-driven approach to running distributed training across multiple backends (DDP, FSDP, and DeepSpeed ZeRO) with portable weights and cross-strategy checkpointing.

Core Features & Use Cases

  • StrategyRouter enables same training scripts to switch between DDP, FSDP, and DeepSpeed ZeRO purely via configuration.
  • Lightweight wrappers and utilities handle portable full-state dict export, sharded checkpointing, and deterministic DeepSpeed JSON config generation.
  • Cross-strategy portability and cross-world-size resume workflows empower scalable experimentation and production training pipelines.
  • A scaling benchmark scaffold measures single-GPU vs multi-GPU performance, with a metrics schema compatible with performance regression gates.

Quick Start

Configure StrategyRouter with your chosen strategy, wrap the model, create the optimizer, and run training using a single, config-driven code path.

Dependency Matrix

Required Modules

None required

Components

scriptsreferencesassets

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: Distributed Memory Scaling (FSDP / DeepSpeed ZeRO)
Download link: https://github.com/sovr610/refffiy/archive/main.zip#distributed-memory-scaling-fsdp-deepspeed-zero

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.