Precision + Numerics Stabilizer (bf16/fp16 Done Right)


Safe, reliable mixed-precision training

Author: sovr610
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Adds a precision layer and numerics instrumentation that make mixed-precision training safe to run unattended: it detects NaN/Inf values, monitors gradient norms, and writes reproducible failure snapshots.

Core Features & Use Cases

  • Automatic detection of NaN, Inf, and gradient overflow, followed by either an automatic abort or a debug snapshot.
  • GradScaler integration for fp16 with automatic scale management and safe unscale/clip ordering.
  • Flexible precision modes: bf16, fp16, or fp32 with autocast constraints and master-weights-in-fp32 invariant.
  • Numerics sentinel: gradient norms, logit monitoring, activation checks, and weight non-finite checks.
  • Snapshotting: 7-file failure snapshots, written atomically, that contain diagnostic data and repro scripts for reproducibility.
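
The atomic-write idea behind the snapshot feature can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual implementation: the helper name write_atomic, the file name, and the JSON payload are all hypothetical, and the contents of the seven snapshot files are not specified here. The technique shown is the common "write to a temporary file in the same directory, then rename" pattern, which keeps a reader from ever seeing a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def write_atomic(path: Path, payload: dict) -> None:
    """Write `payload` as JSON so that `path` is either absent or complete.

    Hypothetical helper for illustration only.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic swap into place
    except BaseException:
        # Clean up the partial temp file; the target is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = Path(tmp_dir) / "snapshot" / "diagnostics.json"  # hypothetical name
        write_atomic(target, {"step": 1200, "reason": "non-finite loss"})
        print(target.read_text(encoding="utf-8"))
```

If the process dies mid-write, only the temporary file is affected, so a later debugging session never loads a truncated snapshot.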

Quick Start

Integrate the stabilizer into your PyTorch training loop, then run training as usual. Automatic NaN detection, mixed-precision safety checks, and reproducible failure snapshots are then active for the run.
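
For orientation, here is a rough sketch of the kind of training step such a stabilizer wraps, written against plain PyTorch rather than the skill's own API (which is not documented on this page). The names train_step and make_scaler are hypothetical. It shows the ordering the feature list refers to: check the loss for non-finite values, scale and backpropagate, unscale the gradients, and only then clip them. Parameters stay in fp32 while autocast runs the forward pass in reduced precision.

```python
import torch
from torch import nn


def make_scaler(enabled: bool):
    """Hypothetical helper: build a GradScaler on old and new PyTorch releases."""
    if hasattr(torch.amp, "GradScaler"):
        return torch.amp.GradScaler("cuda", enabled=enabled)
    return torch.cuda.amp.GradScaler(enabled=enabled)


def train_step(model, optimizer, scaler, x, y, device_type, amp_dtype, max_norm=1.0):
    """One mixed-precision step; returns (loss, pre-clip gradient norm) as floats."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, dtype=amp_dtype):
        # Compute the loss in fp32 even though the forward pass is low precision.
        loss = nn.functional.mse_loss(model(x).float(), y)
    if not torch.isfinite(loss):
        raise FloatingPointError(f"non-finite loss: {loss.item()}")
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale first, so clipping sees true gradient values
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)  # with scaling enabled, skipped if gradients are Inf/NaN
    scaler.update()
    return loss.item(), float(grad_norm)


# Usage: fp16 with loss scaling on a GPU, bf16 without scaling on CPU.
use_cuda = torch.cuda.is_available()
device_type = "cuda" if use_cuda else "cpu"
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

torch.manual_seed(0)
model = nn.Linear(8, 1).to(device_type)  # parameters remain fp32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = make_scaler(enabled=use_cuda)
x = torch.randn(16, 8, device=device_type)
y = torch.randn(16, 1, device=device_type)

loss_value, grad_norm = train_step(model, optimizer, scaler, x, y, device_type, amp_dtype)
print(f"loss={loss_value:.4f} grad_norm={grad_norm:.4f}")
```

Loss scaling is enabled only for fp16 in this sketch; bf16 shares the fp32 exponent range, so it is commonly run without a scaler. With fp16, an occasional Inf gradient norm is expected while the scaler searches for a workable scale, so a monitor would typically flag repeated overflow rather than a single skipped step.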

Dependency Matrix

Required Modules

torch, numpy, pytest

Components

scripts, references, assets

💻 Claude Code Installation

Recommended: let Claude install the Skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: Precision + Numerics Stabilizer (bf16/fp16 Done Right)
Download link: https://github.com/sovr610/refffiy/archive/main.zip#precision-numerics-stabilizer-bf16-fp16-done-right

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
