checkpoints
Official
Reliable ML checkpointing for resume and sampling
Author: thinking-machines-lab
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Checkpointing addresses two needs: restarting interrupted training runs and exporting sampler weights for evaluation. It provides structured saving, loading, and lifecycle management, with each checkpoint tracked through a CheckpointRecord.
Core Features & Use Cases
- Save the full training state (weights and optimizer state) with save_state to resume training.
- Save only the sampler weights with save_weights_for_sampler for sampling or export, or use save_weights_and_get_sampling_client to save the weights and obtain a ready-to-use sampling client in one step.
- Manage checkpoints through the REST API and CLI (list, publish, set TTL, delete) to control their lifecycle and provenance.
- Serialize and deserialize checkpoint records with CheckpointRecord for robust tracking and reproducibility.
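To illustrate the kind of record-keeping described above (serialization for tracking, plus TTL-based lifecycle), here is a minimal sketch. It does not use the skill's actual CheckpointRecord: the class name CheckpointRecordSketch, its fields (name, path, kind, created_at, ttl_seconds), and the placeholder path format are all hypothetical assumptions chosen for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CheckpointRecordSketch:
    """Hypothetical stand-in for CheckpointRecord; all field names are assumptions."""

    name: str                          # checkpoint name given at save time
    path: str                          # storage location (placeholder format, not a real scheme)
    kind: str                          # "state" (save_state) or "sampler" (save_weights_for_sampler)
    created_at: float                  # Unix timestamp in seconds
    ttl_seconds: Optional[int] = None  # None means the checkpoint never expires

    def to_json(self) -> str:
        # Sorted keys give a stable, diff-friendly serialization.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "CheckpointRecordSketch":
        # Rebuild the record from the JSON produced by to_json.
        return cls(**json.loads(text))

    def is_expired(self, now: Optional[float] = None) -> bool:
        # A checkpoint with no TTL is kept indefinitely.
        if self.ttl_seconds is None:
            return False
        current = time.time() if now is None else now
        return current >= self.created_at + self.ttl_seconds


record = CheckpointRecordSketch(
    name="step-1000",
    path="example://run-1/step-1000",  # hypothetical placeholder path
    kind="state",
    created_at=0.0,
    ttl_seconds=3600,
)
restored = CheckpointRecordSketch.from_json(record.to_json())
print(restored == record)                # True: the round trip is lossless
print(restored.is_expired(now=1800.0))   # False: 30 minutes into a 1-hour TTL
print(restored.is_expired(now=3600.0))   # True: the TTL has elapsed
```

Storing the record as plain JSON keeps it readable and easy to version alongside training logs; the real CheckpointRecord may carry different or additional fields.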
Quick Start
Save the full training state with save_state for resumption, or export only the sampler weights with save_weights_for_sampler for fast evaluation.
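The distinction above matters when choosing which checkpoint to load: only a full-state checkpoint carries optimizer state, so only it can resume training exactly, while a sampler-weights checkpoint is meant for evaluation. The sketch below picks the newest checkpoint of each kind. SavedCheckpoint and latest are hypothetical helpers, not part of the skill's API, and the "state"/"sampler" labels are assumed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SavedCheckpoint:
    """Hypothetical summary of one saved checkpoint (not the library's type)."""

    name: str
    kind: str  # "state": from save_state; "sampler": from save_weights_for_sampler
    step: int  # training step at which the checkpoint was written


def latest(checkpoints: Iterable[SavedCheckpoint], kind: str) -> Optional[SavedCheckpoint]:
    """Return the newest checkpoint of the given kind, or None if there is none.

    Use kind="state" to resume training (sampler-only checkpoints lack optimizer
    state) and kind="sampler" to pick weights for evaluation.
    """
    matching = [c for c in checkpoints if c.kind == kind]
    return max(matching, key=lambda c: c.step, default=None)


history = [
    SavedCheckpoint("step-100", "state", 100),
    SavedCheckpoint("step-150", "sampler", 150),
    SavedCheckpoint("step-200", "state", 200),
    SavedCheckpoint("step-250", "sampler", 250),
]
resume_from = latest(history, "state")      # step-200: newest full training state
evaluate_with = latest(history, "sampler")  # step-250: newest sampler weights
print(resume_from.name, evaluate_with.name)
```

Note that resuming from step 200 discards the progress behind the step-250 sampler checkpoint; saving full state as often as sampler weights avoids that gap.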
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: let Claude install the skill automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill: Name: checkpoints Download link: https://github.com/thinking-machines-lab/tinker-cookbook/archive/main.zip#checkpoints Please download this .zip file, extract it, and install it in the .claude/skills/ directory.