webdataset-streaming
Community
Efficient tar-shard streaming for large datasets
System Documentation
What problem does it solve?
WebDataset Streaming replaces slow per-file DataLoaders with high-throughput sequential reads from tar shard archives, so training pipelines are not starved by I/O or filesystem metadata overhead. It removes the need for random file access on large or networked filesystems and lets you reuse precomputed encoder outputs to save GPU time.
Core Features & Use Cases
- Shard creation and precompute: Use ShardWriter-style workflows to pack samples and optional precomputed encoder latents into tar shards for efficient sequential reads.
- Progress & epoch estimation: Generate sizes.json to estimate sample counts and power progress bars for IterableDataset training.
- Shuffle strategies: Guidelines for shard-level versus sample-level shuffling, plus recommendations for shuffle buffer sizes and shard granularity.
- DataLoader integration: Conditional DataLoader kwargs, persistent worker guidance, and explicit advice to avoid passing WebDataset loaders to accelerator.prepare.
- Latent-shard variant: Store .pth + .json pairs so training can skip encoder forward passes, and decode the .pth payloads robustly, disabling weights_only only when the payload requires it.
- Operational gotchas: Recommendations for num_workers, GPU utilization monitoring, key naming conventions, and common anti-patterns to avoid.
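The shard-creation, sizes.json, and shuffle-buffer ideas in the list above can be sketched with the Python standard library alone. This is a minimal illustration, not the skill's actual implementation: it uses `tarfile` in place of a ShardWriter-style class, and the helper names (`write_shard`, `count_samples`, `write_sizes`, `shuffle_buffer`), the shard file names, and the sizes.json layout (shard file name mapped to sample count) are all assumptions made for the example.

```python
import io
import json
import random
import tarfile
import tempfile
from pathlib import Path


def write_shard(path, samples):
    """Write (key, {extension: bytes}) samples into one tar shard.

    Files that share the same key (name up to the first dot) form one sample.
    """
    with tarfile.open(path, "w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def sample_key(member_name):
    """Directory plus basename up to the first dot."""
    dirname, _, basename = member_name.rpartition("/")
    return f"{dirname}/{basename.split('.', 1)[0]}"


def count_samples(path):
    """Count distinct sample keys by scanning the shard's member names."""
    with tarfile.open(path, "r") as tar:
        return len({sample_key(m.name) for m in tar if m.isfile()})


def write_sizes(shard_dir, out_name="sizes.json"):
    """Assumed layout: {"<shard file name>": <sample count>, ...}."""
    shard_dir = Path(shard_dir)
    sizes = {p.name: count_samples(p) for p in sorted(shard_dir.glob("*.tar"))}
    (shard_dir / out_name).write_text(json.dumps(sizes, indent=2))
    return sizes


def shuffle_buffer(iterable, buffer_size, rng):
    """Sample-level shuffle: emit a random buffered item as each new one arrives."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    buf = []
    for item in iterable:
        if len(buf) < buffer_size:
            buf.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buf[idx]
        buf[idx] = item
    rng.shuffle(buf)  # drain whatever is left in random order
    yield from buf


# Demo with placeholder payloads (not real images).
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        (f"sample{i:04d}", {"jpg": b"fake-bytes", "json": json.dumps({"label": i}).encode()})
        for i in range(5)
    ]
    write_shard(Path(tmp) / "train-000000.tar", samples[:3])
    write_shard(Path(tmp) / "train-000001.tar", samples[3:])
    sizes = write_sizes(tmp)

shuffled = list(shuffle_buffer(range(10), buffer_size=4, rng=random.Random(0)))
```

A larger buffer mixes samples more thoroughly at the cost of memory. Shard-level shuffling (shuffling the list of shard paths before reading) is cheap and is typically combined with a sample-level buffer like this one.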
Quick Start
Point your WebDataset loader to your resolved list of shard files, create sizes.json for estimated length, and configure num_workers >= 4 plus an appropriate shuffle buffer before training.
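The preparation steps above might look like the following sketch, which stays in the standard library so it can run anywhere. The helper names (`resolve_shards`, `estimated_length`, `loader_kwargs`), the `train-*.tar` naming, the sizes.json layout, and the batch size are assumptions for illustration. The returned dictionary is meant to be passed as keyword arguments to `torch.utils.data.DataLoader`, where `persistent_workers` and `prefetch_factor` are only accepted when `num_workers > 0`, hence the conditional.

```python
import glob
import json
import os
import tempfile


def resolve_shards(pattern):
    """Expand a glob pattern into a sorted, explicit list of shard paths."""
    shards = sorted(glob.glob(pattern))
    if not shards:
        raise FileNotFoundError(f"no shards match {pattern!r}")
    return shards


def estimated_length(shards, sizes):
    """Sum per-shard counts; shards missing from sizes.json count as 0."""
    return sum(sizes.get(os.path.basename(s), 0) for s in shards)


def loader_kwargs(num_workers, batch_size):
    """Build DataLoader kwargs, adding worker-only options conditionally."""
    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if num_workers > 0:
        # Both options raise an error in DataLoader when num_workers == 0.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = 2
    return kwargs


# Demo: empty placeholder shards plus a hand-written sizes.json.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        open(os.path.join(tmp, f"train-{i:06d}.tar"), "wb").close()
    with open(os.path.join(tmp, "sizes.json"), "w") as f:
        json.dump(
            {"train-000000.tar": 100, "train-000001.tar": 100, "train-000002.tar": 60}, f
        )

    shards = resolve_shards(os.path.join(tmp, "train-*.tar"))
    with open(os.path.join(tmp, "sizes.json")) as f:
        shard_sizes = json.load(f)
    total = estimated_length(shards, shard_sizes)

train_kwargs = loader_kwargs(num_workers=4, batch_size=32)
debug_kwargs = loader_kwargs(num_workers=0, batch_size=32)
```

The estimated total can drive a progress bar for an IterableDataset, which has no exact length of its own. It is only an estimate, since the number of samples actually seen per epoch depends on how shards are split across workers and processes.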
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: webdataset-streaming
Download link: https://github.com/dongzhuoyao/tao-research-skills/archive/main.zip#webdataset-streaming
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.