webdataset-streaming

Community

Efficient tar-shard streaming for large datasets

Author: dongzhuoyao
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

WebDataset Streaming replaces slow per-file DataLoaders with high-throughput sequential reads from tar shard archives, so training pipelines are not starved by I/O or filesystem metadata overhead. It removes the need for random file access on large or networked filesystems, and it lets you reuse precomputed encoder outputs to save GPU time.
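To make the sequential-read idea concrete, here is a minimal sketch that uses only the Python standard library rather than the WebDataset package itself. The helper names `iter_samples` and `make_demo_shard` are hypothetical, and the grouping rule (files sharing the text before the first dot form one sample) is a simplified assumption modeled on the usual WebDataset key convention.

```python
import io
import tarfile


def iter_samples(fileobj):
    """Yield one dict per sample from a tar stream, reading front to back.

    Assumption: files whose names share the text before the first dot
    belong to the same sample and are stored next to each other.
    """
    current_key, sample = None, {}
    # Mode "r|" treats the archive as a non-seekable stream (no random access).
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            sample["__key__"] = key
            # In stream mode the payload must be read before advancing.
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


def make_demo_shard(samples):
    """Build a tiny in-memory tar shard so the sketch is self-contained."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, files in samples:
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


demo = make_demo_shard([
    ("000000", {"jpg": b"img0", "json": b"{}"}),
    ("000001", {"jpg": b"img1", "json": b"{}"}),
])
samples = list(iter_samples(demo))
```

Because the reader never seeks, the same loop works over a local file, a pipe, or a network stream, which is the property that avoids per-file metadata lookups.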

Core Features & Use Cases

  • Shard creation and precompute: Use ShardWriter-style workflows to pack samples and optional precomputed encoder latents into tar shards for efficient sequential reads.
  • Progress & epoch estimation: Generate sizes.json to estimate sample counts and power progress bars for IterableDataset training.
  • Shuffle strategies: Guidance on shard-level versus sample-level shuffling, plus recommendations for shuffle buffer sizes and shard granularity.
  • DataLoader integration: Conditional DataLoader kwargs, persistent worker guidance, and explicit advice to avoid passing WebDataset loaders to accelerator.prepare.
  • Latent-shard variant: Store .pth + .json pairs to eliminate encoder forward passes, and decode the payloads with weights_only disabled when needed.
  • Operational gotchas: Recommendations for num_workers, GPU utilization monitoring, key naming conventions, and common anti-patterns to avoid.
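The shard-creation and sizes.json features above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the skill's own implementation: `write_shards` is a hypothetical helper, the sizes.json layout (shard filename mapped to sample count) is assumed, and the `.pth` payloads here are placeholder bytes standing in for serialized latents.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shards(samples, out_dir, samples_per_shard=1000, prefix="shard"):
    """Pack (key, {ext: bytes}) samples into numbered tar shards.

    Also writes sizes.json mapping each shard filename to its sample count
    (an assumed layout; adapt it to whatever your loader expects).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sizes = {}
    tar, shard_name = None, None
    for index, (key, files) in enumerate(samples):
        if index % samples_per_shard == 0:
            # Roll over to a new shard every samples_per_shard samples.
            if tar is not None:
                tar.close()
            shard_name = f"{prefix}-{index // samples_per_shard:06d}.tar"
            tar = tarfile.open(out_dir / shard_name, mode="w")
            sizes[shard_name] = 0
        # Files of one sample share a key and are written back to back.
        for ext, payload in files.items():
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        sizes[shard_name] += 1
    if tar is not None:
        tar.close()
    (out_dir / "sizes.json").write_text(json.dumps(sizes, indent=2))
    return sizes


with tempfile.TemporaryDirectory() as tmp:
    demo_samples = [
        (f"{i:06d}", {"pth": b"latent-bytes", "json": json.dumps({"label": i}).encode()})
        for i in range(5)
    ]
    sizes = write_shards(demo_samples, tmp, samples_per_shard=2)
    # Summing the counts gives the estimated epoch length for a progress bar.
    estimated_len = sum(sizes.values())
    shard_files = sorted(p.name for p in Path(tmp).glob("*.tar"))
```

Keeping the per-shard counts, rather than a single total, lets the estimate stay correct when only a subset of shards is assigned to a worker or node.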

Quick Start

Point your WebDataset loader at the resolved list of shard files, create sizes.json to estimate dataset length, and set num_workers >= 4 with an appropriate shuffle buffer before training.
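The configuration step can be sketched as follows, again without importing PyTorch or WebDataset. Both helpers are hypothetical: `dataloader_kwargs` builds a keyword dict that only includes `persistent_workers` and `prefetch_factor` when workers are enabled (PyTorch's DataLoader rejects those options when `num_workers` is 0), and `shuffle_buffer` is a plain-Python illustration of sample-level buffer shuffling, not the library's own routine. The default values are assumptions.

```python
import random


def dataloader_kwargs(num_workers=4, batch_size=32, prefetch_factor=2):
    """Build DataLoader keyword arguments conditionally on num_workers."""
    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if num_workers > 0:
        # These options are only valid with worker processes.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


def shuffle_buffer(iterable, buffer_size, seed=0):
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


train_kwargs = dataloader_kwargs(num_workers=4)
debug_kwargs = dataloader_kwargs(num_workers=0)
shuffled = list(shuffle_buffer(range(100), buffer_size=16))
```

A larger buffer mixes samples more thoroughly at the cost of memory and startup latency, so it is usually paired with shuffling the shard list itself each epoch.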

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: webdataset-streaming
Download link: https://github.com/dongzhuoyao/tao-research-skills/archive/main.zip#webdataset-streaming

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.