nemo-curator
Community
GPU-accelerated data curation for LLM training.
Data & Analytics
#GPU acceleration #multimodal #deduplication #PII redaction #data curation #LLM training data #Nemo Curator
Author: ovachiever
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
NeMo Curator provides GPU-accelerated data curation for large multimodal datasets used in LLM training, enabling fast deduplication, high-quality filtering, PII redaction, and NSFW detection across text, image, video, and audio data.
Core Features & Use Cases
- 16× faster fuzzy deduplication on large datasets (e.g., 8TB RedPajama v2)
- 30+ quality filters and heuristics for data quality
- Exact, fuzzy, and semantic deduplication options
- PII redaction and NSFW detection to safeguard privacy and safety
- Scales across GPU clusters for web data curation and large corpora
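To make the difference between exact and fuzzy deduplication concrete, here is a minimal CPU-only sketch in plain Python. It does not use NeMo Curator's API; the function names (`exact_key`, `shingles`, `minhash_signature`, `estimated_jaccard`) and the parameters (3-word shingles, 64 hash functions) are illustrative assumptions. Exact deduplication compares a hash of the normalized text, while fuzzy deduplication estimates Jaccard similarity from MinHash signatures.

```python
import hashlib
import re


def exact_key(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text; equal keys mean exact duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams used as the document's feature set (n is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(features: set, num_hashes: int = 64) -> list:
    """Keep the minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for f in features
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. A production system typically adds locality-sensitive hashing so that not every pair has to be compared; that step is omitted here.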
Quick Start
Install NeMo Curator, load 600+ tools, and run a simple corpus curation pipeline that filters low-quality documents, deduplicates, redacts PII, and saves to Parquet.
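The shape of such a pipeline can be sketched in plain Python. This is a CPU-only, standard-library illustration of the steps, not NeMo Curator's actual API: the helper names (`is_high_quality`, `redact_pii`, `curate`), the five-word quality threshold, and the email-only PII pattern are all assumptions made for the example. Writing the result to Parquet is left out because it needs a third-party library.

```python
import hashlib
import re

# Assumed, deliberately simple PII pattern: email addresses only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_high_quality(text: str, min_words: int = 5) -> bool:
    """Stand-in quality heuristic: keep documents with at least min_words words."""
    return len(text.split()) >= min_words


def redact_pii(text: str) -> str:
    """Replace every email address with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def curate(docs: list) -> list:
    """Filter, exact-deduplicate, and redact a list of {"id", "text"} records."""
    seen = set()
    curated = []
    for doc in docs:
        text = doc["text"]
        if not is_high_quality(text):
            continue  # drop low-quality documents
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # drop exact duplicates, keeping the first occurrence
        seen.add(key)
        curated.append({**doc, "text": redact_pii(text)})
    return curated
```

Calling `curate` on a list of records returns the surviving records in input order, with emails masked. In a real deployment each step would be replaced by the corresponding GPU-backed component and the output written to Parquet.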
Dependency Matrix
Required Modules
nemo-curator, cudf, dask, rapids
Components
scripts, references, assets
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: nemo-curator Download link: https://github.com/ovachiever/droid-tings/archive/main.zip#nemo-curator Please download this .zip file, extract it, and install it in the .claude/skills/ directory.