nemo-curator

Community

GPU-accelerated data curation for LLM training.

Author: ovachiever
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

NeMo Curator provides GPU-accelerated data curation for large multimodal datasets used in LLM training, enabling fast deduplication, high-quality filtering, PII redaction, and NSFW detection across text, image, video, and audio data.

Core Features & Use Cases

  • 16× faster fuzzy deduplication on large datasets (e.g., 8TB RedPajama v2)
  • 30+ quality filters and heuristics for data quality
  • Exact, fuzzy, and semantic deduplication options
  • PII redaction and NSFW detection to safeguard privacy and safety
  • Scales across GPU clusters for web data curation and large corpora
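To make the difference between exact and fuzzy deduplication concrete, here is a minimal CPU-only sketch using only the Python standard library. It does not use NeMo Curator's API; the helper names (`shingles`, `jaccard`, `exact_key`) and the 3-word shingle size are assumptions chosen for illustration. Exact deduplication compares content hashes, while fuzzy deduplication compares overlap between sets of word shingles.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def exact_key(text):
    """Hash of the raw text; identical documents share the same key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",        # exact duplicate
    "the quick brown fox jumps over the lazy dog today",  # near duplicate
    "gpu clusters speed up large scale data curation",
]

# Exact dedup: count documents whose hash was already seen.
exact_dupes = len(docs) - len({exact_key(d) for d in docs})

# Fuzzy dedup: a high Jaccard score flags near duplicates that hashing misses.
near = jaccard(shingles(docs[0]), shingles(docs[2]))
far = jaccard(shingles(docs[0]), shingles(docs[3]))
print(exact_dupes, near, far)
```

Pairwise comparison like this is quadratic in the number of documents, so it only illustrates the idea at small scale; approximate methods are what make fuzzy deduplication practical on multi-terabyte corpora.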

Quick Start

Install NeMo Curator, load 600+ tools, and run a simple corpus curation pipeline that filters low-quality documents, deduplicates, redacts PII, and saves to Parquet.
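The stages of such a pipeline can be sketched in plain Python. This is a hedged illustration of the flow only, not NeMo Curator's API: the function names (`is_high_quality`, `deduplicate`, `redact_pii`, `curate`), the 5-word threshold, and the email-only PII pattern are all assumptions, and the result is serialized to JSON Lines as a stand-in for Parquet to avoid third-party dependencies.

```python
import hashlib
import json
import re

# Hypothetical PII pattern: matches email addresses only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_high_quality(text, min_words=5):
    """Toy quality heuristic: keep documents with at least min_words words."""
    return len(text.split()) >= min_words


def deduplicate(docs):
    """Exact deduplication: keep the first document for each content hash."""
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def redact_pii(text):
    """Replace every email address with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def curate(docs):
    """Filter, deduplicate, then redact, in that order."""
    kept = [d for d in docs if is_high_quality(d["text"])]
    kept = deduplicate(kept)
    return [{**d, "text": redact_pii(d["text"])} for d in kept]


corpus = [
    {"id": 1, "text": "Contact alice@example.com for the full training data report."},
    {"id": 2, "text": "Contact alice@example.com for the full training data report."},
    {"id": 3, "text": "too short"},
    {"id": 4, "text": "Curated corpora tend to improve downstream model quality."},
]

curated = curate(corpus)

# Stand-in for the Parquet write: one JSON object per line.
jsonl = "\n".join(json.dumps(d) for d in curated)
print(jsonl)
```

Deduplicating before redaction means duplicates are detected on the original text; running the cheap quality filter first reduces the number of documents that have to be hashed.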

Dependency Matrix

Required Modules

nemo-curator, cudf, dask, rapids

Components

scripts, references, assets

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: nemo-curator
Download link: https://github.com/ovachiever/droid-tings/archive/main.zip#nemo-curator

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.