Name: Training Data Dedup + Leakage Guard
Author: sovr610

System Documentation

What problem does it solve?

This skill deduplicates training data and keeps holdout/test content from leaking into training corpora, so that evaluation stays honest and training sets stay clean.

Core Features & Use Cases

Multi-stage dedup: canonicalization, exact dedup, fuzzy dedup via MinHash+LSH, and leakage guard against holdout sets.
Audit-ready: generates a reproducible audit trail and JSON report detailing duplicates removed and leakage detected.
Use Case: sanitize a large text corpus by removing exact and near-duplicate samples and flagging potential leakage before model training.
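
The first three stages listed above (canonicalization, exact dedup, fuzzy dedup via MinHash+LSH) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the skill's actual implementation: the function names, the signature length, the band count, the shingle width, and the 0.8 similarity threshold are all assumptions chosen for the example.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict

NUM_PERM = 64   # MinHash signature length (assumed value)
BANDS = 16      # LSH bands; each band covers NUM_PERM // BANDS rows
SHINGLE = 5     # character shingle width (assumed value)


def canonicalize(text):
    """Stage 1: Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def minhash(text):
    """Signature of per-seed minimum hashes over character shingles."""
    grams = {text[i:i + SHINGLE] for i in range(max(1, len(text) - SHINGLE + 1))}
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")  # salted BLAKE2b acts as the hash family
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for g in grams))
    return tuple(sig)


def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs, threshold=0.8):
    """Return (kept indices, removal records) for a list of raw strings."""
    rows = NUM_PERM // BANDS
    seen_exact = {}              # sha256 of canonical text -> first index
    sigs = {}                    # kept index -> MinHash signature
    buckets = defaultdict(list)  # (band number, band slice) -> kept indices
    kept, removed = [], []
    for idx, raw in enumerate(docs):
        text = canonicalize(raw)
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen_exact:  # Stage 2: exact duplicate after canonicalization
            removed.append({"index": idx, "reason": "exact",
                            "duplicate_of": seen_exact[digest]})
            continue
        sig = minhash(text)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]
        # Stage 3: LSH proposes candidates; the full signature confirms them.
        candidates = sorted({j for k in keys for j in buckets.get(k, ())})
        match = next((j for j in candidates
                      if similarity(sig, sigs[j]) >= threshold), None)
        if match is not None:
            removed.append({"index": idx, "reason": "near", "duplicate_of": match})
            continue
        seen_exact[digest] = idx
        sigs[idx] = sig
        for k in keys:
            buckets[k].append(idx)
        kept.append(idx)
    return kept, removed


base = ("Large language models are trained on web text that often contains "
        "many repeated passages and boilerplate.")
docs = [
    base,
    "  LARGE language   models are trained on web text that often contains "
    "many repeated passages and boilerplate.",  # same text after canonicalization
    base[:-1] + "!",                             # differs in one character
    "Holdout sets must stay separate from the training corpus to keep "
    "evaluation honest.",
]
kept, removed = dedup(docs)
print(kept, removed)
```

The sketch keeps the first occurrence of each sample and records every removal with a reason, which is the kind of information an audit trail would need. A production pipeline would more likely rely on a dedicated MinHash library and tune the band and row counts to the desired similarity threshold.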

Quick Start

Run the dedup-leakage pipeline on your training corpus to remove exact and near-duplicate samples and flag potential leakage against holdout sets.
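
The skill's actual entry point and report schema are not shown on this page. As a rough illustration of what the leakage-flagging step and its JSON report might look like, the sketch below flags training samples whose word n-grams overlap heavily with a holdout set. The function names, the 8-word n-gram width, the 0.5 overlap threshold, and the report fields are all hypothetical.

```python
import hashlib
import json
import re
import unicodedata

NGRAM = 8  # word n-gram width used for overlap checks (assumed value)


def canonicalize(text):
    """Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def ngrams(text, n=NGRAM):
    """Set of word n-grams; texts shorter than n words yield one gram."""
    words = canonicalize(text).split()
    if not words:
        return set()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def leakage_report(train, holdout, min_overlap=0.5):
    """Flag training samples that share many n-grams with the holdout set."""
    holdout_grams = set()
    for doc in holdout:
        holdout_grams |= ngrams(doc)
    flagged = []
    for idx, doc in enumerate(train):
        grams = ngrams(doc)
        if not grams:
            continue
        overlap = len(grams & holdout_grams) / len(grams)
        if overlap >= min_overlap:
            flagged.append({
                "index": idx,
                "overlap": round(overlap, 3),
                # Content hash lets a reviewer re-identify the sample later.
                "sha256": hashlib.sha256(canonicalize(doc).encode()).hexdigest(),
            })
    return {
        "train_size": len(train),
        "holdout_size": len(holdout),
        "ngram": NGRAM,
        "min_overlap": min_overlap,
        "leakage_flagged": flagged,
    }


train = [
    "Question: what is the capital of France? Answer: Paris is the capital of France.",
    "An unrelated training sample about gradient descent and learning rates in practice.",
]
holdout = [
    "question: what is the capital of france? answer: paris is the capital of france.",
]
report = leakage_report(train, holdout)
print(json.dumps(report, indent=2))
```

Flagging rather than silently deleting leaves the final decision to a reviewer, and recording the parameters alongside the results makes the report reproducible.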

Please help me install this Skill:
Name: Training Data Dedup + Leakage Guard
Download link: https://github.com/sovr610/refffiy/archive/main.zip#training-data-dedup-leakage-guard
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
