dataset-cleaning
Community
Clean and normalize scraped skill records
Category: Data & Analytics
Tags: #indexing #information retrieval #data validation #normalization #deduplication #dataset cleaning #scraped data
Author: zhang-ming-hui
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Scraped skill records often contain duplicates, malformed fields, and inconsistently normalized values, all of which degrade indexing quality and retrieval reliability.
Core Features & Use Cases
- Deterministic record normalization: Applies repeatable rules to standardize formats while preserving the underlying meaning.
- Deduplication and validation for indexing compatibility: Reduces duplicate/malformed entries and ensures the cleaned dataset still matches the index-ready record shape.
- Checkpoint-friendly repair workflow: Uses crawl checkpoints and sample datasets to safely improve quality without losing traceability.
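The first two features can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the field names (`name`, `description`, `url`, `tags`), the required-field list, and the deduplication key are all assumptions, since the listing does not document the record schema.

```python
import re
import unicodedata

# Hypothetical index-ready shape; the real schema is not given in this listing.
REQUIRED_FIELDS = ("name", "description", "url")


def normalize_record(record):
    """Return a copy with Unicode form, whitespace, and tag casing standardized."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            # NFKC folds compatibility characters (e.g. full-width letters).
            value = unicodedata.normalize("NFKC", value)
            # Collapse runs of whitespace and trim the ends.
            value = re.sub(r"\s+", " ", value).strip()
        cleaned[key] = value
    tags = cleaned.get("tags") or []
    if isinstance(tags, str):
        # Assumption: scraped tags may arrive as one "#"-separated string.
        tags = tags.split("#")
    cleaned["tags"] = sorted(
        {t.strip().lower() for t in tags if isinstance(t, str) and t.strip()}
    )
    return cleaned


def is_valid(record):
    """A record is index-ready if every required field is a non-empty string."""
    return all(isinstance(record.get(f), str) and record[f] for f in REQUIRED_FIELDS)


def clean(records):
    """Normalize, validate, and deduplicate; returns (kept, rejected)."""
    seen = set()
    kept, rejected = [], []
    for raw in records:
        rec = normalize_record(raw)
        if not is_valid(rec):
            rejected.append(raw)  # keep the raw record for traceability
            continue
        # Assumed identity key: case-insensitive name plus URL without a trailing slash.
        key = (rec["name"].casefold(), rec["url"].rstrip("/"))
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept, rejected
```

Because every rule is a pure function of the input record, running `clean` twice on the same data yields the same output, which is what "deterministic" means here. Rejected records are returned rather than discarded so they can be inspected or repaired later.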
Quick Start
Use the dataset-cleaning skill to clean skills_data_500.json/csv using skills_checkpoint.json and produce index-compatible cleaned output with consistently normalized fields.
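For readers who want to inspect the input files by hand, the following sketch loads either format into a list of dictionaries and writes a cleaned list back out as JSON. It assumes the JSON file holds a top-level list of record objects and the CSV file has a header row; neither layout is confirmed by this listing. The helper names `load_records` and `write_records` are hypothetical, and the structure of skills_checkpoint.json is not documented here, so it is left out.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load records from a .json or .csv file into a list of dicts."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumption: the file is a top-level JSON array of objects.
        if not isinstance(data, list):
            raise ValueError(f"{path} does not contain a top-level list")
        return data
    if suffix == ".csv":
        # newline="" lets the csv module handle embedded line breaks correctly.
        with path.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))
    raise ValueError(f"unsupported file type: {suffix}")


def write_records(records, path):
    """Write records as pretty-printed UTF-8 JSON."""
    with Path(path).open("w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

Writing the cleaned output to a new file, rather than overwriting the input, keeps the original scrape available for comparison.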
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: dataset-cleaning
Download link: https://github.com/zhang-ming-hui/ackownledge/archive/main.zip#dataset-cleaning
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.