stratified-quota-sampling
CommunityBalance long-tail datasets without losing rarity.
System Documentation
What problem does it solve?
It solves the problem of imbalanced, long-tailed datasets where naive sampling either floods frequent classes or starves rare classes, leading to distorted training and weak coverage.
Core Features & Use Cases
- Box-Cox normalization to reshape raw per-class counts toward near-normality.
- Tier binning at ±1σ to create LOW/MID/HIGH strata from the data’s empirical distribution.
- Fibonacci-style quota allocation (5/8/13) to scale selection sublinearly for head classes while preserving rare-class presence.
- Relevance-weighted sampling within each class/tier to prefer higher-quality items while still maintaining coverage.
Use it when you need balanced representation under a fixed budget—such as text corpora by document length, image datasets by class frequency, sparse entity coverage in hierarchical corpora, or any setting where rare categories must remain visible.
Quick Start
Use stratified-quota-sampling to pick a balanced subset by passing your {class_name: [items]} map, setting the item relevance field as weight_key, and choosing a reproducible seed for deterministic draws.
Dependency Matrix
Required Modules
None requiredComponents
Standard package💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: stratified-quota-sampling Download link: https://github.com/thistleknot/skills/archive/main.zip#stratified-quota-sampling Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.