stratified-quota-sampling

Community

Balance long-tail datasets without losing rarity.

Author: thistleknot
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

It addresses imbalanced, long-tailed datasets, where naive sampling either over-represents frequent classes or starves rare ones, which distorts training and weakens coverage.

Core Features & Use Cases

  • Box-Cox normalization to reshape raw per-class counts toward near-normality.
  • Tier binning at ±1σ to create LOW/MID/HIGH strata from the data’s empirical distribution.
  • Fibonacci-style quota allocation (5/8/13) to scale selection sublinearly for head classes while preserving rare-class presence.
  • Relevance-weighted sampling within each class/tier to prefer higher-quality items while still maintaining coverage.
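
The first three features can be sketched as a small pipeline: transform the per-class counts, bin them at one standard deviation around the mean, and look up a quota per tier. This is a minimal illustration, not the skill's actual code. The function names (box_cox, fit_lambda, assign_tiers, plan_quotas), the grid search for the Box-Cox lambda, and the mapping LOW=5, MID=8, HIGH=13 are all assumptions made for the example; the listing only states that the quotas are 5/8/13.

```python
import math
from statistics import mean, pstdev

# Assumed mapping of tiers to the 5/8/13 quotas (not confirmed by the source).
QUOTAS = {"LOW": 5, "MID": 8, "HIGH": 13}


def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if abs(lam) < 1e-9:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def fit_lambda(counts):
    """Pick lambda on a coarse grid by maximizing the Box-Cox log-likelihood."""
    grid = [i / 10 for i in range(-20, 21)]
    n = len(counts)
    log_sum = sum(math.log(c) for c in counts)

    def loglik(lam):
        var = pstdev([box_cox(c, lam) for c in counts]) ** 2
        if var <= 0:
            return float("-inf")  # degenerate: all counts identical
        return -0.5 * n * math.log(var) + (lam - 1.0) * log_sum

    return max(grid, key=loglik)


def assign_tiers(class_counts):
    """Map each class name to LOW, MID, or HIGH using +/-1 sigma cutoffs."""
    if any(c <= 0 for c in class_counts.values()):
        raise ValueError("Box-Cox requires strictly positive counts")
    names = list(class_counts)
    lam = fit_lambda([class_counts[n] for n in names])
    ys = [box_cox(class_counts[n], lam) for n in names]
    mu, sigma = mean(ys), pstdev(ys)
    tiers = {}
    for name, y in zip(names, ys):
        if y < mu - sigma:
            tiers[name] = "LOW"
        elif y > mu + sigma:
            tiers[name] = "HIGH"
        else:
            tiers[name] = "MID"
    return tiers


def plan_quotas(class_counts):
    """Quota per class, capped at the number of items the class actually has."""
    tiers = assign_tiers(class_counts)
    return {n: min(QUOTAS[tiers[n]], class_counts[n]) for n in class_counts}
```

Because the quota depends only on the tier, a head class with thousands of items receives at most 13 slots in this sketch, while a rare class keeps up to 5, which is one way to read "sublinear" scaling. With SciPy available, scipy.stats.boxcox could replace the grid search.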

Use it when you need balanced representation under a fixed budget, for example text corpora stratified by document length, image datasets stratified by class frequency, sparse entity coverage in hierarchical corpora, or any setting where rare categories must remain visible.

Quick Start

Use stratified-quota-sampling to pick a balanced subset: pass your {class_name: [items]} map, set weight_key to the name of each item's relevance field, and fix a seed so the draws are reproducible.
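
The listing does not show the skill's entry point, so the following is only a sketch of what such a call might look like. The names weighted_sample and sample_by_quota are hypothetical, the per-class quotas are passed in directly, and the weighted draw without replacement uses the Efraimidis-Spirakis key method, which is a swapped-in technique rather than the skill's confirmed approach. Items are assumed to be dicts that carry the relevance value under weight_key.

```python
import random


def weighted_sample(items, k, weight_key, rng):
    """Draw up to k distinct items, favoring larger item[weight_key] values."""
    k = min(k, len(items))
    keyed = []
    for item in items:
        # Clamp missing or non-positive weights to a tiny positive value.
        w = max(float(item.get(weight_key, 0.0)), 1e-12)
        # Efraimidis-Spirakis: the k largest u ** (1 / w) keys form the sample.
        keyed.append((rng.random() ** (1.0 / w), item))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]


def sample_by_quota(class_items, quotas, weight_key, seed):
    """Sample each class up to its quota using one seeded generator."""
    rng = random.Random(seed)
    selected = {}
    # Iterate in sorted order so a given seed always yields the same draws.
    for name in sorted(class_items):
        selected[name] = weighted_sample(
            class_items[name], quotas[name], weight_key, rng
        )
    return selected


# Hypothetical input: two classes with a "score" relevance field.
data = {
    "cats": [{"id": i, "score": 1.0 + i} for i in range(20)],
    "owls": [{"id": i, "score": 0.5} for i in range(3)],
}
subset = sample_by_quota(data, {"cats": 13, "owls": 5}, "score", seed=42)
```

Here the rare class "owls" has fewer items than its quota, so all three are kept, while "cats" is cut to 13 items with higher-scored items more likely to survive.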

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: stratified-quota-sampling
Download link: https://github.com/thistleknot/skills/archive/main.zip#stratified-quota-sampling

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.