Write CUDA Reduction Kernel
Community
Design fast, correct CUDA reductions for GPUs.
Author: KrxGu
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This skill provides a thorough, reusable approach to implementing correct and efficient CUDA reduction kernels, handling warp-level primitives, block-level reductions, and multi-block strategies to scale to large inputs.
Core Features & Use Cases
- Warp-level shuffle reductions with explicit masks and non-full-warp handling
- Block-level reductions using shared memory with correct synchronization
- Two-pass vs atomic strategies for large inputs and reproducible results
- Batched and axis-based reductions with proper memory access patterns
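The first feature above, a shuffle reduction with an explicit mask that still works when the warp is not full, might look roughly like the sketch below. The names `warp_reduce_sum`, `warp_sum_kernel`, and `warp_sum` are hypothetical and are not taken from the skill itself. The sketch assumes a single block of at most 32 threads and omits CUDA error checking for brevity.

```cuda
#include <cuda_runtime.h>

// Sum `val` over lanes [0, active) of one warp; lane 0 ends up with the total.
// `mask` must name exactly the lanes that execute this call.
__device__ float warp_reduce_sum(float val, unsigned mask, int active) {
    const int lane = threadIdx.x & 31;
    for (int offset = 16; offset > 0; offset >>= 1) {
        float other = __shfl_down_sync(mask, val, offset);
        // Ignore values that would come from lanes outside the active range.
        if (lane + offset < active) val += other;
    }
    return val;
}

// Hypothetical demo kernel: one block of n <= 32 threads, one element each.
__global__ void warp_sum_kernel(const float* in, float* out, int n) {
    // Build the mask explicitly instead of assuming a full warp.
    const unsigned mask = (n >= 32) ? 0xffffffffu : ((1u << n) - 1u);
    float total = warp_reduce_sum(in[threadIdx.x], mask, n);
    if (threadIdx.x == 0) *out = total;
}

// Host helper (assumption: 1 <= n <= 32; returns 0 otherwise).
float warp_sum(const float* h_in, int n) {
    if (n <= 0 || n > 32) return 0.0f;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    warp_sum_kernel<<<1, n>>>(d_in, d_out, n);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The guard `lane + offset < active` matters because a value shuffled in from a lane outside the mask is undefined. When the block size is a multiple of 32, a common alternative is to have out-of-range threads contribute 0 and use a full mask.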
Quick Start
Implement a CUDA reduction kernel that reduces a 1D array of floats to a single value using a 256-thread block and a two-pass strategy.
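As a rough idea of what such a request could produce, here is a minimal two-pass sum sketch with 256-thread blocks. It is an illustration only, not the skill's actual output. The names `kBlock`, `reduce_sum_kernel`, and `reduce_sum`, and the cap of 1024 blocks, are assumptions. Error checking is omitted, and `n` is assumed to fit comfortably in an `int`.

```cuda
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // threads per block, as in the prompt above

// Full-warp sum; valid here because kBlock is a multiple of 32.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Each block writes one partial sum to out[blockIdx.x].
__global__ void reduce_sum_kernel(const float* in, float* out, int n) {
    __shared__ float warp_sums[kBlock / 32];
    float acc = 0.0f;
    // Grid-stride loop: consecutive threads read consecutive elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];

    acc = warp_reduce_sum(acc);
    const int lane = threadIdx.x & 31;
    const int warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = acc;
    __syncthreads();  // all warp sums must be visible before warp 0 reads them

    if (warp == 0) {
        acc = (lane < kBlock / 32) ? warp_sums[lane] : 0.0f;
        acc = warp_reduce_sum(acc);
        if (lane == 0) out[blockIdx.x] = acc;
    }
}

// Host helper: pass 1 produces per-block partials, pass 2 reduces them.
float reduce_sum(const float* h_in, int n) {
    if (n <= 0) return 0.0f;
    int blocks = (n + kBlock - 1) / kBlock;
    if (blocks > 1024) blocks = 1024;  // assumed cap; grid-stride covers the rest
    float *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    reduce_sum_kernel<<<blocks, kBlock>>>(d_in, d_partial, n);
    reduce_sum_kernel<<<1, kBlock>>>(d_partial, d_out, blocks);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

The second launch reuses the same kernel with a single block, so the summation order is fixed for a given `n` and launch configuration. A single-pass variant that combines block results with `atomicAdd` saves a launch, but the order of floating-point additions then varies between runs, which is why the two-pass form is the usual choice when reproducible results matter.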
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: Write CUDA Reduction Kernel
Download link: https://github.com/KrxGu/kernel-skills/archive/main.zip#write-cuda-reduction-kernel
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.