Write CUDA Reduction Kernel

Community

Design fast, correct CUDA reductions for GPUs.

Author: KrxGu
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This skill provides a reusable approach to implementing correct and efficient CUDA reduction kernels. It covers warp-level primitives, block-level reductions, and multi-block strategies that scale to large inputs.

Core Features & Use Cases

  • Warp-level shuffle reductions with explicit masks and non-full-warp handling
  • Block-level reductions using shared memory with correct synchronization
  • Two-pass vs. atomic strategies for large inputs, with attention to reproducible results
  • Batched and axis-based reductions with proper memory access patterns
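
As an illustration of the first feature above, here is a minimal sketch of a warp-level shuffle reduction with an explicit mask. It is not taken from the skill itself. The names warp_reduce_sum, warp_sums_kernel, and warp_sums are hypothetical, the block size of 128 is an arbitrary choice, and runtime error checking is left out for brevity. The sketch assumes the block size is a multiple of 32, so every warp is fully populated and the full mask is valid. A partially filled warp at the tail of the input is handled by letting out-of-range lanes contribute zero rather than by shrinking the mask.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Sum across one warp. Every lane named in `mask` must reach this call.
__device__ float warp_reduce_sum(float val, unsigned mask = 0xffffffffu) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(mask, val, offset);
    }
    return val;  // only lane 0 holds the complete warp total
}

// Hypothetical kernel: each warp sums 32 consecutive input elements.
__global__ void warp_sums_kernel(const float* in, float* out, int n) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Non-full-warp handling: out-of-range lanes still take part in the
    // shuffle and contribute the additive identity.
    float val = (idx < n) ? in[idx] : 0.0f;
    val = warp_reduce_sum(val);
    // Lane 0 writes the result if its warp covers at least one element.
    if (threadIdx.x % warpSize == 0 && idx < n) {
        out[idx / warpSize] = val;
    }
}

// Hypothetical host wrapper: returns one sum per 32-element chunk.
std::vector<float> warp_sums(const std::vector<float>& h_in) {
    const int n = static_cast<int>(h_in.size());
    const int threads = 128;  // assumption: a multiple of the warp size
    assert(threads % 32 == 0);
    const int blocks = (n + threads - 1) / threads;
    const int num_warps = (n + 31) / 32;
    std::vector<float> h_out(num_warps);
    if (n == 0) return h_out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, num_warps * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    warp_sums_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, num_warps * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling warp_sums on 40 ones would yield two chunk sums, one for the full warp and one for the 8-element tail. Padding with the identity keeps all 32 lanes in the shuffle, which avoids reading from a lane that is not named in the mask.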

Quick Start

Implement a CUDA reduction kernel that reduces a 1D array of floats to a single value using a 256-thread block and a two-pass strategy.
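
One possible answer to the prompt above, sketched under stated assumptions: a 256-thread block, a shared-memory tree reduction inside each block, and a second launch of the same kernel with a single block to combine the per-block partial sums. The names reduce_sum_kernel, reduce_sum, and kBlockSize are hypothetical, the cap of 1024 blocks is an arbitrary choice, and runtime error checking is omitted. The skill may well produce something different.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // from the prompt; must be a power of two

// Each block writes one partial sum to out[blockIdx.x]. Launched with a
// single block, the same kernel serves as the final pass.
__global__ void reduce_sum_kernel(const float* in, float* out, int n) {
    __shared__ float smem[kBlockSize];
    const int tid = threadIdx.x;

    // Grid-stride loop: each thread first accumulates a private sum.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x) {
        acc += in[i];
    }
    smem[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory. The barrier sits outside the `if`
    // so that every thread in the block reaches it.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            smem[tid] += smem[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        out[blockIdx.x] = smem[0];
    }
}

// Hypothetical host wrapper implementing the two passes.
float reduce_sum(const std::vector<float>& h_in) {
    const int n = static_cast<int>(h_in.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    if (blocks > 1024) blocks = 1024;  // assumption: arbitrary cap

    float* d_in = nullptr;
    float* d_partial = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Pass 1: one partial sum per block.
    reduce_sum_kernel<<<blocks, kBlockSize>>>(d_in, d_partial, n);
    // Pass 2: a single block reduces the partial sums.
    reduce_sum_kernel<<<1, kBlockSize>>>(d_partial, d_out, blocks);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

Because the summation order here depends only on the launch configuration, repeated runs with the same input and grid should give the same floating-point result. An atomic-based variant would save the second launch but would add the partial sums in an order that can vary between runs.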

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: Write CUDA Reduction Kernel
Download link: https://github.com/KrxGu/kernel-skills/archive/main.zip#write-cuda-reduction-kernel

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
