diffusion-kernel
Community
Optimize diffusion model kernels
Software Engineering
#performance optimization #profiling #diffusion models #cuda #triton #gpu kernels #jit compilation
Author: rayleizhu
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill provides tools and guidance to optimize the performance of diffusion model kernels, reducing inference latency and improving throughput.
Core Features & Use Cases
- Kernel Development: Guides for writing custom CUDA and Triton kernels for diffusion models.
- Performance Analysis: Tools for profiling and benchmarking kernel performance using Nsight Compute and Triton.
- Optimization Strategies: Best practices for memory access, occupancy, and precision tuning on various GPU architectures (H100, A100, T4).
- Use Case: A developer wants to speed up the RMSNorm operation in a diffusion model. They can follow the add-cuda-kernel.md guide to implement an optimized JIT-compiled CUDA kernel, test its correctness, benchmark it against PyTorch, and profile it with Nsight Compute to check that it saturates GPU memory bandwidth.
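The correctness and bandwidth checks in the use case above can be sketched in plain Python. This is a minimal illustration, not code from the Skill: the helper names (`rmsnorm_reference`, `allclose`, `rmsnorm_bytes_moved`, `achieved_bandwidth_gbs`) are hypothetical, the traffic model assumes the kernel reads the input once, reads the weight once, and writes the output once, and the peak bandwidth must be supplied by the reader from their GPU's datasheet.

```python
import math


def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm for one row: y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def allclose(actual, expected, rtol=1e-3, atol=1e-5):
    """Elementwise tolerance check, in the spirit of torch.allclose."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


def rmsnorm_bytes_moved(rows, hidden, bytes_per_elem=2):
    """Minimum memory traffic (assumed model): read x, write y, read weight once."""
    return (2 * rows * hidden + hidden) * bytes_per_elem


def achieved_bandwidth_gbs(rows, hidden, elapsed_ms, bytes_per_elem=2):
    """Effective bandwidth in GB/s from a measured kernel time in milliseconds."""
    seconds = elapsed_ms * 1e-3
    return rmsnorm_bytes_moved(rows, hidden, bytes_per_elem) / seconds / 1e9


if __name__ == "__main__":
    # Correctness: compare a (hypothetical) kernel output against the reference.
    x, w = [3.0, 4.0], [1.0, 1.0]
    expected = rmsnorm_reference(x, w)
    kernel_output = list(expected)  # stand-in for the custom kernel's result
    print("matches reference:", allclose(kernel_output, expected))

    # Bandwidth: 4096 x 4096 fp16 tensor, assumed measured time of 0.25 ms.
    gbs = achieved_bandwidth_gbs(rows=4096, hidden=4096, elapsed_ms=0.25)
    peak_gbs = 320.0  # assumption: nominal T4 memory bandwidth; check your datasheet
    print(f"achieved {gbs:.1f} GB/s, {gbs / peak_gbs:.0%} of assumed peak")
```

RMSNorm is memory-bound, so the ratio of achieved to peak bandwidth is a reasonable first-pass measure of how much headroom remains. Nsight Compute reports measured DRAM throughput directly and should be treated as the authoritative figure.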
Quick Start
Use the diffusion-kernel skill to add a new Triton kernel for fused elementwise operations.
Dependency Matrix
Required Modules
None required
Components
references, scripts
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: diffusion-kernel Download link: https://github.com/rayleizhu/sglang/archive/main.zip#diffusion-kernel Please download this .zip file, extract it, and install it in the .claude/skills/ directory.