diffusion-kernel

Community

Optimize diffusion model kernels

Author: rayleizhu
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill provides tools and guidance to optimize the performance of diffusion model kernels, reducing inference latency and improving throughput.

Core Features & Use Cases

  • Kernel Development: Guides for writing custom CUDA and Triton kernels for diffusion models.
  • Performance Analysis: Tools for profiling and benchmarking kernel performance using Nsight Compute and Triton.
  • Optimization Strategies: Best practices for memory access, occupancy, and precision tuning on various GPU architectures (H100, A100, T4).
  • Use Case: A developer wants to speed up the RMSNorm operation in a diffusion model. Following the add-cuda-kernel.md guide, they implement an optimized JIT-compiled CUDA kernel, test its correctness, benchmark it against the PyTorch implementation, and profile it with Nsight Compute to check that it comes close to saturating GPU memory bandwidth.
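
The correctness and bandwidth checks in the use case above can be sketched in plain Python. This is an illustrative sketch only and is not taken from add-cuda-kernel.md: the function names, the default eps, and the traffic model (input read once, output written once, weight read once) are all assumptions. The peak bandwidth of the target GPU is left as a value you supply from the vendor's datasheet.

```python
import math

def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight.

    Hypothetical helper used only to check a custom kernel's output.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def max_abs_error(actual, expected):
    """Largest elementwise absolute difference between two rows."""
    return max(abs(a - e) for a, e in zip(actual, expected))

def achieved_bandwidth_gbs(rows, hidden, bytes_per_elem, seconds):
    """Effective memory bandwidth in GB/s for one RMSNorm call.

    Assumed traffic model: the input is read once, the output is written
    once, and the weight vector is read once.
    """
    bytes_moved = (2 * rows * hidden + hidden) * bytes_per_elem
    return bytes_moved / seconds / 1e9

def bandwidth_utilization(achieved_gbs, peak_gbs):
    """Fraction of the GPU's peak bandwidth reached (peak_gbs is user-supplied)."""
    return achieved_gbs / peak_gbs

if __name__ == "__main__":
    row = [3.0, 4.0]
    ref = rmsnorm_reference(row, [1.0, 1.0], eps=0.0)
    # In a real test, kernel_out would come from the custom kernel.
    kernel_out = list(ref)
    print("max abs error:", max_abs_error(kernel_out, ref))
    # Example timing: a 4096 x 4096 fp16 tensor processed in 100 microseconds.
    print("GB/s:", achieved_bandwidth_gbs(4096, 4096, 2, 1e-4))
```

A kernel whose utilization stays well below 1.0 under this model is usually limited by something other than raw memory bandwidth, which is the kind of question a Nsight Compute profile helps answer.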

Quick Start

Use the diffusion-kernel skill to add a new Triton kernel for fused elementwise operations.

Dependency Matrix

Required Modules

None required

Components

references, scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: diffusion-kernel
Download link: https://github.com/rayleizhu/sglang/archive/main.zip#diffusion-kernel

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
