triton-cuda-memory
Community
Optimize CUDA memory access for faster kernels.
Author: xchang1121
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
CUDA and Triton-CUDA kernels often struggle with memory bandwidth bottlenecks and non-coalesced accesses, leading to wasted cycles and reduced throughput. This guide provides strategies to optimize memory traffic, improve data layout, and lower latency for GPU kernels in memory-bound workloads.
Core Features & Use Cases
- Shared memory utilization to minimize global memory traffic and improve data reuse.
- Coalesced access patterns to maximize memory throughput across warps.
- Advanced layout techniques (grouped ordering, swizzle, and tl.make_block_ptr usage) to boost L2 cache efficiency.
- Use Case: accelerate matrix multiplications, attention-like operations, and large tensor workloads by applying the described memory access optimizations.
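To make the grouped-ordering idea concrete, here is a minimal pure-Python sketch of the index arithmetic commonly used in Triton matmul kernels to remap a linear program id onto output tiles. The function names (`cdiv`, `grouped_pid`) and the parameter names are illustrative assumptions, not taken from this skill's source. In a real kernel the same arithmetic would be applied to `tl.program_id(0)`.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division (the same quantity tl.cdiv computes in Triton)."""
    return (a + b - 1) // b


def grouped_pid(pid: int, M: int, N: int,
                block_m: int, block_n: int, group_m: int) -> tuple[int, int]:
    """Map a linear program id to a (tile_row, tile_col) pair.

    Tiles are visited in groups of `group_m` tile rows, walking down the
    rows of a group before moving to the next column. Programs that run
    close together in time then share the same few rows of A and columns
    of B, which tends to improve L2 reuse.
    """
    num_pid_m = cdiv(M, block_m)            # tile rows in the output
    num_pid_n = cdiv(N, block_n)            # tile columns in the output
    num_pid_in_group = group_m * num_pid_n  # programs per full group
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_m
    # The last group may be shorter when num_pid_m is not a multiple of group_m.
    group_size_m = min(num_pid_m - first_pid_m, group_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size_m
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n


if __name__ == "__main__":
    # 256x256 output, 64x64 tiles, groups of 2 tile rows -> 16 programs.
    order = [grouped_pid(p, 256, 256, 64, 64, 2) for p in range(16)]
    print(order)
```

With these example sizes the first four programs land on tiles (0, 0), (1, 0), (0, 1), (1, 1), so consecutive programs stay inside a two-row band instead of sweeping a full row of the output.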
Quick Start
Evaluate a target kernel to confirm it is memory-bound, then apply shared memory staging, coalesced loads, and data layout optimizations to improve throughput.
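When evaluating a kernel, one quick check is how many memory segments a single warp touches for a given access stride. The sketch below is a deliberately simplified model, not a description of any specific GPU: it assumes 32 lanes per warp, 4-byte elements, and 128-byte aligned segments, and the function name `segments_touched` is hypothetical. Real hardware behavior (sector sizes, caching) differs by architecture, so treat the counts as a rough indicator only.

```python
def segments_touched(stride_elems: int,
                     elem_bytes: int = 4,
                     warp_size: int = 32,
                     segment_bytes: int = 128,
                     base_byte: int = 0) -> int:
    """Count distinct memory segments touched by one warp-wide load.

    Lane i reads the element at byte address
        base_byte + i * stride_elems * elem_bytes.
    Fewer segments means better coalescing under this simplified model.
    Assumes elem_bytes divides segment_bytes and base_byte is
    element-aligned, so no element straddles a segment boundary.
    """
    segments = set()
    for lane in range(warp_size):
        addr = base_byte + lane * stride_elems * elem_bytes
        segments.add(addr // segment_bytes)
    return len(segments)


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(f"stride={stride}: {segments_touched(stride)} segment(s)")
    # A contiguous but misaligned access spills into a second segment.
    print(f"stride=1, base=64: {segments_touched(1, base_byte=64)} segment(s)")
```

Under these assumptions a unit-stride load stays within one segment, while a stride of 32 elements touches a separate segment per lane, which is the pattern the coalescing advice is meant to avoid.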
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: triton-cuda-memory
Download link: https://github.com/xchang1121/AutoResearch-CC-hook/archive/main.zip#triton-cuda-memory
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.