cutlass

Name: cutlass
Author: jstzwj

Community

High-performance GPU kernels for linear algebra and AI workloads.

Software Engineering #gpu #high performance #cuda #matrix multiplication #tensor cores #accelerated computing

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

This Skill provides access to NVIDIA's CUTLASS library, enabling the deployment of optimized GPU kernels for matrix multiplication and tensor operations critical in AI and scientific computing.

Core Features & Use Cases

Accelerated GEMM and Tensor Operations: Implements high-throughput matrix multiplications leveraging Tensor Cores across architectures like Volta, Turing, Ampere, Hopper, and Blackwell.
Architecture-Aware Optimization: Tailors kernel execution strategies for specific GPU architectures, maximizing performance and resource utilization.
Versatile Data Type Support: Includes FP64, FP32, FP16, BF16, TF32, FP8, INT8, INT4, and complex types, catering to diverse workload precision requirements.
Example Use Case: Speed up training of large language models by using CUTLASS kernels for mixed-precision matrix multiplications.
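
To illustrate why mixed-precision matrix multiplication usually pairs low-precision inputs with a wider accumulator, here is a minimal CPU-only sketch in plain Python. It does not call CUTLASS. The helper names (to_fp16, dot_fp16_accumulate, dot_wide_accumulate) are hypothetical, and Python's 64-bit float stands in for the FP32 accumulator a GPU kernel would typically use.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accumulate(xs, ys):
    """Dot product where every product and partial sum is rounded to FP16."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_wide_accumulate(xs, ys):
    """Dot product with FP16 inputs but a wider accumulator.

    Assumption: Python's 64-bit float stands in for an FP32 accumulator.
    """
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


# 4096 ones: the exact dot product is 4096.
n = 4096
xs = [to_fp16(1.0)] * n
ys = [to_fp16(1.0)] * n

narrow = dot_fp16_accumulate(xs, ys)  # stalls once the sum reaches 2048
wide = dot_wide_accumulate(xs, ys)    # reaches the exact result
```

The FP16 accumulator stops growing at 2048 because FP16 carries an 11-bit significand, so 2048 + 1 rounds back to 2048. The wider accumulator returns the exact sum, which is the behavior mixed-precision GEMM kernels rely on.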

Quick Start

To run a GEMM operation through the CUTLASS library's Python API or C++ interface, provide pointers to the input and output matrices together with their dimensions, and specify the data types and target architecture identifier as needed.
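
As a reference for what those pointers and dimensions mean, here is a minimal CPU-only sketch of the computation a GEMM performs, D = alpha * (A x B) + beta * C, on flat row-major buffers. This is not the CUTLASS API. The function name gemm_reference and its argument order are assumptions for illustration, loosely following the common BLAS-style convention of passing dimensions and leading dimensions (row strides) alongside each buffer.

```python
def gemm_reference(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Compute D = alpha * (A @ B) + beta * C on flat row-major buffers.

    a holds an m x k matrix with row stride lda,
    b holds a k x n matrix with row stride ldb,
    c holds an m x n matrix with row stride ldc.
    Returns D as a new flat row-major list of length m * n.
    """
    d = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * lda + p] * b[p * ldb + j]
            d[i * n + j] = alpha * acc + beta * c[i * ldc + j]
    return d


# A is 2 x 3, B is 3 x 2, C is 2 x 2 (all row-major, tightly packed).
A = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]
B = [7.0, 8.0,
     9.0, 10.0,
     11.0, 12.0]
C = [1.0, 1.0,
     1.0, 1.0]

D = gemm_reference(2, 2, 3, 1.0, A, 3, B, 2, 0.5, C, 2)
# D == [58.5, 64.5, 139.5, 154.5]
```

A reference like this is also useful for checking the output of an optimized GPU kernel on small inputs before scaling up.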
