mcapex

Community

Accelerate large model training with fused GPU kernels.

Author: dongg622
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill streamlines the training of large language models by providing optimized fused GPU kernels that accelerate normalization, rotary position embedding, and softmax operations.

Core Features & Use Cases

  • Training Acceleration: Provides fused kernels such as FusedLayerNorm, FusedRMSNorm, and FusedDense to speed up LLM training workflows.
  • Framework Compatibility: Supports integration with Megatron-LM and DeepSpeed for large-scale training setups.
  • Use Case: Use this Skill to improve training throughput for large transformer models by replacing native PyTorch layers with their fused-kernel equivalents, reducing per-step latency.

Quick Start

Use the mcApex skill to speed up LLM training by replacing standard normalization and attention components with fused GPU kernels.
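The replacement described above can be sketched roughly as follows. This is a minimal illustration, not the Skill's own script: it assumes mcApex exposes the same `apex.normalization.FusedLayerNorm` import path as NVIDIA Apex, and the helper name `use_fused_layer_norm` is hypothetical. If the import fails, the model is left unchanged.

```python
import torch
from torch import nn

try:
    # Assumption: mcApex keeps NVIDIA Apex's module layout.
    from apex.normalization import FusedLayerNorm
except ImportError:
    FusedLayerNorm = None  # fused kernels unavailable; keep native layers


def use_fused_layer_norm(module: nn.Module) -> int:
    """Recursively swap nn.LayerNorm children for FusedLayerNorm.

    Hypothetical helper. Returns the number of layers replaced
    (0 when the fused implementation could not be imported).
    """
    if FusedLayerNorm is None:
        return 0
    swapped = 0
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            fused = FusedLayerNorm(
                child.normalized_shape,
                eps=child.eps,
                elementwise_affine=child.elementwise_affine,
            )
            if child.elementwise_affine:
                # Carry over the learned parameters, device, and dtype.
                fused = fused.to(device=child.weight.device, dtype=child.weight.dtype)
                with torch.no_grad():
                    fused.weight.copy_(child.weight)
                    if child.bias is not None:
                        fused.bias.copy_(child.bias)
            setattr(module, name, fused)
            swapped += 1
        else:
            swapped += use_fused_layer_norm(child)
    return swapped


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
    print("layers replaced:", use_fused_layer_norm(model))
```

Swapping modules in place keeps the rest of the model definition untouched, so the same training script runs with or without the fused kernels installed.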

Dependency Matrix

Required Modules

mxmaca

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: mcapex
Download link: https://github.com/dongg622/china-ai-chip-skill/archive/main.zip#mcapex

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
