Name: pyspark_etl
Author: ivanshamaev

System Documentation

What problem does it solve?

This Skill helps you design, implement, and review reliable PySpark ETL pipelines that perform well at large scale while keeping schemas stable and results correct under real-world data issues such as nulls, duplicates, and schema changes.

Core Features & Use Cases

Production ETL design: Define schema contracts, incremental semantics, partitioning, and write modes with explicit, maintainable DataFrame/Spark SQL transformations.
Performance optimization: Reduce shuffles and wide transformations, avoid common bottlenecks like small files, shuffle skew, and unnecessary repartitioning, and use Spark plan diagnostics (explain) to guide improvements.
Correctness & robustness: Handle joins, deduplication semantics, window frames, nulls, schema evolution boundaries, and targeted testing for transformation logic.
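As a rough illustration of the "schema contract" idea above, the sketch below compares an expected list of (column, type) pairs against what a DataFrame actually reports through `df.dtypes`. The column names, the types, and the helper names `schema_drift` and `assert_schema` are hypothetical and are not taken from the Skill itself.

```python
from typing import Dict, List, Tuple

# Hypothetical contract for illustration; column names and types are assumptions.
# Type strings follow the form PySpark reports in DataFrame.dtypes.
EXPECTED_SCHEMA: List[Tuple[str, str]] = [
    ("order_id", "bigint"),
    ("customer_id", "bigint"),
    ("amount", "decimal(18,2)"),
    ("updated_at", "timestamp"),
]


def schema_drift(
    expected: List[Tuple[str, str]], actual: List[Tuple[str, str]]
) -> Dict[str, list]:
    """Return missing columns, unexpected columns, and type mismatches."""
    expected_map = dict(expected)
    actual_map = dict(actual)
    missing = sorted(c for c in expected_map if c not in actual_map)
    unexpected = sorted(c for c in actual_map if c not in expected_map)
    mismatched = sorted(
        (c, expected_map[c], actual_map[c])
        for c in expected_map
        if c in actual_map and expected_map[c] != actual_map[c]
    )
    return {"missing": missing, "unexpected": unexpected, "type_mismatch": mismatched}


def assert_schema(df, expected: List[Tuple[str, str]] = EXPECTED_SCHEMA):
    """Fail fast if a DataFrame-like object violates the contract.

    `df` only needs a `dtypes` attribute shaped like PySpark's
    DataFrame.dtypes: a list of (column_name, type_string) pairs.
    """
    drift = schema_drift(expected, df.dtypes)
    if any(drift.values()):
        raise ValueError(f"Schema contract violated: {drift}")
    return df
```

Calling `assert_schema(df)` right after a read and again before a write is one way to surface schema drift at a stage boundary rather than in a downstream consumer. Whether extra columns should fail the check or merely be logged is a design choice this sketch does not settle.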

Quick Start

Use the pyspark_etl skill to review your PySpark DataFrame ETL code for join correctness, partitioning strategy, window definitions, and performance risks at GB–TB scale.
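One kind of partitioning issue such a review might flag is the small-files problem. The sketch below derives an output partition count from an estimated data size and a target file size. The 128 MiB target and the helper names `target_partitions` and `write_compacted` are assumptions made for illustration, not values or functions defined by the Skill.

```python
import math

# Assumed target output file size; tune for your storage and readers.
DEFAULT_TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_partitions(
    total_bytes: int, target_file_bytes: int = DEFAULT_TARGET_FILE_BYTES
) -> int:
    """Estimate how many output partitions give files near the target size."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    # Always write at least one partition, even for empty or tiny inputs.
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_compacted(df, path: str, total_bytes: int) -> int:
    """Repartition a PySpark DataFrame before writing Parquet.

    `total_bytes` is a caller-supplied estimate of the output size.
    Note that repartition() triggers a full shuffle.
    """
    n = target_partitions(total_bytes)
    df.repartition(n).write.mode("overwrite").parquet(path)
    return n
```

Because `repartition` shuffles the whole dataset, `coalesce` may be the cheaper option when the goal is only to reduce the partition count. Which one fits depends on how evenly the data is already distributed.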

Please help me install this Skill:
Name: pyspark_etl
Download link: https://github.com/ivanshamaev/de-agent-skills/archive/main.zip#pyspark-etl
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
