pyspark_etl

Community

Build fast, production-grade PySpark ETL.

Author: ivanshamaev
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill helps you design, implement, and review PySpark ETL pipelines that perform well at large scale, keep schemas stable, and produce correct results despite real-world data issues.

Core Features & Use Cases

  • Production ETL design: Define schema contracts, incremental semantics, partitioning, and write modes with explicit, maintainable DataFrame/Spark SQL transformations.
  • Performance optimization: Reduce shuffles and wide transformations, avoid common bottlenecks like small files, shuffle skew, and unnecessary repartitioning, and use Spark plan diagnostics (explain) to guide improvements.
  • Correctness & robustness: Get joins, deduplication semantics, window frames, null handling, and schema evolution boundaries right, and add targeted tests for transformation logic.
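
As a rough illustration of two items from the list above, here is a minimal sketch of a schema contract check and a "latest row per key" deduplication. The helper names (schema_violations, latest_per_key) and the column names in the usage note are hypothetical and are not taken from the Skill itself; the Spark calls used (Window.partitionBy, row_number, DataFrame.dtypes) are standard PySpark.

```python
from typing import Dict, List, Sequence, Tuple


def schema_violations(
    actual_dtypes: Sequence[Tuple[str, str]],
    contract: Dict[str, str],
) -> List[str]:
    """Compare DataFrame.dtypes output against a {column: type} contract.

    Hypothetical helper: returns one message per missing or mistyped column.
    Extra columns in the DataFrame are ignored.
    """
    actual = dict(actual_dtypes)
    problems = []
    for name, expected in contract.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != expected:
            problems.append(f"{name}: expected {expected}, got {actual[name]}")
    return problems


def latest_per_key(df, keys, order_col):
    """Keep one row per key: the row with the greatest order_col.

    Hypothetical helper. Ties on order_col are broken arbitrarily, so add
    further sort columns if the result must be deterministic.
    """
    # Imported here so the pure-Python check above can be unit-tested
    # without a Spark runtime.
    from pyspark.sql import Window, functions as F

    w = Window.partitionBy(*keys).orderBy(F.col(order_col).desc())
    return (
        df.withColumn("_rn", F.row_number().over(w))
        .filter(F.col("_rn") == 1)
        .drop("_rn")
    )
```

With a DataFrame df, one might call schema_violations(df.dtypes, {"order_id": "bigint", "updated_at": "timestamp"}) before writing, and latest_per_key(df, ["order_id"], "updated_at").explain() to inspect the shuffle that the window introduces. Keeping the contract check free of Spark imports is a design choice that lets it run in plain unit tests.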

Quick Start

Use the pyspark_etl skill to review your PySpark DataFrame ETL code for join correctness, partitioning strategy, window definitions, and performance risks at GB–TB scale.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: pyspark_etl
Download link: https://github.com/ivanshamaev/de-agent-skills/archive/main.zip#pyspark-etl

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
