pyspark_etl
Community
Build fast, production-grade PySpark ETL.
Category: Software Engineering
Tags: query optimization, partitioning, etl, pyspark, dataframe, window functions, spark performance
Author: ivanshamaev
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill helps you design, implement, and review reliable PySpark ETL pipelines that perform well at large scale while keeping schemas and results correct in the face of real-world data issues.
Core Features & Use Cases
- Production ETL design: Define schema contracts, incremental semantics, partitioning, and write modes with explicit, maintainable DataFrame/Spark SQL transformations.
- Performance optimization: Reduce shuffles and wide transformations, avoid common bottlenecks like small files, shuffle skew, and unnecessary repartitioning, and use Spark plan diagnostics (explain) to guide improvements.
- Correctness & robustness: Handle joins, deduplication semantics, window frames, nulls, and schema evolution boundaries, and add targeted tests for transformation logic.
Quick Start
Use the pyspark_etl skill to review your PySpark DataFrame ETL code for join correctness, partitioning strategy, window definitions, and performance risks at GB–TB scale.
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill: Name: pyspark_etl Download link: https://github.com/ivanshamaev/de-agent-skills/archive/main.zip#pyspark-etl Please download this .zip file, extract it, and install it in the .claude/skills/ directory.