pyspark-structured-streaming

Community

Build reliable real-time pipelines with PySpark

Author: ivanshamaev
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill helps you design and operate PySpark Structured Streaming pipelines for reliable real-time processing, including correct handling of event time, late data, stateful aggregations, and durable restarts.

Core Features & Use Cases

  • End-to-end streaming pipeline setup: configure a streaming job in PySpark for Kafka/file/rate sources and run it as a managed streaming query.
  • Event-time correctness: define watermarks and windowing (tumbling/sliding/session) to handle late events and bound state.
  • Production-grade reliability patterns: apply checkpointing, deduplication, foreachBatch for custom sinks, and RocksDB state store tuning. Also covers Kafka source/sink configuration, stream-stream joins with watermarks, and fault tolerance.
  • Typical use cases: Kafka → Delta/Iceberg upserts into silver-layer tables, near-real-time dashboards using update or append output modes, and debugging/monitoring with streaming query progress metrics.
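
As a rough illustration of the event-time and state store features listed above, here is a minimal sketch that counts events per tumbling window with a watermark and enables the RocksDB state store provider. It is not taken from the Skill itself. The helper names (`windowed_counts`, `run_demo`), the `event_time` column name, the durations, and the checkpoint path are all assumptions chosen for the example. The pyspark imports sit inside the functions so the module can be loaded without a Spark installation.

```python
# Sketch only: helper names, column names, durations, and paths are assumptions.

# Spark 3.2+ ships a RocksDB-backed state store; this setting selects it.
STATE_STORE_CONF = {
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
}


def windowed_counts(events, late_by="10 minutes", window_size="5 minutes"):
    """Count events per tumbling event-time window.

    `events` is a streaming DataFrame with an `event_time` timestamp column
    (assumed name). The watermark lets Spark drop window state once the
    window is older than the maximum observed event time minus `late_by`.
    """
    from pyspark.sql import functions as F  # lazy import: needs pyspark at call time

    return (
        events
        .withWatermark("event_time", late_by)
        .groupBy(F.window("event_time", window_size))
        .count()
    )


def run_demo(checkpoint_dir="/tmp/checkpoints/windowed_counts"):
    """Run the aggregation on the built-in rate source and print to the console."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("windowed-counts-demo")
    for key, value in STATE_STORE_CONF.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # The rate source emits `timestamp` and `value` columns.
    events = (
        spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        .withColumnRenamed("timestamp", "event_time")
    )
    return (
        windowed_counts(events)
        .writeStream
        .outputMode("update")  # emit a window's row whenever its count changes
        .format("console")
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Calling `run_demo().awaitTermination()` on a machine with PySpark installed would start the query. Update mode suits a dashboard-style consumer, while append mode would emit each window only once, after the watermark passes its end.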

Quick Start

Ask the assistant to generate a PySpark Structured Streaming job that reads JSON events from Kafka, applies an event-time watermark with a tumbling window, deduplicates by a natural key, and writes results to a Delta Lake table using foreachBatch with a durable checkpoint location.
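
For orientation, a job matching that request might look roughly like the sketch below. It is an illustration under stated assumptions, not the output the Skill guarantees. The event schema (`event_id`, `event_time`, `user_id`), the helper names `kafka_options` and `build_query`, and the durations are hypothetical. It also assumes the Kafka connector and Delta Lake packages are available on the Spark classpath.

```python
# Sketch only: schema, helper names, durations, and paths are assumptions.

def kafka_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def build_query(spark, options, table_path, checkpoint_dir):
    """Kafka JSON -> watermark + dedup -> tumbling-window counts -> Delta via foreachBatch."""
    from pyspark.sql import functions as F  # lazy import: needs pyspark at call time
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Hypothetical event schema; replace with the real payload layout.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("user_id", StringType()),
    ])

    raw = spark.readStream.format("kafka").options(**options).load()

    events = (
        raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
        .withWatermark("event_time", "10 minutes")
        # Including the event-time column in the key lets Spark expire
        # deduplication state once the watermark passes.
        .dropDuplicates(["event_id", "event_time"])
    )

    counts = events.groupBy(F.window("event_time", "5 minutes"), "user_id").count()

    def write_batch(batch_df, batch_id):
        # Plain append; foreachBatch is at-least-once, so a retried batch
        # could be written twice unless the write is made idempotent.
        batch_df.write.format("delta").mode("append").save(table_path)

    return (
        counts.writeStream
        .outputMode("append")  # each window is emitted once, after the watermark passes it
        .foreachBatch(write_batch)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

A call such as `build_query(spark, kafka_options("localhost:9092", "events"), "/data/silver/event_counts", "/data/checkpoints/event_counts")` would start the query; the broker address, topic, and paths here are placeholders. Keeping the checkpoint directory on durable storage is what allows the query to resume from its last committed offsets after a restart.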

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: pyspark-structured-streaming
Download link: https://github.com/ivanshamaev/de-agent-skills/archive/main.zip#pyspark-structured-streaming

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.