dataset-synthesizer

Community

Generate JSONL fine-tuning datasets from logs

Authorjoleques
Version1.0.0
Installs0

System Documentation

What problem does it solve?

This skill automates the creation of high-quality JSONL datasets for LLM fine-tuning by combining user interaction logs with product documentation, enforcing cleaning rules and data augmentation so models learn correct behaviors rather than noisy error traces.

Core Features & Use Cases

  • Log cleansing and filtering: removes internal agent errors, limits trivial greetings, and refines incomplete or vague responses into technical explanations using documentation.
  • Data augmentation and synthesis: expands sparse logs with synthetic but domain-consistent Q&A derived from documentation to reach the requested sample count.
  • Format and delivery: outputs strict Vertex AI (Gemini) JSONL lines with systemInstruction and contents fields and saves the dataset to the required ./agentAI/fine-tuning/[Title]/dataset/[Title].jsonl path.
  • Correction mode: supports targeted fixes of existing datasets guided by an audit report without redoing augmentation or recreating the entire dataset.

Quick Start

Generate a 300-line Gemini-format JSONL fine-tuning dataset titled MyProduct by merging logs from /path/to/langfuse.jsonl with product documentation in /path/to/docs and save it to ./agentAI/fine-tuning/MyProduct/dataset/MyProduct.jsonl.

Dependency Matrix

Required Modules

langfuse

Components

scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: dataset-synthesizer
Download link: https://github.com/joleques/northstar-ai/archive/main.zip#dataset-synthesizer

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.