eval-writer

Community

Generate eval suites for agent benchmarks

Author: anukkrit149
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

It helps you turn an evaluation idea into a runnable, repeatable benchmark for agent behavior, complete with test cases and scoring.

Core Features & Use Cases

  • Eval suite scaffolding for deepagentsjs: Creates independent workspace eval packages under evals/ and wires them to the eval harness.
  • Test case design for multiple data sources: Supports inline cases, fixture-based JSON/JSONL, external datasets, and LangSmith dataset examples.
  • Scoring and reporting: Implements trajectory matchers, output comparisons (exact/fuzzy), and optional LLM-as-judge evaluators, and reports results to LangSmith through the vitest reporter.
  • Optional sandbox-backed execution: Enables containerized execution when the benchmark requires running generated code or shell commands.
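To make the test case and scoring features above more concrete, here is a minimal sketch of what JSONL fixture loading, exact/fuzzy output comparison, and a trajectory matcher could look like. Everything in it is an assumption for illustration: the `EvalCase` shape, the function names, the token-overlap definition of "fuzzy", and the tool names are hypothetical. It is not the deepagentsjs, vitest, or LangSmith API, and the skill may generate something quite different.

```typescript
// Hypothetical case shape for illustration; not taken from deepagentsjs or LangSmith.
interface EvalCase {
  id: string;
  input: string;
  expected: string;
  expectedTools?: string[];
}

// Parse JSONL fixture text: one JSON object per non-empty line.
function parseJsonl(text: string): EvalCase[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as EvalCase);
}

// Exact comparison, ignoring leading and trailing whitespace.
function exactMatch(actual: string, expected: string): boolean {
  return actual.trim() === expected.trim();
}

// One possible "fuzzy" comparison: Jaccard overlap of lowercase word tokens,
// returning a score between 0 and 1.
function fuzzyScore(actual: string, expected: string): number {
  const tokens = (s: string): Set<string> =>
    new Set(s.toLowerCase().split(/\W+/).filter((t) => t.length > 0));
  const a = tokens(actual);
  const b = tokens(expected);
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// One possible trajectory matcher: the expected tool names must appear in the
// actual tool-call sequence in the same order, with other calls allowed between.
function trajectoryMatches(actualTools: string[], expectedTools: string[]): boolean {
  let next = 0;
  for (const tool of actualTools) {
    if (next < expectedTools.length && tool === expectedTools[next]) next++;
  }
  return next === expectedTools.length;
}
```

In a generated suite, helpers like these would typically be called from individual test cases, with a threshold on the fuzzy score chosen per benchmark. The in-order subsequence rule is only one reasonable definition of a trajectory match; a stricter suite might require the tool sequence to match exactly.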

Quick Start

Ask the AI to create a new eval suite for your benchmark by specifying the capability to evaluate, the dataset source to use, and the scoring approach you want.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: eval-writer
Download link: https://github.com/anukkrit149/anukkrit-skills/archive/main.zip#eval-writer

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.