Open source evaluator for Agent Skills

Evaluate agentskills.io-style skills from a CLI, SDK, or CI workflow.

Run skill-aware and baseline model calls, grade outputs with a judge model, write portable artifacts, and publish static reports without depending on a larger benchmark app.

npm install agent-skills-eval
npx agent-skills-eval --config agent-skills-eval.yaml

Quickstart

Install

npm install agent-skills-eval

Run

OPENAI_BASE_URL=https://api.openai.com/v1 \
OPENAI_API_KEY=... \
npx agent-skills-eval ./skills \
  --target gpt-4o-mini \
  --judge gpt-4o-mini \
  --baseline \
  --strict

YAML Config

Use config files for repeatable local and CI runs. CLI flags override config values.

root: ./skills
workspace: ./agent-skills-workspace
baseline: true
target: gpt-4o-mini
judge: gpt-4o-mini
baseUrl: https://api.openai.com/v1
apiKeyEnv: OPENAI_API_KEY
concurrency: 4
layout: iteration
strict: true
report:
  enabled: true
  title: Agent Skills Report
logging:
  format: pretty
  verbose: false
targetParams:
  temperature: 0
judgeParams:
  temperature: 0
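The same config file can drive a per-PR CI run. A minimal GitHub Actions sketch, assuming a workflow checked into the repo; the workflow name, secret name, and action versions below are illustrative assumptions, not part of this package:

```yaml
# .github/workflows/skill-eval.yml (hypothetical; adjust names and versions)
name: skill-eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install agent-skills-eval
      - run: npx agent-skills-eval --config agent-skills-eval.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Because CLI flags override config values, one-off overrides (for example --strict) can be added to the npx step without editing the YAML.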

SDK Usage

import {
  OpenAICompatibleProvider,
  consoleReporter,
  evaluateSkills,
} from "agent-skills-eval";

const provider = new OpenAICompatibleProvider({
  baseUrl: "https://api.openai.com/v1",
  apiKey: process.env.OPENAI_API_KEY!,
  model: "gpt-4o-mini",
  providerName: "openai",
});

const result = await evaluateSkills({
  root: "./skills",
  workspace: "./agent-skills-workspace",
  baseline: true,
  workspaceLayout: "iteration",
  strict: true,
  target: { model: provider.model, provider },
  judge: { model: provider.model, provider },
  onEvent: consoleReporter(),
});
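consoleReporter() is the only reporter shown above, but onEvent is just a callback. A minimal sketch of a JSON-lines reporter for CI logs; the event payload shape is not documented here, so it is treated as opaque:

```typescript
// Hypothetical alternative to consoleReporter(): emit one JSON line per event.
// The event shape is an assumption, so it is handled as `unknown`.
function jsonlReporter(): (event: unknown) => void {
  return (event: unknown) => {
    console.log(JSON.stringify(event));
  };
}

// Usage: onEvent: jsonlReporter(),
```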

Skill Format

SKILL.md

---
name: csv-analyzer
description: Analyze CSV files.
license: MIT
---

Identify trends and cite the relevant rows.

evals/evals.json

{
  "evals": [
    {
      "id": "basic",
      "prompt": "Find the highest revenue month.",
      "files": ["evals/files/revenue.csv"],
      "assertions": [
        "The answer names February."
      ]
    }
  ]
}
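The evals.json shape above is small enough to validate before kicking off a run. A hedged sketch of a loader check (hypothetical helper, not part of the package; it mirrors only the fields shown in this example, and the package may accept more):

```typescript
interface EvalCase {
  id: string;
  prompt: string;
  files?: string[];
  assertions: string[];
}

interface EvalFile {
  evals: EvalCase[];
}

// Validate a parsed evals.json object against the fields shown above.
function parseEvalFile(data: unknown): EvalFile {
  const obj = data as { evals?: unknown };
  if (typeof data !== "object" || data === null || !Array.isArray(obj.evals)) {
    throw new Error("evals.json must contain an `evals` array");
  }
  for (const e of obj.evals as EvalCase[]) {
    if (typeof e.id !== "string") throw new Error("each eval needs a string `id`");
    if (typeof e.prompt !== "string") throw new Error(`eval ${e.id}: missing string prompt`);
    if (!Array.isArray(e.assertions) || e.assertions.length === 0) {
      throw new Error(`eval ${e.id}: assertions must be a non-empty array`);
    }
  }
  return data as EvalFile;
}
```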

Artifacts And Reports

Each run writes the iteration-N workspace layout: prompts, outputs, timings, grading results, optional tool-call logs, benchmark data, and a static HTML report.

agent-skills-workspace/
  iteration-1/
    meta.json
    benchmark.json
    eval-basic/
      with_skill/
        prompts.json
        timing.json
        grading.json
        outputs/response.txt
      without_skill/
        ...
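Downstream tooling can address artifacts by composing paths from the layout above. A small helper (hypothetical, not part of the package) that builds the path to a per-eval artifact under the iteration layout:

```typescript
import * as path from "path";

type Variant = "with_skill" | "without_skill";

// Build the path to an artifact file for a given iteration/eval/variant,
// following the iteration-N layout shown above.
function artifactPath(
  workspace: string,
  iteration: number,
  evalId: string,
  variant: Variant,
  file: string,
): string {
  return path.join(workspace, `iteration-${iteration}`, `eval-${evalId}`, variant, file);
}

// e.g. artifactPath("agent-skills-workspace", 1, "basic", "with_skill", "grading.json")
```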