Install
npm install agent-skills-eval
Open source evaluator for Agent Skills
Run skill-aware and baseline model calls, grade outputs with a judge model, write portable artifacts, and publish static reports without depending on a larger benchmark app.
npx agent-skills-eval --config agent-skills-eval.yaml
OPENAI_BASE_URL=https://api.openai.com/v1 \
OPENAI_API_KEY=... \
npx agent-skills-eval ./skills \
  --target gpt-4o-mini \
  --judge gpt-4o-mini \
  --baseline \
  --strict
Use config files for repeatable local and CI runs. CLI flags override config values.
root: ./skills
workspace: ./agent-skills-workspace
baseline: true
target: gpt-4o-mini
judge: gpt-4o-mini
baseUrl: https://api.openai.com/v1
apiKeyEnv: OPENAI_API_KEY
concurrency: 4
layout: iteration
strict: true
report:
  enabled: true
  title: Agent Skills Report
logging:
  format: pretty
  verbose: false
targetParams:
  temperature: 0
judgeParams:
  temperature: 0
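The precedence rule stated above (CLI flags override config values) can be pictured as a field-by-field merge. The following is a minimal sketch, not the package's actual resolution logic: `RunOptions` and `resolveOptions` are hypothetical names, and the fields are a subset copied from the YAML keys in the config example.

```typescript
// Hypothetical option shape for illustration only; it mirrors a subset of
// the YAML keys shown above and is not a type exported by agent-skills-eval.
interface RunOptions {
  root: string;
  target: string;
  judge: string;
  baseline: boolean;
  strict: boolean;
  concurrency: number;
}

// Each field takes the CLI value when one was passed, otherwise the config
// file value. `??` only falls back on undefined/null, so an explicit `false`
// from the command line still overrides a `true` in the config.
function resolveOptions(
  fileConfig: RunOptions,
  cliFlags: Partial<RunOptions>,
): RunOptions {
  return {
    root: cliFlags.root ?? fileConfig.root,
    target: cliFlags.target ?? fileConfig.target,
    judge: cliFlags.judge ?? fileConfig.judge,
    baseline: cliFlags.baseline ?? fileConfig.baseline,
    strict: cliFlags.strict ?? fileConfig.strict,
    concurrency: cliFlags.concurrency ?? fileConfig.concurrency,
  };
}

const fromFile: RunOptions = {
  root: "./skills",
  target: "gpt-4o-mini",
  judge: "gpt-4o-mini",
  baseline: true,
  strict: true,
  concurrency: 4,
};

// Only the overridden fields change; everything else comes from the file.
const resolved = resolveOptions(fromFile, { concurrency: 8, strict: false });
console.log(resolved.concurrency, resolved.strict, resolved.target);
// prints: 8 false gpt-4o-mini
```

Keeping the shared defaults in the config file and passing only per-run differences on the command line keeps local and CI runs comparable.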
import {
  OpenAICompatibleProvider,
  consoleReporter,
  evaluateSkills,
} from "agent-skills-eval";

const provider = new OpenAICompatibleProvider({
  baseUrl: "https://api.openai.com/v1",
  apiKey: process.env.OPENAI_API_KEY!,
  model: "gpt-4o-mini",
  providerName: "openai",
});

const result = await evaluateSkills({
  root: "./skills",
  workspace: "./agent-skills-workspace",
  baseline: true,
  workspaceLayout: "iteration",
  strict: true,
  target: { model: provider.model, provider },
  judge: { model: provider.model, provider },
  onEvent: consoleReporter(),
});
---
name: csv-analyzer
description: Analyze CSV files.
license: MIT
---
Identify trends and cite the relevant rows.
{
  "evals": [
    {
      "id": "basic",
      "prompt": "Find the highest revenue month.",
      "files": ["evals/files/revenue.csv"],
      "assertions": [
        "The answer names February."
      ]
    }
  ]
}
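A quick shape check on an evals file before a run can catch typos early. The sketch below infers the shape purely from the example above: `EvalCase`, `EvalsFile`, and `checkEvalsFile` are hypothetical names, and the real schema may accept more fields or make different ones optional (treating `files` as optional is an assumption).

```typescript
// Shape inferred from the single example above; not an official schema.
interface EvalCase {
  id: string;
  prompt: string;
  files?: string[]; // assumed optional
  assertions: string[];
}

interface EvalsFile {
  evals: EvalCase[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Returns human-readable problems; an empty array means the data matches
// the inferred shape.
function checkEvalsFile(data: unknown): string[] {
  const evals = (data as { evals?: unknown } | null)?.evals;
  if (!Array.isArray(evals)) {
    return ['top-level "evals" must be an array'];
  }
  const problems: string[] = [];
  evals.forEach((entry: unknown, index: number) => {
    const { id, prompt, files, assertions } = (entry ?? {}) as Record<string, unknown>;
    const label = typeof id === "string" && id !== "" ? id : `#${index}`;
    if (typeof id !== "string" || id === "") {
      problems.push(`${label}: "id" must be a non-empty string`);
    }
    if (typeof prompt !== "string" || prompt === "") {
      problems.push(`${label}: "prompt" must be a non-empty string`);
    }
    if (files !== undefined && !isStringArray(files)) {
      problems.push(`${label}: "files" must be an array of strings`);
    }
    if (!isStringArray(assertions) || assertions.length === 0) {
      problems.push(`${label}: "assertions" needs at least one string`);
    }
  });
  return problems;
}

const sample: unknown = JSON.parse(`{
  "evals": [
    {
      "id": "basic",
      "prompt": "Find the highest revenue month.",
      "files": ["evals/files/revenue.csv"],
      "assertions": ["The answer names February."]
    }
  ]
}`);

const problems = checkEvalsFile(sample);
if (problems.length === 0) {
  const { evals } = sample as EvalsFile;
  console.log(`${evals.length} eval(s) look well-formed`);
} else {
  console.error(problems.join("\n"));
}
```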
Runs produce the official iteration-N workspace layout with prompts, outputs, timings,
grading, optional tool calls, benchmarks, and a static HTML report.
agent-skills-workspace/
  iteration-1/
    meta.json
    benchmark.json
    eval-basic/
      with_skill/
        prompts.json
        timing.json
        grading.json
        outputs/response.txt
      without_skill/
        ...
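Scripts that post-process a run need to locate these artifacts. As a small sketch, the helper below builds the `with_skill` paths shown in the tree for a given iteration and eval id. `withSkillArtifacts` is a hypothetical name, not part of the package, and it covers only the files listed above; the contents of `without_skill/` are elided in the tree, so they are left out here.

```typescript
// Hypothetical helper: returns the with_skill artifact paths from the tree
// above. Uses "/" separators so the result is the same on every platform.
function withSkillArtifacts(workspace: string, iteration: number, evalId: string) {
  const dir = [workspace, `iteration-${iteration}`, `eval-${evalId}`, "with_skill"].join("/");
  return {
    prompts: `${dir}/prompts.json`,
    timing: `${dir}/timing.json`,
    grading: `${dir}/grading.json`,
    response: `${dir}/outputs/response.txt`,
  };
}

const paths = withSkillArtifacts("agent-skills-workspace", 1, "basic");
console.log(paths.grading);
// prints: agent-skills-workspace/iteration-1/eval-basic/with_skill/grading.json
```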