
Evals

Comprehensive AI agent evaluation framework with three grader types (code-based: deterministic; model-based: LLM rubric; human: gold standard) and pass@k / pass^k scoring.

Workflows · 8 | References · 5 | Triggers · 9 | Effort · high

The Problem

When an AI agent breaks, you usually find out after the fact — wrong tool call, bad output, regression from a prompt change you thought was safe. Generic testing frameworks grade final outputs but miss everything that happened in between: which tools were called, in what order, whether the agent took a reasonable path to get there. You also have no principled way to distinguish a capability gap (the agent can't do this yet) from a regression (the agent used to do this and now doesn't). Without that distinction, you're flying blind when upgrading models or iterating on prompts.

How This Skill Approaches It

Evals evaluates agent workflows (transcripts, tool-call sequences, multi-turn conversations), not just outputs. Three grader types cover different verification needs: code-based graders (string_match, regex_match, binary_tests, tool_calls, state_check) run fast and deterministically; model-based graders (llm_rubric, natural_language_assert, pairwise_comparison) handle quality and nuance; human graders calibrate the LLM judges against a gold standard.

pass@k scoring runs each task multiple times so that a single lucky or unlucky run does not decide the result, and pass^k measures consistency across those runs. Capability evals target a ~70% pass rate as a stretch goal; regression evals target ~99% as a quality gate.

TrialRunner.ts handles multi-trial execution, SuiteManager.ts manages eval suites and saturation checks, FailureToTask.ts converts real failures into test cases, and AlgorithmBridge.ts wires eval results directly into Algorithm ISC rows for automated verification. Pre-configured domain patterns (coding, conversational, research, computer-use) give you a grader stack to start from.

  • Evaluates agent transcripts, tool-call sequences, and multi-turn conversations — not just single outputs
  • Capability evals (~70% target) and regression evals (~99% target)
  • Workflows: RunEval, CompareModels, ComparePrompts, CreateJudge, CreateUseCase, RunScenario, CreateScenario, ViewResults
  • Integrates with ALGORITHM ISC rows for automated verification
  • Domain patterns pre-configured for coding, conversational, research, computer-use agents
Not for scientific method framing (use Science)

In Action

What you say to your DA, and what the Evals skill actually does.

  • You say "run evals on the auth skill after the changes i made"
    Runs RunEval against the existing auth test suite via AlgorithmBridge.ts, executes pass@3 trials, grades with the domain's pre-configured grader stack (binary_tests + tool_calls + llm_rubric), and reports pass rate, any regressions, and updated ISC row status.
  • You say "compare these two prompt versions and tell me which produces better summaries"
    Runs ComparePrompts: creates a grading suite with pairwise_comparison and llm_rubric graders, runs both prompts against the same test cases with position swapping, and reports pass@k scores plus a comparative analysis of where each prompt wins and loses.
  • You say "create a regression test from that auth bypass failure we just fixed"
    Runs CreateUseCase: logs the failure via FailureToTask.ts, defines an unambiguous task with code-based and model-based graders, and adds it to the regression suite so the same failure can never silently reappear.

Inside the Skill

The thinking, frameworks, and architecture that distinguish this skill from a generic version of the same task.

What It Does

Evaluates AI agents — their transcripts, tool-call sequences, and multi-turn conversations, not just single outputs. Three grader types cover it: code-based for deterministic checks, model-based for nuanced quality rubrics, and human for the gold standard. Scores with pass@k (capability) and pass^k (consistency). Splits evals into capability suites (~70% target) and regression suites (~99% target), and plugs into Algorithm ISC rows as a verification method.

The Problem

You can't tell if an agent got better or worse by eyeballing a few runs. A change that looks fine in one transcript may quietly regress on the next prompt, and a single run gives no statistical signal. Judging only the final output also misses how the agent got there — wrong tools, wrong order, lucky guess. This skill measures the whole workflow across repeated trials, so improvements and backsliding both show up as numbers you can gate on.

How It Works

Evals is an agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026). It evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.

When to Activate

  • "run evals", "test this agent", "evaluate", "check quality", "benchmark"
  • "regression test", "capability test"
  • "run scenario", "multi-turn eval", "simulated user test"
  • "create scenario", "simulate conversation"
  • Compare agent behaviors across changes
  • Validate agent workflows before deployment
  • Verify ALGORITHM ISC rows
  • Create new evaluation tasks from failures

Core Concepts

Three Grader Types

| Type | Strengths | Weaknesses | Use For |
|------|-----------|------------|---------|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |

Evaluation Types

| Type | Pass Target | Purpose |
|------|-------------|---------|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |

Key Metrics

  • pass@k: Probability of at least 1 success in k trials (measures capability)
  • pass^k: Probability all k trials succeed (measures consistency/reliability)
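
To make the two metrics concrete: under the simplifying assumption that trials are independent with a fixed per-trial success rate p, pass@k = 1 - (1 - p)^k and pass^k = p^k. The sketch below estimates p from observed trial outcomes and computes both. The function names are illustrative and are not taken from TrialRunner.ts.

```typescript
// Illustrative sketch; function names are hypothetical, not the TrialRunner.ts API.
// Assumes trials are independent with a fixed per-trial success rate p.

/** Fraction of trials that passed; 0 for an empty list. */
function successRate(trials: boolean[]): number {
  if (trials.length === 0) return 0;
  return trials.filter(Boolean).length / trials.length;
}

/** pass@k: probability that at least one of k trials succeeds. */
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

/** pass^k: probability that all k trials succeed. */
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

const p = successRate([true, false, true, true]); // 0.75
console.log(passAtK(p, 3).toFixed(3)); // 0.984
console.log(passHatK(p, 3).toFixed(3)); // 0.422
```

At p = 0.75 the agent almost always succeeds at least once in three tries, yet completes all three only about 42% of the time, which is why the two metrics answer different questions.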

Quick Reference

CLI Commands

# Run an eval suite
bun run ${CLAUDE_SKILL_DIR}/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ${CLAUDE_SKILL_DIR}/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ${CLAUDE_SKILL_DIR}/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts list
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts check-saturation <name>
bun run ${CLAUDE_SKILL_DIR}/Tools/SuiteManager.ts graduate <name>

ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

# Run eval and update ISC row
bun run ${CLAUDE_SKILL_DIR}/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u

ISC rows can specify eval verification:

| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
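
One way to read the Verify column is to treat an `eval:` prefix as a pointer to a suite name. The sketch below parses that convention. The function name and the allowed characters in a suite name are assumptions; AlgorithmBridge.ts may resolve these cells differently.

```typescript
// Hypothetical helper; the real AlgorithmBridge.ts may parse Verify cells differently.

/** Returns the suite name from a Verify cell like "eval:auth-security", or null. */
function parseEvalVerify(cell: string): string | null {
  // Assumed naming rule: letters, digits, underscores, and hyphens only.
  const match = /^eval:([A-Za-z0-9_-]+)$/.exec(cell.trim());
  return match?.[1] ?? null;
}

console.log(parseEvalVerify("eval:auth-security")); // auth-security
console.log(parseEvalVerify("manual review")); // null
```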

Available Graders

Code-Based (Fast, Deterministic)

| Grader | Use Case |
|--------|----------|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
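
As an illustration of what a deterministic grader can look like, the sketch below checks that an expected tool sequence appears in order within a transcript's tool calls, allowing other calls in between. Whether the real tool_calls grader tolerates interleaved calls is not stated here, so treat the in-order-subsequence rule and the function name as assumptions.

```typescript
// Hypothetical sketch of a tool_calls-style check, not the skill's actual grader.

/** True if `expected` occurs as an in-order subsequence of `actual`. */
function toolSequenceMatches(actual: string[], expected: string[]): boolean {
  let next = 0; // index of the next expected tool we are waiting for
  for (const tool of actual) {
    if (next < expected.length && tool === expected[next]) next++;
  }
  return next === expected.length;
}

const calls = ["read_file", "grep", "edit_file", "run_tests"];
console.log(toolSequenceMatches(calls, ["read_file", "edit_file", "run_tests"])); // true
console.log(toolSequenceMatches(calls, ["edit_file", "read_file"])); // false
```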

Model-Based (Nuanced)

| Grader | Use Case |
|--------|----------|
| llm_rubric | Score against detailed rubric |
| natural_language_assert | Check assertions are true |
| pairwise_comparison | Compare to reference with position swap |
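
Position swapping guards against a judge that favors whichever answer it sees first. A minimal sketch of the idea follows, assuming the judge can be modeled as a function returning which slot won. The `Judge` type and the rule that disagreement counts as a tie are assumptions, not the skill's documented behavior.

```typescript
// Hypothetical sketch of position swapping; `Judge` stands in for an LLM call.
type Verdict = "first" | "second" | "tie";
type Judge = (first: string, second: string) => Verdict;

/** Runs the judge in both orders; a winner counts only if both orders agree. */
function pairwiseWithSwap(judge: Judge, a: string, b: string): "a" | "b" | "tie" {
  const forward = judge(a, b); // a shown first
  const backward = judge(b, a); // b shown first
  if (forward === "first" && backward === "second") return "a";
  if (forward === "second" && backward === "first") return "b";
  return "tie"; // disagreement suggests position bias, so no winner is declared
}

// Toy judges for demonstration only.
const prefersLonger: Judge = (x, y) =>
  x.length === y.length ? "tie" : x.length > y.length ? "first" : "second";
const alwaysFirst: Judge = () => "first";

console.log(pairwiseWithSwap(prefersLonger, "short", "much longer")); // b
console.log(pairwiseWithSwap(alwaysFirst, "short", "much longer")); // tie
```

The second toy judge always picks the first slot, so its two verdicts contradict each other and the comparison falls back to a tie instead of rewarding position.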

Domain Patterns

Pre-configured grader stacks for common agent types:

| Domain | Primary Graders |
|--------|-----------------|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |

See Data/DomainPatterns.yaml for full configurations.


Task Schema (YAML)

task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30

    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]

    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md

  trials: 3
  pass_threshold: 0.75
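
The schema above does not say how grader weights and pass_threshold combine. One plausible reading, sketched here as an assumption, is that each grader yields a score between 0 and 1, the trial score is the weight-normalized sum, and a trial passes when that score reaches pass_threshold. The `GraderResult` shape and `trialScore` name are hypothetical.

```typescript
// Assumed semantics, not confirmed by the skill: each grader yields a score in
// [0, 1], the trial score is the weight-normalized sum, and the trial passes
// when that score reaches pass_threshold.
interface GraderResult {
  type: string;
  weight: number;
  score: number; // 0..1
}

function trialScore(results: GraderResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0; // no graders, nothing to score
  return results.reduce((sum, r) => sum + r.weight * r.score, 0) / totalWeight;
}

const results: GraderResult[] = [
  { type: "binary_tests", weight: 0.3, score: 1 },
  { type: "tool_calls", weight: 0.2, score: 1 },
  { type: "llm_rubric", weight: 0.5, score: 0.6 },
];
const score = trialScore(results);
console.log(score.toFixed(2), score >= 0.75); // 0.80 true
```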

Resource Index

| Resource | Purpose |
|----------|---------|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Tools/ScenarioRunner.ts | Multi-turn scenario runner (langwatch/scenario) |
| Tools/PAIAgentAdapter.ts | Wraps PAI Inference.ts as scenario AgentAdapter |
| Tools/ScenarioToTranscript.ts | Scenario result → Evals Transcript/Trial/GraderResult |
| Scenarios/ | Authored multi-turn scenarios (.scenario.ts) |
| Data/DomainPatterns.yaml | Domain-specific grader configs |

Key Principles (from Anthropic)

  1. Start with 20-50 real failures - Don't overthink, capture what actually broke
  2. Unambiguous tasks - Two experts should reach identical verdicts
  3. Balanced problem sets - Test both "should do" AND "should NOT do"
  4. Grade outputs, not paths - Don't penalize valid creative solutions
  5. Calibrate LLM judges - Against human expert judgment
  6. Check transcripts regularly - Verify graders work correctly
  7. Monitor saturation - Graduate to regression when hitting 95%+
  8. Build infrastructure early - Evals shape how quickly you can adopt new models
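
Principle 7 can be sketched as a simple check over a suite's recent pass rates. The 95% threshold comes from the list above; requiring it across the last three runs is an assumption added here to avoid graduating on one good run, and SuiteManager.ts check-saturation may use a different rule.

```typescript
// Hypothetical sketch; SuiteManager.ts check-saturation may use a different rule.
// Assumption: a suite is saturated when its last `window` runs all reach the threshold.
function isSaturated(passRates: number[], window = 3, threshold = 0.95): boolean {
  if (passRates.length < window) return false; // not enough history yet
  return passRates.slice(-window).every((rate) => rate >= threshold);
}

console.log(isSaturated([0.7, 0.96, 0.97, 0.98])); // true
console.log(isSaturated([0.96, 0.97, 0.9])); // false
```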

Related

  • ALGORITHM: Evals is a verification method
  • Science: Evals implements the scientific method
  • Browser: For visual verification graders

Gotchas

  • Choose the right grader type: Code-based for deterministic checks (fast, cheap). Model-based for nuanced quality (flexible, expensive). Human for calibration (gold standard, slow).
  • pass@k scoring requires multiple runs. A single run cannot separate a reliable pass from a lucky one. Default to at least pass@3.
  • Transcript capture must be enabled BEFORE the test run. Can't retroactively capture transcripts.
  • Eval results go to the current work directory — not a global location. Tie evals to the work item.
  • Don't evaluate skills with trivial prompts. Simple one-liners may not trigger skill usage. Test prompts must be substantive.

Examples

Example 1: Compare two prompts

User: "evaluate which prompt produces better summaries"
→ Creates eval suite with 3+ test cases
→ Runs both prompts against test cases
→ Model-based grader scores quality
→ Reports pass@k and comparative analysis

Example 2: Regression test a skill change

User: "run evals on the Research skill after the update"
→ Uses existing test fixtures for Research
→ Before/after comparison
→ Reports any quality regressions

Workflows · 8

  1. CompareModels: Workflows/CompareModels.md
  2. ComparePrompts: Workflows/ComparePrompts.md
  3. CreateJudge: Workflows/CreateJudge.md
  4. CreateScenario: Workflows/CreateScenario.md
  5. CreateUseCase: Workflows/CreateUseCase.md
  6. RunEval: Workflows/RunEval.md
  7. RunScenario: Workflows/RunScenario.md
  8. ViewResults: Workflows/ViewResults.md

How to Invoke

Say any of these to your DA and PAI activates the Evals skill automatically:

  • "eval"
  • "evaluate"
  • "benchmark"
  • "regression test"
  • "compare models"
  • "create judge"
  • "test agent"
  • "pass@k"
  • "scenario simulation"

Or invoke explicitly:

Skill("Evals")

References · 5

Auxiliary files the skill loads at runtime — frameworks, guides, configs.

  • BestPractices
  • PROJECT
  • ScienceMapping
  • ScorerTypes
  • TemplateIntegration
