Development Standard

Optimize

Autonomous optimization loop: hill-climb any target.

Workflows: 0 | References: 0 | Triggers: 7 | Effort: medium

The Problem

Ask a generic AI to improve your code's performance or your skill's output quality and you get one suggestion, applied once, with no way to know if it actually helped. There's no measurement loop, no systematic hypothesis testing, and no memory of what was tried. You're editing by feel, not by signal. Skills and prompts are even worse — there's no obvious metric to track, so the AI just rewrites things in ways that sound better without any evidence they are.

How This Skill Approaches It

Optimize runs an autonomous hill-climbing loop with two modes. Metric mode is for code targets: you give it a shell command that produces a number (Lighthouse score, bundle bytes, validation loss), a set of files it can modify, and a time budget per experiment. The loop hypothesizes, modifies, measures, keeps improvements, reverts failures, and repeats — roughly 12 experiments per hour. Eval mode is for skills, prompts, and agents: the system reads the target, auto-generates binary eval criteria and test inputs, presents them for your approval, then runs LLM-as-judge scoring across multiple runs per experiment to reduce noise. ISA guard rails act as invariants across all experiments — if a modification passes the metric but violates a guard rail, it auto-reverts. The loop exits at a configurable experiment budget or time cap and presents a diff with apply/reject/partial options.

  • Code with metrics, or skills/prompts/agents with LLM-as-judge
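The keep-or-revert logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: `hill_climb`, `propose`, `measure`, and `guard_rails` are hypothetical names, and a plain number stands in for a set of file modifications.

```python
import random
from typing import Callable, List, Tuple


def hill_climb(
    initial: float,
    propose: Callable[[float], float],
    measure: Callable[[float], float],
    guard_rails: List[Callable[[float], bool]],
    max_experiments: int,
    higher_is_better: bool = True,
) -> Tuple[float, float, List[str]]:
    """Keep a candidate only if it improves the score and passes every guard rail."""
    best = initial
    best_score = measure(best)  # baseline, taken before any experiment
    log: List[str] = []
    for _ in range(max_experiments):
        candidate = propose(best)   # hypothesize and modify
        score = measure(candidate)  # measure
        improved = score > best_score if higher_is_better else score < best_score
        if not all(rail(candidate) for rail in guard_rails):
            # A guard-rail violation reverts even when the score improved.
            log.append("revert: guard rail")
        elif improved:
            best, best_score = candidate, score
            log.append("keep")
        else:
            log.append("revert: no improvement")
    return best, best_score, log


if __name__ == "__main__":
    rng = random.Random(0)
    best, score, log = hill_climb(
        initial=0.0,
        propose=lambda x: x + rng.uniform(-1.0, 1.0),
        measure=lambda x: -(x - 3.0) ** 2,  # toy metric that peaks at x = 3
        guard_rails=[lambda x: x <= 2.5],   # toy invariant
        max_experiments=200,
    )
    print(best, score, log.count("keep"))
```

In the toy run the guard rail caps the search at 2.5 even though the metric alone would prefer 3, which is the behavior the guard-rail rule describes: an improvement that breaks an invariant is never kept.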

In Action

What you say to your DA, and what the Optimize skill actually does.

  • You say "optimize the lighthouse performance score for my site"
    Enters metric mode with the Lighthouse shell command as the measure, sets up a git-branch sandbox on the specified file globs, runs the hypothesis-modify-measure loop targeting the performance score, and at the end presents a diff of every kept change with the score delta.
  • You say "optimize the extractwisdom skill — it keeps giving me generic output"
    Enters eval mode on the ExtractWisdom skill directory, auto-generates binary eval criteria (does the output contain specific domain insights? does it avoid filler phrases?), generates test inputs, runs LLM-as-judge scoring across experiments, and recommends which diffs to apply when done.

Inside the Skill

The thinking, frameworks, and architecture that distinguish this skill from a generic version of the same task.

What It Does

Runs an autonomous optimization loop against any target. The agent modifies the target, measures the result, keeps improvements, discards failures, and repeats until it stops climbing. Two modes: metric mode for code targets that produce a number (latency, bundle size), and eval mode for skills, prompts, or agents judged by LLM-as-judge binary evals.

The Problem

Tuning a thing for a measurable outcome is slow, boring, manual work. You change a file, run the measurement, eyeball whether it got better, keep or revert, then do it again — dozens of times. People give up after a few rounds and settle for "good enough" far short of the real ceiling. The targets without a clean number (a skill's quality, a prompt's effectiveness) are worse: there's no easy way to tell if a change actually helped. This skill runs that whole loop for you and only keeps changes that measurably win.

How It Works

Two modes drive the same hill-climb loop:

  • Metric mode — code targets with a shell command that produces a number (the original).
  • Eval mode — skills, prompts, agents, or any text target judged by LLM-as-judge binary evals.

Inspired by Karpathy's autoresearch and extended with LLM-as-judge evaluation.

Invocation

Metric Mode (code targets)

/optimize --metric "lighthouse_score" --higher-is-better \
  --measure "npx lighthouse http://localhost:3000 --output=json --output-path=lighthouse.json" \
  --extract "jq '.categories.performance.score * 100' lighthouse.json" \
  --files "src/**/*.tsx,src/**/*.css" \
  --budget 120

/optimize --resume        # Resume a previous optimization loop
/optimize --status        # Show results summary from last/current run

Eval Mode (skill/prompt/agent targets)

/optimize --target "~/.claude/skills/ExtractWisdom"
/optimize --target "~/.claude/skills/Research/Workflows/QuickResearch.md"
/optimize --target "prompts/my-prompt.md"
/optimize --target "~/.claude/skills/ExtractWisdom" --max-experiments 20

In eval mode, the system automatically:

  1. Detects the target type (skill, prompt, agent, code, function)
  2. Reads the target to understand its purpose and constraints
  3. Generates 3-6 binary eval criteria and 3-5 test inputs
  4. Presents criteria + inputs for your approval before starting
  5. Runs the optimization loop using LLM-as-judge scoring
  6. Presents a recommendation (apply/reject/partial) when done
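One way to picture the scoring in step 5 is as a pass rate over every (test input, criterion) pair, averaged across repeated runs to dampen judge noise. The sketch below assumes that aggregation; the document does not specify the exact formula, and `eval_score`, `run_target`, and the `Judge` signature are hypothetical stand-ins for the real target and LLM judge.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical judge signature: (criterion, test_input, output) -> pass/fail.
Judge = Callable[[str, str, str], bool]


def eval_score(
    run_target: Callable[[str], str],
    judge: Judge,
    criteria: Sequence[str],
    inputs: Sequence[str],
    runs: int = 3,
) -> float:
    """Mean pass rate over `runs` repetitions of every (input, criterion) pair.

    Assumes `criteria` and `inputs` are both non-empty.
    """
    run_scores: List[float] = []
    for _ in range(runs):
        verdicts: List[bool] = []
        for test_input in inputs:
            output = run_target(test_input)  # one generation per input per run
            verdicts.extend(judge(c, test_input, output) for c in criteria)
        run_scores.append(sum(verdicts) / len(verdicts))
    return mean(run_scores)
```

Because each criterion is binary, the score is easy to read: 0.75 means three of every four checks passed on average. Raising `runs` trades time for a steadier estimate.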

What Happens

This skill triggers the PAI Algorithm in mode: optimize:

  1. OBSERVE — Define or auto-detect the target, set eval_mode
  2. THINK — Analyze codebase/skill, generate hypothesis queue
  3. PLAN — Prioritize hypotheses by expected impact
  4. BUILD — Phase 0: TARGET ANALYSIS (see optimize-loop.md)
    • Detect target type, auto-generate eval criteria (eval mode), set up sandbox, baseline
  5. EXECUTE — The autonomous loop (optimize-loop.md):
    • Hypothesize → Modify target → Measure (metric or eval) → Keep/Revert → Repeat
    • Metric mode: ~12 experiments/hour (at 5-min budget)
    • Eval mode: ~6-8 experiments/hour (multi-run judging is slower)
  6. VERIFY — Phase 9: RECOMMEND — diff, summary, apply/reject/partial options
  7. LEARN — Phase 10: EXTRACT LEARNINGS — what worked, what didn't, structured insights
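The metric-mode throughput figure follows from simple arithmetic on the per-experiment budget. A rough sketch, assuming each experiment consumes its whole time budget and nothing else (both function names are illustrative):

```python
def experiments_per_hour(budget_seconds: int) -> float:
    """Rough rate if every experiment uses exactly its time budget."""
    return 3600 / budget_seconds


def hours_needed(max_experiments: int, per_hour: float) -> float:
    """Wall-clock estimate for a run capped at `max_experiments`."""
    return max_experiments / per_hour


if __name__ == "__main__":
    print(experiments_per_hour(300))  # the default 300-second budget
    print(hours_needed(20, 8))        # 20 experiments at 8 per hour
```

At the eval-mode estimate of 6 to 8 experiments per hour, a run capped at 20 experiments works out to roughly 2.5 to 3.3 hours.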

Arguments — Metric Mode

Argument             Required  Default                Description
--metric NAME        yes       -                      Human-readable metric name
--measure COMMAND    yes       -                      Shell command that produces the metric
--files GLOB         yes       -                      Files the agent may modify (comma-separated)
--higher-is-better   no        (default)              Higher metric values are better
--lower-is-better    no        -                      Lower metric values are better
--extract COMMAND    no        Last number in stdout  Extract metric from output
--budget SECONDS     no        300                    Time budget per experiment
--target VALUE       no        none                   Stop when metric reaches this value
--max-experiments N  no        none                   Stop after N experiments
--locked GLOB        no        none                   Files the agent must NOT modify
--constraints TEXT   no        none                   Additional rules (e.g., "tests must pass")
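The default for --extract, "Last number in stdout", is not specified further here. One plausible reading is sketched below with a hypothetical helper, `last_number`; the real parser may treat signs, exponents, or version-like strings differently.

```python
import re
from typing import Optional

# Integers, decimals, and optional exponents, e.g. 42, -3.5, 1.07e-1.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def last_number(stdout: str) -> Optional[float]:
    """Return the last numeric token in `stdout`, or None if there is none."""
    matches = _NUMBER.findall(stdout)
    return float(matches[-1]) if matches else None


if __name__ == "__main__":
    # Build noise followed by the value that matters on the final line.
    print(last_number("Built in 2.3s\n184320\tdist/"))
```

Under this reading, a measure command should print the metric last. A trailing token such as `chunk-3` would be picked up as -3, which is a good reason to pass an explicit --extract command when the output is noisy.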

Arguments — Eval Mode

Argument              Required  Default         Description
--target PATH         yes       -               Path to skill directory, prompt file, or agent definition
--max-experiments N   no        none            Stop after N experiments
--runs N              no        3               Runs per experiment (more = more reliable, slower)
--criteria "Q1" "Q2"  no        auto-generated  Override auto-generated eval criteria
--inputs "I1" "I2"    no        auto-generated  Override auto-generated test inputs
--budget SECONDS      no        300             Time budget per experiment

Shared Arguments

Argument  Description
--resume  Resume a previous optimization run
--status  Show results summary

Algorithm Integration

When /optimize is invoked, the Algorithm enters with mode: optimize in the ISA frontmatter. The eval_mode is set based on arguments:

  • --measure provided → eval_mode: metric (git branch sandbox)
  • --target PATH provided without --measure → eval_mode: eval (directory sandbox)
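The selection rule can be written as a small function. This is a sketch with hypothetical names (`select_eval_mode`, `SANDBOX`). Letting --measure take precedence when both flags appear is an assumption, based on the metric-mode example that passes --target 95 as a stop value alongside --measure.

```python
from typing import Optional

# Sandbox type per mode, as listed in the bullets above.
SANDBOX = {"metric": "git branch", "eval": "directory"}


def select_eval_mode(measure: Optional[str], target: Optional[str]) -> str:
    """Pick the eval_mode from the parsed /optimize arguments."""
    if measure:
        # Assumption: with --measure present, --target is a stop value, not a path.
        return "metric"
    if target:
        return "eval"
    raise ValueError("either --measure or --target is required")


if __name__ == "__main__":
    mode = select_eval_mode(None, "prompts/my-prompt.md")
    print(mode, SANDBOX[mode])
```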

ISC criteria become guard rails: assertions that must remain satisfied across ALL experiments. A violation triggers automatic revert regardless of score improvement.

Reference files:

  • ~/.claude/PAI/ALGORITHM/optimize-loop.md — the full loop protocol
  • ~/.claude/PAI/ALGORITHM/eval-guide.md — how to write good eval criteria
  • ~/.claude/PAI/ALGORITHM/target-types.md — target detection and ISC generation

Examples

Metric Mode

Optimize the Lighthouse performance score:

/optimize --metric "lighthouse_perf" --higher-is-better \
  --measure "npx lighthouse http://localhost:3000 --output=json --output-path=lh.json" \
  --extract "jq '.categories.performance.score * 100' lh.json" \
  --files "src/**/*.tsx,src/**/*.css" \
  --target 95 --budget 120

Optimize bundle size:

/optimize --metric "bundle_bytes" --lower-is-better \
  --measure "bun run build 2>&1 && du -sb dist/ | cut -f1" \
  --files "src/**/*.ts" \
  --constraints "all tests must pass"
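The `du -sb` flags in the measure command above are GNU-specific. Where they are unavailable, a small script can print the number instead. The sketch below is one possible stand-in, with `total_bytes` and the `dist` default as illustrative choices. It sums regular-file sizes only, so it can report a somewhat different number than `du`, which also counts directory entries.

```python
import sys
from pathlib import Path


def total_bytes(root: str) -> int:
    """Sum the sizes of all regular files under `root`, recursively."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


if __name__ == "__main__":
    # Print only the number so it is the last number in stdout.
    print(total_bytes(sys.argv[1] if len(sys.argv) > 1 else "dist"))
```

Saved under a name of your choosing (say, a hypothetical `measure_size.py`), it could replace the `du` portion of the --measure command. What matters for the loop is that the same measurement is used for the baseline and for every experiment.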

ML training (Karpathy-style):

/optimize --metric "val_bpb" --lower-is-better \
  --measure "uv run train.py > run.log 2>&1 && grep '^val_bpb:' run.log | cut -d' ' -f2" \
  --files "train.py" \
  --locked "prepare.py" \
  --budget 300

Eval Mode

Optimize a skill's Extract workflow:

/optimize --target "~/.claude/skills/ExtractWisdom" --max-experiments 15

Optimize a standalone prompt:

/optimize --target "prompts/summarize-article.md" --runs 5

Optimize with custom criteria:

/optimize --target "~/.claude/skills/Research/Workflows/QuickResearch.md" \
  --criteria "Does the output contain specific facts with sources?" \
            "Is the output structured with clear sections?" \
            "Does the output avoid generic filler?" \
  --inputs "research quantum computing breakthroughs 2025" \
           "quick research on supply chain security" \
           "find recent developments in AI agents"

Gotchas

  • Hill-climbing can get stuck in local optima. If the score plateaus, consider restarting the loop from a different starting point.
  • Eval mode vs metric mode: Use metric mode for quantifiable targets (latency, size). Use eval mode for qualitative targets (skill quality, prompt effectiveness).
  • Regression tolerance prevents catastrophic changes. Don't set it to 0: some regression in secondary metrics is acceptable if the primary metric improves significantly.

How to Invoke

Say any of these to your DA and PAI activates the Optimize skill automatically:

  • "optimize"
  • "hill climb"
  • "improve metric"
  • "reduce latency"
  • "optimize skill"
  • "optimize prompt"
  • "eval mode"

Or invoke explicitly:

Skill("Optimize")

