Development Standard

Optimize

Autonomous optimization loop: hill-climb any target.

Effort: medium

The Problem

Ask a generic AI to improve your code's performance or your skill's output quality and you get one suggestion, applied once, with no way to know if it actually helped. There's no measurement loop, no systematic hypothesis testing, and no memory of what was tried. You're editing by feel, not by signal. Skills and prompts are even worse — there's no obvious metric to track, so the AI just rewrites things in ways that sound better without any evidence they are.

How This Skill Approaches It

Optimize runs an autonomous hill-climbing loop with two modes.

Metric mode is for code targets: you give it a shell command that produces a number (Lighthouse score, bundle bytes, validation loss), a set of files it can modify, and a time budget per experiment. The loop hypothesizes, modifies, measures, keeps improvements, reverts failures, and repeats, at roughly 12 experiments per hour with the default 5-minute budget.

Eval mode is for skills, prompts, and agents: the system reads the target, auto-generates binary eval criteria and test inputs, presents them for your approval, then scores each experiment with an LLM-as-judge over multiple runs to reduce noise.

ISA guard rails act as invariants across all experiments: if a modification improves the metric but violates a guard rail, it is automatically reverted. The loop exits at a configurable experiment budget or time cap and presents a diff with apply/reject/partial options.
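The keep-or-revert loop described above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: `hill_climb`, `propose`, and `measure` are hypothetical names, the "target" is a single float rather than a set of files, and the toy metric stands in for a real shell measurement.

```python
import random
from typing import Callable


def hill_climb(
    state: float,
    propose: Callable[[float], float],
    measure: Callable[[float], float],
    max_experiments: int,
    higher_is_better: bool = True,
) -> tuple[float, float, int]:
    """Keep a candidate only when it measurably beats the best score so far."""
    best_score = measure(state)  # baseline measurement before any change
    kept = 0
    for _ in range(max_experiments):
        candidate = propose(state)  # hypothesize and modify
        score = measure(candidate)  # measure the modified target
        if higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            state, best_score = candidate, score  # keep the change
            kept += 1
        # Otherwise the candidate is discarded, which plays the role of a revert.
    return state, best_score, kept


# Toy target: maximize -(x - 3)^2, whose optimum is at x = 3.
rng = random.Random(0)
final_state, final_score, kept = hill_climb(
    state=0.0,
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    measure=lambda x: -((x - 3.0) ** 2),
    max_experiments=200,
)
```

Because a change is kept only on strict improvement, the best score never gets worse from one experiment to the next. That property is what makes an unattended run safe to leave alone.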

Code with metrics, or skills/prompts/agents with LLM-as-judge

In Action

What you say to your DA, and what the Optimize skill actually does.

You say "optimize the lighthouse performance score for my site"

Enters metric mode with the Lighthouse shell command as the measure, sets up a git-branch sandbox on the specified file globs, runs the hypothesis-modify-measure loop targeting the performance score, and at the end presents a diff of every kept change with the score delta.

You say "optimize the extractwisdom skill, it keeps giving me generic output"

Enters eval mode on the ExtractWisdom skill directory, auto-generates binary eval criteria (does the output contain specific domain insights? does it avoid filler phrases?), generates test inputs, runs LLM-as-judge scoring across experiments, and recommends which diffs to apply when done.

Inside the Skill

The thinking, frameworks, and architecture that distinguish this skill from a generic version of the same task.

What It Does

Runs an autonomous optimization loop against any target. The agent modifies the target, measures the result, keeps improvements, discards failures, and repeats until it stops climbing. Two modes: metric mode for code targets that produce a number (latency, bundle size), and eval mode for skills, prompts, or agents judged by LLM-as-judge binary evals.

The Problem

Tuning a thing for a measurable outcome is slow, boring, manual work. You change a file, run the measurement, eyeball whether it got better, keep or revert, then do it again — dozens of times. People give up after a few rounds and settle for "good enough" far short of the real ceiling. The targets without a clean number (a skill's quality, a prompt's effectiveness) are worse: there's no easy way to tell if a change actually helped. This skill runs that whole loop for you and only keeps changes that measurably win.

How It Works

Two modes drive the same hill-climb loop:

Metric mode — code targets with a shell command that produces a number (the original).
Eval mode — skills, prompts, agents, or any text target judged by LLM-as-judge binary evals.

Inspired by Karpathy's autoresearch and extended with LLM-as-judge evaluation.

Invocation

Metric Mode (code targets)

/optimize --metric "lighthouse_score" --higher-is-better \
  --measure "npx lighthouse http://localhost:3000 --output=json --output-path=lighthouse.json" \
  --extract "jq '.categories.performance.score * 100' lighthouse.json" \
  --files "src/**/*.tsx,src/**/*.css" \
  --budget 120

/optimize --resume        # Resume a previous optimization loop
/optimize --status        # Show results summary from last/current run

Eval Mode (skill/prompt/agent targets)

/optimize --target "~/.claude/skills/ExtractWisdom"
/optimize --target "~/.claude/skills/Research/Workflows/QuickResearch.md"
/optimize --target "prompts/my-prompt.md"
/optimize --target "~/.claude/skills/ExtractWisdom" --max-experiments 20

In eval mode, the system automatically:

Detects the target type (skill, prompt, agent, code, function)
Reads the target to understand its purpose and constraints
Generates 3-6 binary eval criteria and 3-5 test inputs
Presents criteria + inputs for your approval before starting
Runs the optimization loop using LLM-as-judge scoring
Presents a recommendation (apply/reject/partial) when done
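One plausible way to turn binary criteria, test inputs, and repeated runs into a single score is the pass rate over every (run, input, criterion) check. The sketch below assumes that aggregation; the source does not specify how the skill combines verdicts. `eval_score`, `fake_target`, and `fake_judge` are hypothetical stand-ins, with simple string checks in place of a real LLM judge.

```python
from statistics import mean
from typing import Callable

# A judge answers one yes/no criterion about one output.
Judge = Callable[[str, str], bool]


def eval_score(
    run_target: Callable[[str], str],
    judge: Judge,
    criteria: list[str],
    inputs: list[str],
    runs: int = 3,
) -> float:
    """Fraction of (run, input, criterion) checks that pass, in [0, 1]."""
    verdicts = []
    for _ in range(runs):
        for test_input in inputs:
            output = run_target(test_input)
            for criterion in criteria:
                verdicts.append(1.0 if judge(output, criterion) else 0.0)
    return mean(verdicts)


def fake_target(test_input: str) -> str:
    # Stand-in for running the skill or prompt under test.
    return f"Summary of {test_input}: see source [1]."


def fake_judge(output: str, criterion: str) -> bool:
    # Stand-in for an LLM judge: crude string checks keyed on the criterion.
    if "sources" in criterion:
        return "[1]" in output
    if "filler" in criterion:
        return "in today's world" not in output.lower()
    return False


score = eval_score(
    fake_target,
    fake_judge,
    criteria=[
        "Does the output contain specific facts with sources?",
        "Does the output avoid generic filler?",
        "Is the output structured with clear sections?",
    ],
    inputs=[
        "quick research on supply chain security",
        "find recent developments in AI agents",
    ],
)
```

With a real judge the verdicts vary between runs, so averaging over more runs steadies the score and costs more judge calls per experiment.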

What Happens

This skill triggers the PAI Algorithm with mode: optimize. The phases run as follows:

OBSERVE — Define or auto-detect the target, set eval_mode
THINK — Analyze codebase/skill, generate hypothesis queue
PLAN — Prioritize hypotheses by expected impact
BUILD — Phase 0: TARGET ANALYSIS (see optimize-loop.md)
- Detect target type, auto-generate eval criteria (eval mode), set up sandbox, baseline
EXECUTE — The autonomous loop (optimize-loop.md):
- Hypothesize → Modify target → Measure (metric or eval) → Keep/Revert → Repeat
- Metric mode: ~12 experiments/hour (at 5-min budget)
- Eval mode: ~6-8 experiments/hour (multi-run judging is slower)
VERIFY — Phase 9: RECOMMEND — diff, summary, apply/reject/partial options
LEARN — Phase 10: EXTRACT LEARNINGS — what worked, what didn't, structured insights
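The metric-mode throughput figure follows from the per-experiment budget: 3600 seconds divided by a 300-second budget gives 12 experiments per hour. The helper below is a hypothetical illustration of that arithmetic. It gives an upper bound, since it assumes each experiment uses its whole budget and ignores time spent hypothesizing and editing.

```python
def max_experiments_per_hour(budget_seconds: int) -> int:
    """Upper bound on experiments per hour when each one uses its full budget."""
    if budget_seconds <= 0:
        raise ValueError("budget_seconds must be positive")
    return 3600 // budget_seconds


default_rate = max_experiments_per_hour(300)  # the default --budget
short_rate = max_experiments_per_hour(120)    # the --budget 120 examples
```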

Arguments — Metric Mode

Argument	Required	Default	Description
`--metric NAME`	yes		Human-readable metric name
`--measure COMMAND`	yes		Shell command that produces the metric
`--files GLOB`	yes		Files the agent may modify (comma-separated)
`--higher-is-better`		(default)	Higher metric values are better
`--lower-is-better`			Lower metric values are better
`--extract COMMAND`		Last number in stdout	Extract metric from output
`--budget SECONDS`		300	Time budget per experiment
`--target VALUE`		none	Stop when metric reaches this value
`--max-experiments N`		none	Stop after N experiments
`--locked GLOB`		none	Files the agent must NOT modify
`--constraints TEXT`		none	Additional rules (e.g., "tests must pass")
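When `--extract` is omitted, the table says the metric is the last number in stdout. A minimal sketch of that rule follows, with a helper that runs a measure command under a time budget. The regular expression is an assumption about what counts as a number, and `last_number` and `run_measure` are hypothetical names rather than part of the skill.

```python
import re
import subprocess
from typing import Optional

# Assumed definition of a number: optional sign, digits, optional
# decimal part, optional exponent.
NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def last_number(stdout: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = NUMBER.findall(stdout)
    return float(matches[-1]) if matches else None


def run_measure(command: str, budget_seconds: int = 300) -> Optional[float]:
    """Run a measure command under a time budget; None on failure or timeout."""
    try:
        result = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=budget_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return last_number(result.stdout)
```

Under this rule the metric has to be the final number printed. A build that logs timings after the size line would need an explicit `--extract` command instead.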

Arguments — Eval Mode

Argument	Required	Default	Description
`--target PATH`	yes		Path to skill directory, prompt file, or agent definition
`--max-experiments N`		none	Stop after N experiments
`--runs N`		3	Runs per experiment (more = more reliable, slower)
`--criteria "Q1" "Q2"`		auto-generated	Override auto-generated eval criteria
`--inputs "I1" "I2"`		auto-generated	Override auto-generated test inputs
`--budget SECONDS`		300	Time budget per experiment

Shared Arguments

Argument	Description
`--resume`	Resume a previous optimization run
`--status`	Show results summary

Algorithm Integration

When /optimize is invoked, the Algorithm enters with mode: optimize in the ISA frontmatter. The eval_mode is set based on arguments:

--measure provided → eval_mode: metric (git branch sandbox), even when a numeric --target VALUE stop condition is also given
--target PATH provided without --measure → eval_mode: eval (directory sandbox)

ISC criteria become guard rails: assertions that must hold across ALL experiments and REMAIN satisfied for the whole run. A violation triggers an automatic revert regardless of score improvement.
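The guard-rail rule amounts to checking invariants before looking at the score. The sketch below shows that ordering under stated assumptions: `Experiment`, its fields, and `decide` are hypothetical, and real guard rails would be derived from the ISC criteria rather than from two hard-coded booleans.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Experiment:
    score: float
    tests_pass: bool           # hypothetical guard-rail signal
    touched_locked_file: bool  # hypothetical guard-rail signal


GuardRail = Callable[[Experiment], bool]


def decide(
    exp: Experiment,
    best_score: float,
    guard_rails: list[GuardRail],
    higher_is_better: bool = True,
) -> str:
    """Return "keep" or "revert"; any violated guard rail forces a revert."""
    if not all(rail(exp) for rail in guard_rails):
        return "revert"  # a violation overrides any score gain
    if higher_is_better:
        improved = exp.score > best_score
    else:
        improved = exp.score < best_score
    return "keep" if improved else "revert"


rails: list[GuardRail] = [
    lambda e: e.tests_pass,
    lambda e: not e.touched_locked_file,
]

broken_but_faster = decide(Experiment(97.0, False, False), 90.0, rails)
clean_and_faster = decide(Experiment(97.0, True, False), 90.0, rails)
clean_but_slower = decide(Experiment(85.0, True, False), 90.0, rails)
```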

Reference files:

~/.claude/PAI/ALGORITHM/optimize-loop.md — the full loop protocol
~/.claude/PAI/ALGORITHM/eval-guide.md — how to write good eval criteria
~/.claude/PAI/ALGORITHM/target-types.md — target detection and ISC generation

Examples

Metric Mode

Optimize the Lighthouse performance score:

/optimize --metric "lighthouse_perf" --higher-is-better \
  --measure "npx lighthouse http://localhost:3000 --output=json --output-path=lh.json" \
  --extract "jq '.categories.performance.score * 100' lh.json" \
  --files "src/**/*.tsx,src/**/*.css" \
  --target 95 --budget 120

Optimize bundle size:

/optimize --metric "bundle_bytes" --lower-is-better \
  --measure "bun run build 2>&1 && du -sb dist/ | cut -f1" \
  --files "src/**/*.ts" \
  --constraints "all tests must pass"

ML training (Karpathy-style):

/optimize --metric "val_bpb" --lower-is-better \
  --measure "uv run train.py > run.log 2>&1 && grep '^val_bpb:' run.log | cut -d' ' -f2" \
  --files "train.py" \
  --locked "prepare.py" \
  --budget 300

Eval Mode

Optimize a skill's Extract workflow:

/optimize --target "~/.claude/skills/ExtractWisdom" --max-experiments 15

Optimize a standalone prompt:

/optimize --target "prompts/summarize-article.md" --runs 5

Optimize with custom criteria:

/optimize --target "~/.claude/skills/Research/Workflows/QuickResearch.md" \
  --criteria "Does the output contain specific facts with sources?" \
            "Is the output structured with clear sections?" \
            "Does the output avoid generic filler?" \
  --inputs "research quantum computing breakthroughs 2025" \
           "quick research on supply chain security" \
           "find recent developments in AI agents"

Gotchas

Hill-climbing can get stuck in local optima. If score plateaus, consider resetting with different initial conditions.
Eval mode vs metric mode: Use metric mode for quantifiable targets (latency, size). Use eval mode for qualitative targets (skill quality, prompt effectiveness).
Regression tolerance prevents catastrophic changes. Don't set it to 0 — some regression in secondary metrics is acceptable if primary metric improves significantly.
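To act on the first gotcha you need some way to notice a plateau. One simple heuristic, offered as an assumption rather than as the skill's behavior, is to check whether the best score has moved over the last few experiments. `has_plateaued`, `window`, and `min_gain` are hypothetical names.

```python
def has_plateaued(
    best_scores: list[float],
    window: int = 5,
    min_gain: float = 0.0,
    higher_is_better: bool = True,
) -> bool:
    """best_scores[i] is the best score seen after experiment i.

    Returns True when the best score improved by no more than min_gain
    over the last `window` experiments.
    """
    if len(best_scores) <= window:
        return False  # not enough history to judge
    gain = best_scores[-1] - best_scores[-1 - window]
    if not higher_is_better:
        gain = -gain
    return gain <= min_gain


stuck = has_plateaued([80, 85, 90, 90, 90, 90, 90, 90], window=5)
climbing = has_plateaued([80, 85, 90, 91, 92, 93, 94, 95], window=5)
```

A True result is a cue to stop the run or restart it from a different starting point, since further small edits are unlikely to leave the local optimum.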

How to Invoke

Say any of these to your DA and PAI activates the Optimize skill automatically:

"optimize"
"hill climb"
"improve metric"
"reduce latency"
"optimize skill"
"optimize prompt"
"eval mode"

Or invoke explicitly:

Skill("Optimize")
