AudioEditor

AI audio editing pipeline: Whisper word-level transcription → Claude segment classification (KEEP/CUT_FILLER/CUT_FALSE_START/CUT_STUTTER/CUT_DEAD_AIR) → ffmpeg with 40ms qsin crossfades and room-tone fill → optional Cleanvoice cloud polish.

The Problem

Raw recordings are full of garbage: filler words, dead air, false starts, stutters. You can manually edit in a DAW, but that takes hours per episode and you still end up making judgment calls on every pause — is that silence rhetorical or accidental? Generic AI transcription tools will give you a transcript but won't touch the audio. And most automated cutters are blunt: they'll slice out a deliberate pause for effect the same way they cut a dead-air gap where you lost your train of thought.

How This Skill Approaches It

The full pipeline runs in four stages. Whisper transcribes with word-level timestamps. Claude then classifies each segment as KEEP, CUT_FILLER, CUT_FALSE_START, CUT_STUTTER, or CUT_DEAD_AIR, distinguishing a rhetorical pause from accidental silence before any cut is made. ffmpeg executes the edits with 40ms qsin crossfades at every cut point and fills gaps with extracted room tone so the edits don't click. Breaths are attenuated to 50% volume rather than removed, because cutting them entirely sounds unnatural. An optional Cleanvoice API pass handles mouth sounds and loudness normalization. Three flags control the run: --preview shows proposed edits before touching the file, --aggressive tightens thresholds for denser cleanup, and --polish adds the Cleanvoice final pass.

Distinguishes rhetorical from accidental pauses; breaths attenuated 50%
Modes: --preview, --aggressive, --polish
Workflow: Clean

Not for video composition (use Remotion)
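
As a rough illustration of how word-level timestamps expose pauses, the sketch below collects the silences between consecutive words. The `Word` shape and the 1.5-second `minGap` default are assumptions made for this example, not the skill's actual transcript format or threshold. It only surfaces candidates; telling a rhetorical pause from an accidental one is the classification step's job.

```typescript
// Hypothetical transcript shape: one entry per word, times in seconds.
interface Word {
  text: string;
  start: number;
  end: number;
}

interface PauseCandidate {
  afterWord: string;
  start: number;
  end: number;
  duration: number;
}

// Collect silences between consecutive words that last at least minGap seconds.
// The default threshold is an illustrative value, not one taken from the skill.
function findPauseCandidates(words: Word[], minGap: number = 1.5): PauseCandidate[] {
  const candidates: PauseCandidate[] = [];
  for (let i = 0; i + 1 < words.length; i++) {
    const gapStart = words[i].end;
    const gapEnd = words[i + 1].start;
    const duration = gapEnd - gapStart;
    if (duration >= minGap) {
      candidates.push({ afterWord: words[i].text, start: gapStart, end: gapEnd, duration });
    }
  }
  return candidates;
}
```

Each candidate would then be handed to the classifier along with the surrounding words, so the decision is made with context rather than on duration alone.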

In Action

What you say to your DA, and what the AudioEditor skill actually does.

You say "clean up the audio on this podcast recording"

Runs the Clean workflow: Transcribe.ts → Analyze.ts (Claude classifies every segment) → Edit.ts (ffmpeg cuts with 40ms crossfades and room-tone fill) → outputs cleaned MP3.

You say "show me what edits you'd make before you touch the file"

Runs Clean with --preview: transcribes and classifies all segments, shows the proposed cut list with timestamps and reasons, makes no changes until you approve.
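
A preview cut list with timestamps and reasons might be rendered along these lines. This is a sketch only: the `ProposedCut` shape, the `mm:ss.mmm` format, and both function names are hypothetical, and the skill's real preview output may look different.

```typescript
// Hypothetical shape of one proposed cut; times in seconds.
interface ProposedCut {
  start: number;
  end: number;
  label: string;
  reason: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) text = "0" + text;
  return text;
}

// Render seconds as mm:ss.mmm.
function formatTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const minutes = Math.floor(totalMs / 60000);
  const secs = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  return pad(minutes, 2) + ":" + pad(secs, 2) + "." + pad(ms, 3);
}

// One line per proposed cut: time range, label, and the classifier's reason.
function formatCutList(cuts: ProposedCut[]): string {
  return cuts
    .map((c) => formatTimestamp(c.start) + "-" + formatTimestamp(c.end) + "  " + c.label + "  " + c.reason)
    .join("\n");
}
```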

Inside the Skill

The thinking, frameworks, and architecture that distinguish this skill from a generic version of the same task.

What It Does

Cleans recorded audio automatically — strips filler words, false starts, stutters, and dead air, attenuates breaths, and crossfades every cut. It transcribes the file at the word level, has Claude classify each segment (KEEP, CUT_FILLER, CUT_FALSE_START, CUT_STUTTER, CUT_DEAD_AIR), then executes the cuts with ffmpeg. An optional Cleanvoice pass adds final polish. Modes: --preview, --aggressive, --polish.

The Problem

Cleaning a recording by hand means scrubbing a waveform for every "um," half-started sentence, and three-second silence, then crossfading each cut so it doesn't click. It's slow and tedious, and a blunt auto-tool over-cuts — it kills the rhetorical pause along with the accidental one, or leaves an audible seam where it spliced. This pipeline tells deliberate pauses apart from dead air, fills gaps with room tone, and crossfades each edit, so the output sounds clean rather than chopped.

How It Works

Whisper produces word-level timestamps, Claude classifies each segment (distinguishing rhetorical emphasis from accidental repetition), and ffmpeg executes the cuts with 40ms qsin crossfades, room-tone gap fill, and breath attenuation at 50% volume rather than removal. An optional Cleanvoice API pass handles mouth-sound removal, residual filler, and loudness normalization.
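
To make the hand-off from classification to editing concrete, here is a minimal sketch of turning classified segments into the ranges of audio that survive. The label names come from this page (CUT_EDIT_MARKER appears only in the pipeline diagram); the `Segment` and `Range` shapes and the `keepRanges` function are assumptions for illustration, not the format Analyze.ts actually emits.

```typescript
type Label =
  | "KEEP"
  | "CUT_FILLER"
  | "CUT_FALSE_START"
  | "CUT_EDIT_MARKER"
  | "CUT_STUTTER"
  | "CUT_DEAD_AIR";

// Hypothetical shape of one classified segment; times in seconds.
interface Segment {
  start: number;
  end: number;
  label: Label;
  reason?: string;
}

interface Range {
  start: number;
  end: number;
}

// Invert the cut segments into the ranges of audio that remain after editing.
// Overlapping or adjacent cuts are absorbed by advancing the cursor.
function keepRanges(segments: Segment[], totalDuration: number): Range[] {
  const cuts = segments
    .filter((s) => s.label !== "KEEP")
    .sort((a, b) => a.start - b.start);

  const kept: Range[] = [];
  let cursor = 0;
  for (const cut of cuts) {
    if (cut.start > cursor) kept.push({ start: cursor, end: cut.start });
    cursor = Math.max(cursor, cut.end);
  }
  if (cursor < totalDuration) kept.push({ start: cursor, end: totalDuration });
  return kept;
}
```

Working from the cuts rather than the KEEP segments means any stretch the classifier did not label is kept by default, which is the safer failure mode for an editor.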

Pipeline

Audio Input
    |
[Transcribe] Whisper word-level timestamps (insanely-fast-whisper on MPS)
    |
[Analyze] Claude classifies each segment:
    |   KEEP / CUT_FILLER / CUT_FALSE_START / CUT_EDIT_MARKER / CUT_STUTTER / CUT_DEAD_AIR
    |   Distinguishes rhetorical emphasis from accidental repetition
    |
[Edit] ffmpeg executes cuts:
    |   - 40ms qsin crossfades at every edit point
    |   - Room tone extraction and gap filling
    |   - Breath attenuation (50% volume, not removal)
    |
[Polish] (optional) Cleanvoice API final pass:
        - Mouth sound removal
        - Remaining filler detection
        - Loudness normalization

Output: cleaned MP3/WAV
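
For the Edit stage, one way to express the crossfades and breath attenuation is as ffmpeg filter strings. The sketch below uses ffmpeg's `atrim`, `asetpts`, `acrossfade` (with its `qsin` curve), and `volume` filters. How Edit.ts actually builds its filter graph is not documented here, so the function names, the `Range` shape, and the reading of "50% volume" as a gain of 0.5 are all assumptions. Room-tone fill is not covered.

```typescript
interface Range {
  start: number; // seconds on the source timeline
  end: number;
}

const CROSSFADE_SECONDS = 0.04; // 40ms, matching the pipeline description

// Build a filter_complex graph that trims each kept range out of the first
// input's audio and chains the pieces together with qsin crossfades.
function buildCrossfadeGraph(keep: Range[]): { graph: string; outLabel: string } {
  if (keep.length === 0) throw new Error("need at least one range to keep");

  const parts: string[] = keep.map(
    (r, i) => `[0:a]atrim=start=${r.start}:end=${r.end},asetpts=PTS-STARTPTS[s${i}]`
  );

  let prev = "s0";
  for (let i = 1; i < keep.length; i++) {
    const next = `x${i}`;
    parts.push(`[${prev}][s${i}]acrossfade=d=${CROSSFADE_SECONDS}:c1=qsin:c2=qsin[${next}]`);
    prev = next;
  }
  return { graph: parts.join(";"), outLabel: prev };
}

// Halve the volume only inside the given breath ranges; audio elsewhere is
// left untouched. Returns the pass-through filter when there are no breaths.
function breathAttenuationFilter(breaths: Range[]): string {
  if (breaths.length === 0) return "anull";
  const windows = breaths.map((b) => `between(t,${b.start},${b.end})`).join("+");
  return `volume=0.5:enable='${windows}'`;
}
```

The graph would be passed to ffmpeg as `-filter_complex` with `-map "[outLabel]"`. Note that `acrossfade` overlaps the two pieces it joins, so each join shortens the output by the crossfade length, and every kept range needs to be longer than 40ms. The breath filter uses source-timeline timestamps, so it would have to run before the trims.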

Tools

Tool	Command	Purpose
Transcribe	`bun ${CLAUDE_SKILL_DIR}/Tools/Transcribe.ts <file>`	Word-level transcription via Whisper
Analyze	`bun ${CLAUDE_SKILL_DIR}/Tools/Analyze.ts <transcript.json>`	LLM-powered edit classification
Edit	`bun ${CLAUDE_SKILL_DIR}/Tools/Edit.ts <file> <edits.json>`	Execute cuts with crossfades + room tone
Polish	`bun ${CLAUDE_SKILL_DIR}/Tools/Polish.ts <file>`	Cleanvoice API cloud polish
Pipeline	`bun ${CLAUDE_SKILL_DIR}/Tools/Pipeline.ts <file> [--polish]`	Full end-to-end pipeline
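
If you want to run the stages one at a time rather than through Pipeline.ts, the commands in the table can be assembled as argument arrays. This is a sketch under assumptions: the page does not say where each tool writes its output, so the `transcript` and `edits` paths in `PipelinePaths` are hypothetical and supplied by the caller, as are the helper names.

```typescript
// Hypothetical file locations; the tools' real output paths are not documented here.
interface PipelinePaths {
  audio: string;
  transcript: string;
  edits: string;
}

// Build the argv for one tool, following the `bun <skill dir>/Tools/<Tool>.ts` pattern.
function toolCommand(skillDir: string, tool: string, args: string[]): string[] {
  return ["bun", skillDir + "/Tools/" + tool + ".ts"].concat(args);
}

// The three core stages in order: transcribe, analyze, edit.
function stepCommands(skillDir: string, paths: PipelinePaths): string[][] {
  return [
    toolCommand(skillDir, "Transcribe", [paths.audio]),
    toolCommand(skillDir, "Analyze", [paths.transcript]),
    toolCommand(skillDir, "Edit", [paths.audio, paths.edits]),
  ];
}
```

Keeping each command as an array rather than a shell string avoids quoting problems with file names that contain spaces.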

API Keys Required

Service	Env Var	Where to Get
Anthropic (for analyze step)	`ANTHROPIC_API_KEY`	Already set via Claude Code
Cleanvoice (for polish step, optional)	`CLEANVOICE_API_KEY`	cleanvoice.ai → Dashboard → Settings → API Key
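
A quick pre-flight check for these keys could look like the following. It is a sketch, not part of the skill: `missingKeys` is a hypothetical helper, and it takes the environment as a parameter so it can be tested without touching real credentials.

```typescript
type Env = Record<string, string | undefined>;

// Return the names of required environment variables that are unset or empty.
// CLEANVOICE_API_KEY is only required when the polish step will run.
function missingKeys(env: Env, usePolish: boolean): string[] {
  const required = ["ANTHROPIC_API_KEY"];
  if (usePolish) required.push("CLEANVOICE_API_KEY");
  return required.filter((name) => !env[name]);
}
```

In a Bun or Node script you would call it as `missingKeys(process.env, true)` before starting a run with --polish, and stop early if the result is non-empty.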

Examples

Example 1: Clean a podcast recording

User: "clean up the audio on this podcast file"
-> Invokes Clean workflow
-> Runs full pipeline: transcribe -> analyze -> edit
-> Outputs cleaned MP3 with filler words, stutters, and dead air removed

Example 2: Preview edits before applying

User: "show me what edits you'd make to this recording"
-> Invokes Clean workflow with --preview flag
-> Transcribes and analyzes, shows proposed edits without modifying audio
-> User reviews edit list, then runs again to apply

Example 3: Aggressive clean with cloud polish

User: "aggressively clean this audio and polish it"
-> Invokes Clean workflow with --aggressive --polish flags
-> Tighter thresholds for filler detection
-> Cleanvoice API pass for mouth sounds and normalization
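
The three examples differ only in which flags are set, and the flags mostly decide which stages run. The sketch below maps flags to stages as the examples describe them. `parseFlags`, `planSteps`, and the rule that --preview wins over --polish when both are given are assumptions for illustration; --aggressive changes thresholds rather than stages, so it does not affect the plan.

```typescript
type Step = "transcribe" | "analyze" | "edit" | "polish";

interface CleanFlags {
  preview: boolean;
  aggressive: boolean;
  polish: boolean;
}

// Read the three documented flags from an argument list.
function parseFlags(args: string[]): CleanFlags {
  return {
    preview: args.indexOf("--preview") !== -1,
    aggressive: args.indexOf("--aggressive") !== -1,
    polish: args.indexOf("--polish") !== -1,
  };
}

// Decide which stages run. Preview stops after analysis so no audio is modified.
function planSteps(flags: CleanFlags): Step[] {
  const steps: Step[] = ["transcribe", "analyze"];
  if (flags.preview) return steps;
  steps.push("edit");
  if (flags.polish) steps.push("polish");
  return steps;
}
```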

Gotchas

Transcription accuracy varies with audio quality. Background noise, multiple speakers, and accents reduce accuracy.
Cut detection is a best-effort judgment, not a guarantee. Preview edits before committing (--preview), because automated cuts can remove intentional pauses.
Cloud polish uploads audio to an external service. Confirm the user is okay with cloud processing for sensitive content.

Workflows · 1

01 Clean (Workflows/Clean.md)

Triggers: clean audio, edit audio, remove filler words, clean podcast, remove ums, cut dead air, polish audio

How to Invoke

Say any of these to your DA and PAI activates the AudioEditor skill automatically:

"clean audio"
"edit audio"
"remove filler words"
"clean podcast"
"remove ums"
"cut dead air"
"polish audio"
"trim recording"
"cut stutters"

Or invoke explicitly:

Skill("AudioEditor")
