The Agent-Skill Boundary: When Autonomous Behaviors Become Skills

How to recognize when an agent's autonomous behaviors should be extracted into skills, hooks, or knowledge trees. A layered model for agent design.

You are building an agent. You need to decide: is “always check project config before starting” part of what the agent is, or part of what the agent does?

The question seems simple until you try to answer it. If it’s identity (what the agent is), it belongs in the agent prompt and loads on every invocation. If it’s behavior (what the agent does), it should be extracted into a skill or hook and load only when relevant. Get this wrong and you either bloat your agent prompt with procedural checklist items, or strip out judgment that the agent needs to function correctly.

The boundary is hard to define because the language we use to describe agents obscures the distinction. “The agent always does X” sounds like identity. “The agent has a behavior of doing X” sounds like procedure. Both describe the same capability. Neither tells you where it belongs.

Six months into building an agent, this ambiguity compounds. Your prompt grows from 200 lines of clear routing logic to 700 lines of accumulated “always do X before Y” instructions. The agent spends half its context budget on behaviors that are irrelevant to the current invocation. You cannot tell which instructions are judgment (the agent decides whether to act) and which are procedure (numbered steps that always run the same way).

The problem is not that the agent has too many capabilities. The problem is that you lack a clear boundary between what belongs in the agent (judgment, irreducible) and what belongs outside it (procedure, extractable).

The Accumulation Pattern

Here is how it happens. An agent starts with a clear job: route user requests to the right action. Then edge cases appear:

v1: Route subcommands to the right handler
v2: + "Before running subcommand X, always load reference Y"
v3: + "If the previous run failed, check the failure log first"
v4: + "After completing action Z, run validation W"
v5: + "When launching sub-agents, verify isolation before proceeding"
v6: + "Always check project config before starting any operation"

Each addition is reasonable in isolation. Together, they create an agent that is half router, half checklist runner, and bad at both. The routing logic is buried under procedure. The procedure is invisible - there is no way to observe, test, or disable individual behaviors without reading the entire prompt.

```mermaid
flowchart TB
    subgraph v1["v1: Clean Router (2 steps)"]
        direction TB
        v1a[Parse subcommand] --> v1b[Dispatch to handler]
    end
    subgraph v6["v6: Accumulated Behaviors (7 steps)"]
        direction TB
        v6a["1. Check project config (added v6)"]
        v6b["2. Load failure log (added v3)"]
        v6c["3. Parse subcommand (original)"]
        v6d["4. Load reference files (added v2)"]
        v6e["5. Verify isolation (added v5)"]
        v6f["6. Dispatch to handler (original)"]
        v6g["7. Run validation (added v4)"]
        v6a --> v6b --> v6c --> v6d --> v6e --> v6f --> v6g
    end
    note1["Simple status check: executes all 7 steps, only needs steps 3 + 6"]
    v1 -.->|"grew into"| v6
    v6 --- note1
```

The v6 agent loads every behavior on every invocation. A user who runs a simple status check pays the same context cost as a user who launches a complex multi-agent operation.

The Extraction Test

Not every behavior should be extracted. The test is: does this behavior require judgment, or is it procedure?

Procedure has a fixed trigger and fixed steps. It always runs the same way. It can be described as a numbered list. It does not need to see the rest of the prompt to work correctly.

Judgment requires context, interpretation, and decision-making. It depends on what came before and what might come next. It cannot be reduced to a numbered list because the right action changes based on circumstances.

| Signal | Procedure (extract it) | Judgment (keep it) |
|---|---|---|
| Triggered by a recognizable pattern | + | |
| Same steps every time | + | |
| Could be a script or hook | + | |
| Requires reading runtime state first | | + |
| Different action depending on context | | + |
| Judgment call about whether to act | | + |

The test applies recursively. Inside a judgment call, there may be procedural steps that can be extracted. Inside a procedure, there may be a judgment call that should stay in the agent.

As a decision filter:

```mermaid
flowchart TB
    start["New behavior to add"] --> q1{"Can a pattern trigger it?"}
    q1 -->|Yes| hook["HOOK: deterministic, pre-model"]
    q1 -->|No| q2{"Is it numbered steps?"}
    q2 -->|Yes| skill["SKILL / REFERENCE: procedure, on-demand"]
    q2 -->|No| q3{"Does it require 'it depends'?"}
    q3 -->|Yes| agent["AGENT: judgment, keep it"]
    q3 -->|No| cut["CUT IT: probably redundant"]
```

The fourth outcome - “cut it” - is real. If a behavior is not triggered by a pattern, is not a procedure, and does not require judgment, it is probably restating something that already exists elsewhere in the system.
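
As a minimal sketch, the filter can be written down directly. The three boolean inputs are assumptions about how you would characterize a candidate behavior during review:

```python
from enum import Enum

class Layer(Enum):
    HOOK = "hook"    # deterministic, pre-model
    SKILL = "skill"  # procedure, loaded on demand
    AGENT = "agent"  # judgment, stays in the prompt
    CUT = "cut"      # probably restates something else

def classify(pattern_triggered: bool, fixed_steps: bool, needs_judgment: bool) -> Layer:
    """Apply the decision filter in order: hook, then skill, then agent, then cut."""
    if pattern_triggered:
        return Layer.HOOK
    if fixed_steps:
        return Layer.SKILL
    if needs_judgment:
        return Layer.AGENT
    return Layer.CUT
```

Note that the order of the branches carries the design intent: "Always load reference Y before subcommand X" has both a recognizable trigger and fixed steps, but `classify(True, True, False)` lands on `Layer.HOOK` because the earlier, more deterministic branch wins.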

Extraction creates the problem progressive disclosure solves: Once you extract procedural behaviors into skills and references, you need a way to load them efficiently. Loading everything on every invocation wastes context. Trusting the model to load selectively fails ~15% of the time. Deterministic progressive disclosure solves this - extracted behaviors load only when needed, with infrastructure guarantees that loading happens correctly.

The Layered Model

Once you can distinguish procedure from judgment, the architecture becomes clear. There are four layers, and each has a different enforcement model:

```mermaid
flowchart TB
    subgraph hooks["Hooks (Deterministic)"]
        h1[Pre-invocation triggers]
        h2[Post-execution validation]
    end
    subgraph agent["Agent (Judgment)"]
        a1[Which action?]
        a2[Is the result correct?]
        a3[How to recover?]
    end
    subgraph skills["Skills (Procedure)"]
        s1[Step-by-step execution]
        s2[Defined inputs and outputs]
    end
    subgraph knowledge["Knowledge (On-Demand)"]
        k1[Reference material]
        k2[Examples and templates]
    end
    hooks --> agent --> skills --> knowledge
```

Layer 1: Hooks (deterministic, pre-model)

Hooks fire before the model runs. No judgment involved. A hook matches a pattern and injects content, blocks an action, or validates output. The model never decides whether the hook should run.

```yaml
# UserPromptSubmit hook: if the prompt contains "/saw program",
# inject references/program-flow.md before the model sees it
triggers:
  - match: "^/saw program"
    inject: references/program-flow.md
```

Hooks handle “always do X when you see Y” behaviors. If a behavior has a recognizable trigger and a fixed response, it is a hook, not an agent instruction.
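
A hook like this can be a few lines of script. The sketch below assumes a contract where the runtime pipes the user prompt to stdin and injects whatever the hook writes to stdout; adapt it to your own hook runtime. The trigger table mirrors the YAML above:

```python
import re
import sys
from pathlib import Path

# Trigger table mirroring the YAML config above.
TRIGGERS = [
    (re.compile(r"^/saw program"), Path("references/program-flow.md")),
]

def on_user_prompt_submit(prompt: str) -> str:
    """Return reference text to inject before the model sees the prompt.

    The model never decides whether this runs; the pattern match does.
    """
    for pattern, reference in TRIGGERS:
        if pattern.search(prompt):
            return reference.read_text()
    return ""  # no trigger matched: inject nothing

if __name__ == "__main__":
    sys.stdout.write(on_user_prompt_submit(sys.stdin.read()))
```

Because the hook is plain code, it can be observed, tested, and disabled independently of the prompt, which is exactly what buried "always do X" instructions cannot offer.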

Layer 2: Agent (judgment, irreducible)

The agent decides which skill to invoke, whether the result is correct, and how to recover from failure. This is the irreducible core - the part that cannot be a script or hook because it requires interpretation.

Good agent prompts are short. They define the agent’s role, the decisions it makes, and the tools it can use. They do not contain step-by-step procedures for every possible action.

```
You are the Orchestrator. You route user requests to the appropriate
skill, verify results, and handle failures.

Decisions you make:
- Is this a new operation or a resume of a previous one?
- Which skill handles this subcommand?
- Did the skill produce a valid result?
- How should failures be recovered?
```

Layer 3: Skills (procedure, on-demand)

Skills contain the step-by-step logic for a specific operation. They load only when invoked. They have defined inputs and outputs. They can be tested independently.

The Agent Skills spec formalizes this with progressive disclosure: metadata loads at startup (~100 tokens), the skill body loads on activation (<5000 tokens), and reference files load on demand (varies). A well-structured skill pays only for what the current invocation needs.
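
Under the Agent Skills layout, that budget maps onto a file structure roughly like the following (names illustrative; only the frontmatter is read at startup):

```markdown
---
name: wave-execution
description: Prepare worktrees, launch wave agents, merge and verify results.
---

# Wave Execution

<!-- Everything below the frontmatter loads only when the skill activates. -->

1. Prepare one worktree per agent.
2. Launch agents with isolated contexts.
3. Merge results and verify.

<!-- Reference files load on demand, not with the body: -->
For failure handling, follow references/failure-routing.md.
```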

Layer 4: Knowledge (reference material, loaded by trigger)

Knowledge is the reference material that skills need at specific points. It is not loaded by default. It loads when a trigger matches (via hooks or skill instructions) or when the agent requests it based on runtime state.

The distinction between skills and knowledge: a skill defines what to do. Knowledge provides the information needed to do it. A skill might say “handle the failure according to the routing table.” The routing table is knowledge - a reference file loaded on demand.
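
In code terms, the split might look like this sketch: the skill procedure decides when to consult knowledge, but the knowledge itself lives in a file that loads lazily. The JSON routing file is a hypothetical stand-in for the article's markdown table:

```python
import json
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def routing_table(path: str = "references/failure-routing.json") -> dict:
    """Knowledge: loaded at most once, and only if a failure actually occurs."""
    return json.loads(Path(path).read_text())

def handle_report(status: str) -> str:
    """Skill: knows *when* to consult the routing table, not what it contains."""
    if status == "complete":
        return "proceed"  # happy path: the routing table is never loaded
    return routing_table().get(status, "escalate")
```

On the happy path the knowledge file never touches the context; the skill pays for it only when an agent actually reports a non-complete status.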

A Worked Example

Scout-and-Wave (SAW) is a parallel agent coordination system. Its orchestrator prompt started at 300 lines, grew to 703 lines, and was refactored back to 310 lines by extracting procedure into skills and knowledge.

Here is what each layer handles:

| Layer | SAW Example | Loaded When |
|---|---|---|
| Hook | inject_skill_context loads program-flow.md when it sees /saw program | Every /saw program invocation (deterministic) |
| Agent | Orchestrator decides: new scout, wave resume, or blocked agent retry? | Every invocation (judgment) |
| Skill | Wave loop steps 1-11: prepare worktrees, launch agents, merge, verify | /saw wave invocations only |
| Knowledge | failure-routing.md: E7a retry, E19 failure types, E20 stub scanning | Only when an agent reports non-complete status |

```mermaid
flowchart LR
    subgraph hook_layer["Hook Layer"]
        trigger["UserPromptSubmit hook"]
        trigger -->|"/saw program"| inject["Inject program-flow.md"]
        trigger -->|"/saw amend"| inject2["Inject amend-flow.md"]
        trigger -->|"/saw wave"| nothing["No injection"]
    end
    subgraph agent_layer["Agent Layer"]
        judge["Orchestrator judgment"]
        judge -->|"new feature"| scout["Launch Scout"]
        judge -->|"wave ready"| wave["Execute wave loop"]
        judge -->|"agent failed"| recover["Load failure routing"]
    end
    subgraph skill_layer["Skill Layer"]
        wave_proc["Wave procedure"] --> prep["Prepare worktrees"]
        prep --> launch["Launch agents"] --> merge["Merge and verify"]
    end
    subgraph knowledge_layer["Knowledge Layer"]
        failure["failure-routing.md"]
        program["program-flow.md"]
        amend["amend-flow.md"]
    end
```

Before extraction, the orchestrator prompt contained all of this inline - 703 lines loaded on every invocation. After extraction, the per-invocation cost depends on what the user asked for:

```mermaid
flowchart LR
    subgraph before["Before: Every Invocation"]
        b1["703 lines loaded regardless of subcommand"]
    end
    subgraph after["After: Pay for What You Use"]
        a1["Agent core: 310 lines"] --> a2["+ matching skill: 0-250 lines"]
        a2 --> a3["+ knowledge: 0-55 lines"]
    end
    subgraph examples["Cost by Subcommand"]
        e1["/saw status: 310 lines"]
        e2["/saw wave: 310 lines"]
        e3["/saw program execute: 560 lines"]
        e4["/saw wave + failure: 365 lines"]
    end
```

The core prompt is 310 lines of routing and judgment. The remaining 393 lines load only when needed.
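
The accounting is simple addition over the layers that actually load. The figures below are the SAW numbers from this section; which skill or knowledge file matches a given subcommand is the system's dispatch decision:

```python
CORE = 310  # agent prompt: routing and judgment, always loaded

def invocation_cost(skill_lines: int = 0, knowledge_lines: int = 0) -> int:
    """Lines loaded for one invocation: core plus whatever layers matched."""
    return CORE + skill_lines + knowledge_lines

# /saw status: nothing extra matches
status_cost = invocation_cost()                          # 310
# /saw program execute: the 250-line program skill loads
program_cost = invocation_cost(skill_lines=250)          # 560
# /saw wave with a failed agent: 55 lines of failure routing load
wave_failure_cost = invocation_cost(knowledge_lines=55)  # 365
```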

What Was Extracted

Program commands (250 lines) became references/program-flow.md. These are step-by-step procedures for /saw program plan, /saw program execute, and /saw program status. They have a clear trigger (^/saw program) and fixed steps.

Failure routing (55 lines) became references/failure-routing.md. This is a decision table for handling agent failures: retry logic, failure type classification, post-merge integration. It loads only when an agent reports non-complete status - a runtime condition, not a dispatch-time trigger, so it is loaded by the agent’s judgment rather than a hook.

Amend flow (38 lines) became references/amend-flow.md. Three variants of the amend subcommand, each with orchestrator steps. Clear trigger, fixed procedure.

What Was Cut

108 lines of content were removed entirely:

  • A catalog of CLI commands that were already named at their point of use
  • Example JSON configs that duplicated prose explanations
  • Example YAML that restated the same information as surrounding text
  • Example output blocks that the model generates on its own

These were not skills or knowledge. They were redundant content that inflated the prompt without adding information.

What Stayed

The core prompt kept: invocation routing, pre-flight validation, the wave execution loop, and bootstrap flow. These are judgment-heavy - they require reading IMPL doc state, deciding which agents to launch, evaluating completion reports, and choosing between retry, escalation, or progression. They cannot be extracted because the right action depends on context.

Anti-Patterns

The 700-line prompt

Every possible action described inline. Every reference loaded on every invocation. The agent cannot distinguish what matters for the current request from what exists for other requests.

The fix: extract procedure into skills, knowledge into references, and deterministic routing into hooks. The core prompt should be judgment and routing only.

Invisible routing

“If the user mentions failure, load the failure handling guide.” This instruction competes with hundreds of other instructions for the model’s attention. It might be followed. It might not. There is no way to verify it happened without reading the full conversation.

The fix: if the trigger is recognizable at invocation time, use a hook. If it depends on runtime state, make it an explicit skill invocation rather than a buried instruction.

The “always do X” instruction

“Always check the project configuration before starting.” This runs on every invocation regardless of relevance. A status check does not need project configuration. A wave execution does.

The fix: move conditional prerequisites into the skills that need them. The wave skill checks configuration. The status skill does not. The agent prompt does not mention configuration at all.

Convention-based loading

“When you need reference X, read file Y.” The model decides when it “needs” something. This works until it does not - the model skips the reference, loads it too early, or loads everything at once.

The fix: deterministic progressive disclosure. Hooks load references based on trigger patterns before the model runs. The model receives the reference in context without deciding to load it.

The Extraction Checklist

When reviewing an agent prompt, look for these signals:

  1. “Always” or “before” - “Always check X” or “Before doing Y, run Z.” If X or Z has a recognizable trigger, extract it to a hook. If it only applies to some operations, move it into those operations’ skill definitions.

  2. Numbered steps - Any sequence of 3+ steps that runs the same way every time is a skill. Extract it to a reference file and load it on demand.

  3. Decision tables - “If type is A, do X. If type is B, do Y.” If the table is looked up by a known key, it is knowledge. Extract it to a reference file.

  4. Example blocks - Example outputs, example configs, example YAML. If the model generates its own output, it does not need examples of what that output looks like. Cut them.

  5. Duplicate explanations - Prose that restates what code or config already says. If a field is named webhook_url and the prose says “the URL for the webhook,” the prose adds no information. Cut it.

  6. Runtime-conditional blocks - “If the previous run failed, do X.” These cannot be hooks (hooks fire before execution). They are either skill logic (if the skill handles failure) or agent judgment (if recovery requires interpretation).
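
Several of these signals are mechanically detectable, which makes the review itself partly automatable. A rough first-pass scanner, with illustrative patterns that flag lines for review rather than deciding anything:

```python
import re

# Heuristic patterns for checklist signals 1-3; a match is a review prompt, not a verdict.
SIGNALS = {
    "always/before (hook or skill?)": re.compile(r"^\s*(Always|Before)\b", re.MULTILINE),
    "numbered steps (skill?)": re.compile(r"^\s*\d+\.\s", re.MULTILINE),
    "decision table (knowledge?)": re.compile(r"^\s*If .+?, (do|run|use|load) ", re.MULTILINE),
}

def extraction_signals(prompt: str) -> dict[str, int]:
    """Count occurrences of each extraction signal in an agent prompt."""
    return {name: len(pattern.findall(prompt)) for name, pattern in SIGNALS.items()}
```

High counts do not prove a prompt needs refactoring, but a prompt where "Always" outnumbers decisions is worth reading with the extraction test in hand.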

The Boundary Principle

An agent that is purely a router is a switch statement. You do not need an LLM for that - a hook or a script handles it deterministically. An agent that is purely procedural is a script. You do not need an LLM for that either.

The agent’s value is the judgment layer between routing and procedure: understanding intent, evaluating results, adapting to unexpected states, and deciding what to do when the happy path breaks.

Design the layers so that:

  • Hooks handle what is deterministic
  • Skills handle what is procedural
  • Knowledge loads what is needed
  • The agent handles what requires judgment

The agent’s prompt should be short because most of what it does is delegate. The total system can be arbitrarily complex - but the complexity lives in skills and knowledge, loaded on demand, not in the agent’s core prompt.

```mermaid
flowchart TB
    subgraph before["Before Extraction"]
        direction TB
        b1["Every invocation loads: 703 lines (all behaviors)"]
        b2["Status check: 703 lines"]
        b3["Complex operation: 703 lines"]
        b4["Failure recovery: 703 lines"]
    end
    subgraph after["After Extraction"]
        direction TB
        a1["Agent prompt: 300 lines (always loaded)"]
        a2["Status check: 300 lines only"]
        a3["Complex operation: 300 (agent) + 250 (skill) = 550 lines"]
        a4["Failure recovery: 300 (agent) + 250 (skill) + 55 (knowledge) = 605 lines"]
    end
    note1["Hooks: 11 scripts, never loaded by model (run pre-model)"]
    before -.->|"extraction"| after
    after --- note1
```

The total capability is ~800 lines. The per-invocation cost ranges from 300 (simple status check) to 550 (complex wave with failure). Before extraction, it was 703 lines on every invocation regardless of what the user asked for.

Progressive disclosure is a design principle, not a performance optimization. When the agent sees only what is relevant to the current task, it makes better judgments. When procedure is isolated in skills, it can be tested and versioned independently. When routing is deterministic, it can be observed and debugged without reading the model’s internal reasoning.

The boundary between agent and skill is the boundary between judgment and procedure. Find it, enforce it, and both sides get better.

Quick Reference

Bookmark this section. Come back to it when reviewing an agent prompt.

The four layers:

| Layer | Handles | Enforcement | Loaded |
|---|---|---|---|
| Hooks | Deterministic routing | Infrastructure (pre-model) | Never by model |
| Agent | Judgment and decisions | Prompt (irreducible) | Always |
| Skills | Step-by-step procedure | On-demand (skill activation) | When matched |
| Knowledge | Reference material | On-demand (trigger or request) | When needed |

The decision filter:

  1. Can a pattern trigger it? - Make it a hook
  2. Is it numbered steps? - Make it a skill or reference
  3. Does it require “it depends”? - Keep it in the agent
  4. None of the above? - Cut it

The extraction checklist:

  • “Always” or “before” instructions - hook or move into skill
  • Numbered step sequences (3+) - reference file
  • Decision tables - knowledge reference
  • Example outputs - cut (model generates its own)
  • Duplicate prose - cut
  • Runtime-conditional blocks - skill logic or agent judgment

The test: If you can write it as a numbered list, it belongs in a skill. If it requires “it depends,” it belongs in the agent.