
The Dev-Workflow Conveyor Belt: Multi-Agent Orchestration Inside the Harness

Zeke Brinsfield

Every team that builds seriously with AI eventually hits the same ceiling: the context window. The model that wrote the code is also the model that must review it, validate it, document it, commit it, and open the pull request — all while holding the story requirements, the architectural constraints, the existing codebase patterns, and the results of every prior step in its head at the same time. By the time it gets to the commit message, it has forgotten half of what the story asked for. This is the failure mode that produces AI slop. It is not a model problem. It is a workflow problem.

The prior post in this series made the structural case for the agent harness as the pattern that separates production-grade AI from well-prompted prototypes. The post on agentic security examined what the harness enforces at the boundary. This post examines what happens inside the harness when real work gets done. The worked example is the dev-workflow skill — the skill my Personal AI Infrastructure uses to take a user story from intake to a merged pull request, and the single clearest illustration I have of why multi-agent orchestration beats monolithic execution at every quality axis that matters.

The Context Window Is the Constraint — Decomposition Is the Answer

Large language models have finite context windows. Even with a million tokens, the useful context for a specific decision is a much smaller slice. A code reviewer looking at a diff should not also be holding the full sprint backlog, the conventions document, and the release pipeline configuration — that context crowds out the attention that should be spent on the code in front of it. A programmer implementing a feature should not be holding the PR description template and the commit identity rules while it is trying to write a function. The more a single agent is asked to hold, the less deeply it can reason about any one thing.

The dev-workflow skill treats this as the structural problem it is. It decomposes a development cycle into six discrete stations on a conveyor belt. Each station is staffed by a specialized agent — technical-analyst, sprint-programmer, code-reviewer, release-validator, project-manager, release-validator again — and each agent's context is scoped to exactly what that station needs to do. The upstream agent produces an output. The output becomes the input to the downstream agent. The context window of each agent stays tight. The intelligence stays sharp. And the handoff between stations is the contract that makes the whole thing coherent.
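
A minimal sketch of that decomposition as data, assuming a TypeScript orchestrator. The station names come from the skill itself; the Station type, the input and output labels, and runBelt are illustrative, not the actual implementation:

```typescript
// Illustrative orchestrator sketch: stations as data, run in sequence.
interface Station {
  phase: string;
  agent: string;      // which specialist staffs this station
  inputs: string[];   // the only context this station is allowed to see
  output: string;     // the artifact handed to the next station
}

const belt: Station[] = [
  { phase: "0.5", agent: "technical-analyst", inputs: ["sprint context", "story"], output: "Story Context Block" },
  { phase: "1", agent: "sprint-programmer", inputs: ["Story Context Block", "story", "architecture refs"], output: "implementation + file list" },
  { phase: "2", agent: "code-reviewer", inputs: ["Story Context Block", "implementation"], output: "verdict + issue list" },
  { phase: "4", agent: "release-validator", inputs: ["changed files"], output: "handoff packet" },
  { phase: "5", agent: "project-manager", inputs: ["story ID", "doc paths", "review result"], output: "docs_updated list" },
  { phase: "6", agent: "release-validator", inputs: ["handoff packet", "docs_updated"], output: "commit SHA + branch + PR URL" },
];

// Each station gets a fresh agent; state lives only in the artifact passed along.
async function runBelt(invoke: (s: Station, artifact: unknown) => Promise<unknown>) {
  let artifact: unknown = null;
  for (const station of belt) {
    artifact = await invoke(station, artifact); // upstream output becomes downstream input
  }
  return artifact;
}
```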

This is the human pattern that experienced engineering teams already use. You don't have the same person gather requirements, write code, review it, test it, update the docs, and merge the PR — not because they can't, but because the cognitive switching cost degrades every step. You stage the work. You build in checks. You let each person operate in their zone of competence with a handoff that is clear enough for the next person to pick up cold. The dev-workflow skill is that pattern, codified for AI agents.

Station One: Context Handoff (technical-analyst)

Every story that enters the belt passes through two analysis passes before a single line of code is written. This is the station most teams skip when they first try multi-agent orchestration, and it is the station that eliminates the largest class of downstream bugs.

The first pass — Phase 0.5a — runs once per sprint and asks a simple, structural question: what is actually deployed? It reads package manifests, infrastructure stacks, database migrations, and the active workflows, and produces a concise Sprint Context document that captures the technology fingerprint and source-of-truth locations. Every subsequent agent in the sprint reads against this document. It is the grounded, checked reality that replaces the model's internal guesses about what the project looks like.

The second pass — Phase 0.5b — runs once per story and validates the story's assumptions against that ground truth. It runs a structured checklist across technology stack, database layer, API contracts, frontend patterns, infrastructure constraints, dependency versions, integration points, and the story's own acceptance criteria. Each category returns a verdict: PASS, WARN, or FAIL. If the story claims the project uses Next.js 14 but package.json shows Next.js 15, that is a FAIL. The workflow halts. The operator is presented with a decision — update the story, override with current reality, escalate to the architect, or pause the sprint — before the programmer is ever invoked.
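
A hedged sketch of what one such check might look like, assuming a Node/TypeScript environment; checkFrameworkVersion and its matching rule are hypothetical, invented here to illustrate the PASS/WARN/FAIL shape:

```typescript
import { readFileSync } from "node:fs";

type Verdict = "PASS" | "WARN" | "FAIL";

// Hypothetical Phase 0.5b check: does the story's claimed framework
// version match what package.json actually declares?
function checkFrameworkVersion(storyClaim: string, repoRoot: string): Verdict {
  const pkg = JSON.parse(readFileSync(`${repoRoot}/package.json`, "utf8"));
  const installed: string | undefined = pkg.dependencies?.next;
  if (!installed) return "WARN"; // story assumes Next.js, manifest declares none
  const claimedMajor = storyClaim.match(/\d+/)?.[0];
  const installedMajor = installed.match(/\d+/)?.[0];
  if (claimedMajor && installedMajor && claimedMajor !== installedMajor) {
    return "FAIL"; // e.g. story says Next.js 14, package.json shows ^15.0.0
  }
  return "PASS";
}
```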

The output of this station is a Story Context Block, 300 to 500 tokens, that will be prepended to every downstream agent's prompt. It carries the infrastructure verdicts, the relevant implementation guides, the risks and flags, and concise notes the programmer can act on immediately. The analyst reads 5 to 12 files. The programmer reads the summary. This is what handoff should feel like: the programmer never has to do the discovery the analyst already did, and the analyst's conclusions are the ground truth the reviewer will measure against.
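
The block itself is prose, but its contents map naturally onto a structured shape. A sketch of that shape, with field names inferred from the description above rather than taken from the skill:

```typescript
// Inferred shape of the Story Context Block; field names are illustrative.
interface StoryContextBlock {
  storyId: string;
  verdicts: Record<string, "PASS" | "WARN" | "FAIL">; // one per checklist category
  implementationGuides: string[]; // the guides relevant to this story
  risksAndFlags: string[];        // known uncertainties the reviewer re-checks
  notes: string[];                // concise, immediately actionable guidance
}
```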

Station Two: Build (sprint-programmer)

The programmer receives the Story Context Block, the story document, the sprint document, the architecture references, and the acceptance criteria — and nothing else. The prompt is scoped explicitly: only this story's scope, no unrelated refactoring, no unapproved dependencies. If the context block flagged a WARN item, the programmer addresses it in its approach. If it flagged a FAIL item, the programmer was never invoked in the first place.

The tight scope is deliberate. When a programmer agent has visibility into the full sprint, into other stories' concerns, into the commit pipeline and the PR templates, its attention gets diluted. When the scope is one story, verified against one Story Context Block, the agent's full reasoning capacity is available for the actual coding task. The output is a working implementation and a concise file list. The output is not a polished commit message, not a documentation update, not a branch strategy. Those belong to downstream stations.
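
One way to picture that scoping is as prompt assembly: the programmer's prompt is built from exactly the inputs named above and nothing more. A hypothetical sketch, reusing the StoryContextBlock shape sketched earlier (buildProgrammerPrompt is invented for illustration):

```typescript
// Hypothetical prompt assembly: only this story's inputs, constraints explicit.
function buildProgrammerPrompt(
  ctx: StoryContextBlock,
  story: string,
  sprint: string,
  archRefs: string[],
): string {
  return [
    "## Story Context Block",
    JSON.stringify(ctx, null, 2),
    "## Story",
    story,
    "## Sprint",
    sprint,
    "## Architecture references",
    archRefs.join("\n"),
    "## Constraints",
    "- Implement only this story's scope.",
    "- No unrelated refactoring.",
    "- No unapproved dependencies.",
  ].join("\n\n");
}
```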

Station Three: Check — and Iterate (code-reviewer, loop back to sprint-programmer)

The reviewer receives the same Story Context Block the programmer received — and now the implementation. The framing is explicit: the context block defines the infrastructure ground truth the implementation was built against. If the code contradicts a PASS item, that is CRITICAL. If the block flagged a WARN, the reviewer verifies how the programmer addressed it. The block's Risks and Flags section defines the known uncertainties, and the reviewer confirms each one was handled.

The reviewer runs an infrastructure compatibility checklist — CSP compatibility, external resource allowlisting, static export compliance, routing conventions, CloudFront headers, and a behavioral verification pass against the story's acceptance criteria. Every finding is severity-classified: CRITICAL blocks deployment, HIGH should be fixed before merge, MEDIUM is recommended, LOW is optional. The output is one of two verdicts: APPROVED, or CHANGES REQUESTED with a classified issue list.

If the reviewer returns CHANGES REQUESTED with CRITICAL or HIGH issues, the belt loops. The programmer is re-invoked — this time with the issue list as its input and the explicit constraint that it is fixing, not building new features. The reviewer runs again to confirm the fixes. This loop is bounded: after three failed cycles, the workflow escalates to the operator with the full issue list and a choice — force proceed, abandon, or intervene manually. The loop is not infinite. The quality gate is not optional.
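
As a sketch, the bounded loop might look like this; reviewLoop, fix, and escalate are illustrative names, and the three-cycle bound comes from the workflow described above:

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
interface Review {
  verdict: "APPROVED" | "CHANGES_REQUESTED";
  issues: { severity: Severity; detail: string }[];
}

// Illustrative bounded fix loop: re-invoke the programmer on blocking
// issues only, and escalate to the operator after three failed cycles.
async function reviewLoop(
  review: () => Promise<Review>,
  fix: (issues: Review["issues"]) => Promise<void>,
  escalate: (issues: Review["issues"]) => Promise<void>,
): Promise<Review> {
  const blocking = (r: Review) =>
    r.issues.filter(i => i.severity === "CRITICAL" || i.severity === "HIGH");
  let result = await review();
  for (let cycle = 0; cycle < 3 && blocking(result).length > 0; cycle++) {
    await fix(blocking(result)); // programmer fixes; no new features
    result = await review();     // reviewer confirms the fixes
  }
  if (blocking(result).length > 0) {
    await escalate(result.issues); // operator: force proceed, abandon, or intervene
  }
  return result;
}
```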

This is where the conveyor belt earns its keep. One model writing and reviewing its own code is prone to the same blind spots in both passes — it will approve what it just wrote because the reasoning that produced the code is the reasoning now evaluating it. Two agents, with the same Story Context Block as the shared ground truth but with different roles and different prompts, produce independent judgments. The reviewer is looking for what the programmer missed. The Story Context Block is the spec they both agree on. The disagreement, when it surfaces, is the signal.

Station Four: Validate (release-validator, mode=validate)

Once the reviewer approves, the belt moves to the quality gate. This station does not commit. It does not push. Its job is deterministic machine validation: TypeScript compilation, linting, and — if infrastructure paths changed — a synchronization check to catch stale references to routes that were renamed. On failure, it attempts auto-fix and re-runs. On success, it emits a handoff packet that captures the verdict, the validated files, the lint status, the typecheck status, and any advisory warnings.
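
A minimal sketch of the validate pass, assuming a typical TypeScript repository where tsc and eslint are the relevant tools; the single auto-fix retry mirrors the behavior described above:

```typescript
import { execSync } from "node:child_process";

// Illustrative mode=validate pass: typecheck, lint, one auto-fix attempt.
function validate(): { status: "PASS" | "FAIL" } {
  const run = (cmd: string): boolean => {
    try { execSync(cmd, { stdio: "pipe" }); return true; } catch { return false; }
  };
  let ok = run("npx tsc --noEmit") && run("npx eslint .");
  if (!ok) {
    run("npx eslint . --fix"); // attempt auto-fix, then re-run both checks
    ok = run("npx tsc --noEmit") && run("npx eslint .");
  }
  return { status: ok ? "PASS" : "FAIL" };
}
```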

The handoff packet is the critical design element here. Phase 4 and Phase 6 are separated by a documentation step, which means the release-validator instance that validates is not the instance that commits. The packet is what bridges them — about 150 tokens of state that makes the stateless handoff work. This is what allows the workflow to stay linear and auditable even though each agent instance is fresh. The state lives in the orchestrator. The agents are deliberately ephemeral.
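
A plausible shape for that packet, with field names drawn from the list above (verdict, validated files, lint status, typecheck status, advisories) rather than from the skill's actual schema:

```typescript
// Inferred shape of the ~150-token handoff packet bridging Phase 4 and Phase 6.
interface HandoffPacket {
  verdict: "PASS" | "FAIL";
  validatedFiles: string[];
  lintStatus: "clean" | "warnings";
  typecheckStatus: "clean" | "errors";
  advisories: string[]; // non-blocking warnings the integrator should surface
}
```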

Station Five: Document (project-manager)

Documentation runs before the commit, not after. This is a deliberate ordering. When documentation runs after the commit, it becomes a follow-up task that is easily skipped or deferred, and the repository drifts. When documentation runs before the commit, the documentation changes are included in the same atomic commit as the feature code — one commit per story, with the code and the docs in lockstep.

The project-manager takes the story ID, the branch name, the story and sprint and epic document paths, and the Phase 2 acceptance criteria verification result. It updates the story status to COMPLETE, records the acceptance criteria verification outcome, adds the completion date, moves the story to Done in the sprint document, updates the epic progress, and — when the infrastructure detection pattern triggers — expands scope to update the architecture, infrastructure, deployment, or database docs as appropriate. The output is a list of the files it touched. That list flows into the final station as part of the commit.

The key detail: the project-manager does not re-verify the acceptance criteria. The reviewer already did that in Phase 2. The project-manager records the result. This is the discipline that keeps agents from second-guessing each other's work — each agent's verdict is the record of truth for that phase, and the downstream agents build on it rather than re-running it.
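
Sketching the station's contract makes that discipline concrete; these interfaces are inferred from the two paragraphs above, not lifted from the skill:

```typescript
// Inferred contract for the documentation station: it records results,
// it never re-runs them.
interface ProjectManagerInput {
  storyId: string;
  branch: string;
  docPaths: { story: string; sprint: string; epic: string };
  acceptanceVerification: "VERIFIED" | "FAILED"; // recorded from Phase 2, never re-checked here
}

interface ProjectManagerOutput {
  docsUpdated: string[]; // flows into the Phase 6 commit alongside the code
}
```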

Station Six: Ship (release-validator, mode=integrate)

The final station is a fresh release-validator instance — not the one from Phase 4 — that reads the handoff packet plus the docs_updated list, inspects the staged changes, selects the correct git identity based on the repository's remote, crafts a conventional commit message with the story ID trailer, runs the commit through the pre-commit hook chain (never bypassed), pushes the branch, and — when the branch policy requires it — opens the pull request.
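
A hedged sketch of the shipping step, assuming plain git CLI calls; the remote-matching rule and the identities are placeholders invented here, and the Story-ID trailer name is illustrative, since the post only says the commit carries a story ID trailer:

```typescript
import { execSync } from "node:child_process";

// Illustrative mode=integrate step. The identity rule and trailer name
// are placeholders; the real rules live in the agent definition.
function integrate(storyId: string, summary: string, files: string[]): void {
  const remote = execSync("git remote get-url origin", { encoding: "utf8" }).trim();
  const identity = remote.includes("work-org") // hypothetical matching rule
    ? { name: "Work Identity", email: "dev@work.example" }
    : { name: "Personal Identity", email: "dev@personal.example" };

  execSync(`git add ${files.map(f => `"${f}"`).join(" ")}`);
  // Two -m flags: git joins them with a blank line, so the second message
  // becomes a proper trailer. No --no-verify: the pre-commit hook chain
  // (including gitleaks) always runs.
  execSync(
    `git -c user.name="${identity.name}" -c user.email="${identity.email}" ` +
    `commit -m "feat: ${summary}" -m "Story-ID: ${storyId}"`,
  );
  execSync("git push -u origin HEAD");
}
```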

Every concern that used to belong to a single monolithic agent is now handled by the specialist that owns it. Git identity selection lives in the release-validator because the credential rules are a shipping concern, not a coding concern. PR creation lives here for the same reason. The commit message format is consistent because the same agent writes it every time against the same contract. The gitleaks pre-commit hook is respected because bypass flags are prohibited in the agent definition. The result on success is a commit SHA, a branch, and a PR URL — reported back to the user with enough context to take the next action.

Why This Beats One Model Doing Everything

The quality outcomes from this pattern are measurable and consistent. Bug density drops because the reviewer is evaluating against the same Story Context Block the programmer was given, which means mismatches surface as CRITICAL issues rather than latent defects. Slop drops because the tight scope at each station means no agent is asked to produce output for a role it was not prompted for — the programmer isn't writing PR descriptions, the reviewer isn't generating commit messages, the validator isn't hallucinating doc updates. Commit hygiene improves because the integration station is specialized for it and has no other job. Documentation stays in sync because it is committed with the code that produced it, atomically, by design.

More importantly, the context window of each agent stays tight. No single agent is holding the full arc of the work. The analyst holds infrastructure state. The programmer holds a story. The reviewer holds an implementation and a spec. The validator holds a diff. The project-manager holds a set of document paths. The integrator holds a commit intent. The intelligence available for each decision is concentrated, not diluted. And when the task changes — swap in a larger model, adjust a prompt, add a new check — the affected station is modified in isolation. The belt keeps running.

This is what the harness enables. The harness is the deterministic scaffolding; the dev-workflow skill is the canonical workflow that runs inside it. The belt, the stations, the handoffs, the iteration loops, the severity classifications, the atomic commits — all of these are structural properties that the harness makes enforceable. None of them can be achieved by a single agent with a clever prompt, because the constraint that matters — scoped context per decision, with a verified handoff between decisions — is not a prompt-level concern. It is an architectural one.

Agentic systems that scale to real engineering work look like this. They don't look like one model holding everything in its head. They look like a team of specialists, each with a narrow job, each producing a checked output, each handing off to the next station with enough context to pick up cold. That is how quality ships. That is how the context window stops being a ceiling and starts being a feature. And that is the pattern every industry that wants to deploy AI against real work needs to internalize before it writes its first autonomous workflow.

Zeke Brinsfield

Founder & Principal AI Architect

Zeke Brinsfield is the CEO and Founder of BrinsCorp Evolution, combining deep real estate expertise with a passion for technology-driven innovation. He leads the company's strategic vision and product development.
