Bespoke Agentic Pipelines

Abstract

The software industry is converging on a vision of AI-assisted development that assumes generality: one agent, one system, one factory that can handle any codebase, any workflow, and any engineering challenge. This paper argues that vision is wrong. Drawing on work across more than 30 client engagements at LoopQA, we present evidence that the highest-leverage use of agentic AI in software engineering is not the generalized system. It is the bespoke pipeline — a purpose-built chain of AI-supported stages designed around the specific codebase, product, process, and constraints of a single project.

While certain structural elements of agentic pipelines are reusable — agent invocation layers, role-based permissions, manifest-driven decomposition, review loops, and verification stages — the most important parts of every pipeline we have built are the parts that are specific to the project. The prompts encode project-specific conventions. The verification logic encodes project-specific quality standards. The decomposition strategy encodes project-specific architectural boundaries. The review criteria encode project-specific anti-patterns learned from prior failures. These elements cannot be generalized without destroying the properties that make them effective.

Our findings suggest that bespoke pipelines produce exponentially better outcomes than generalized agent systems. In environments where the pipeline was designed around the specific codebase and its challenges, we observed up to a 17x increase in automation output and more than a 20x improvement in engineering leverage relative to unstructured AI usage. Those gains did not come from better models. They came from better pipeline design — design that encoded the knowledge, standards, and failure modes of the specific project the pipeline was built to serve.

This paper presents a methodology for designing bespoke agentic pipelines. We describe the structural components that are genuinely reusable, the components that must be customized, the design decisions that determine pipeline effectiveness, and the organizational conditions required to make bespoke pipeline development practical. We argue that the pursuit of the god factory is a distraction, and that the real opportunity is in building pipelines that know your codebase as well as your best engineer does.

1. Introduction

There is a gold rush happening in AI-assisted software development, and most of it is pointed in the wrong direction.

The dominant narrative is convergent. Build one system. Make it general. Give it tools, memory, and enough autonomy, and it will handle any software engineering task in any codebase. The vision is compelling: a god factory — a single agentic system so capable and so flexible that it becomes the universal solution to software development automation. Companies are raising hundreds of millions of dollars to build these systems. The pitch is always the same: our agent can do everything.

Over the last two years, LoopQA has worked with more than 30 client teams using AI agents for software development, test automation, and quality engineering in production environments. Those engagements covered a wide range of codebases, tech stacks, team structures, organizational constraints, and engineering cultures. Some teams built their own pipelines. Some used generalized agent products. Some tried both.

The results were not ambiguous. Teams that built bespoke pipelines — purpose-built agentic workflows designed around their specific codebase, conventions, failure modes, and quality standards — consistently outperformed teams that relied on generalized agent systems. The margin was not small. In our highest-performing engagements, bespoke pipelines produced 10-20x more useful output than generalized approaches applied to the same codebase with the same model and the same level of AI access.

That observation is the basis for this paper. We do not argue that generalized agents are useless. We argue that they are dramatically less effective than bespoke pipelines for the specific task of producing real, mergeable, production-quality engineering work. We argue that the god factory model fails for fundamental reasons that better models will not fix. And we argue that the highest-leverage investment an engineering team can make in AI is not adopting a general-purpose agent product. It is building a pipeline that knows their codebase.

The paper has three aims. The first is to explain why generalized agent systems systematically underperform bespoke pipelines. The second is to present a practical framework for designing and building bespoke pipelines that encode project-specific knowledge. The third is to identify which pipeline components are genuinely reusable and which must be customized — giving teams a realistic starting point for building their own.

2. The God Factory Problem

The idea of a god factory is not new. Manufacturing has wrestled with the tension between generalization and specialization for more than a century. The analogy is instructive.

2.1 Ford Does Not Build All Cars in One Factory

Ford Motor Company operates more than 50 manufacturing plants worldwide. Each plant is purpose-built. The Dearborn Truck Plant builds F-150s. The Flat Rock Assembly Plant builds Mustangs. The Ohio Assembly Plant builds Super Duty trucks. These plants share underlying principles — assembly line flow, quality checkpoints, standardized tooling interfaces — but the specific configuration of each plant is determined by what it builds.

Ford does not build all of its vehicles in one factory. The reason is not that Ford lacks the engineering talent to design a universal factory. The reason is that a universal factory would be worse at building F-150s than a factory designed to build F-150s, and worse at building Mustangs than a factory designed to build Mustangs. The specialization is not a limitation. It is the source of quality, efficiency, and throughput.

The same principle applies to agentic software development pipelines. A pipeline designed for a specific codebase — one that encodes the project's routing model, auth strategy, data layer, testing conventions, deployment patterns, third-party dependencies, and known failure modes — will always outperform a generalized agent system that must discover those things from scratch on every task.

This is not a controversial claim in manufacturing. It should not be controversial in software engineering either.

2.2 Why Generalized Agent Systems Underperform

We have observed generalized agent systems fail in predictable, repeating patterns across client engagements. The failures fall into several categories.

Context starvation. A generalized agent starts every task with little or no project-specific knowledge. It must read files, infer conventions, guess at patterns, and hope that its assumptions are correct. This is fundamentally wasteful. The information about how the project works already exists — in the codebase, in the documentation, in the heads of the engineers. A bespoke pipeline encodes that knowledge directly. A generalized agent rediscovers it, imperfectly, every time.

Convention drift. When a generalized agent writes code, it writes code that looks plausible. It follows common patterns from its training data. But "common patterns" are not the same as "this project's patterns." A project may use a specific Page Object Model convention. It may structure its API routes in a non-standard way. It may have a custom authentication flow, a particular naming scheme for test files, a specific approach to seed data. A generalized agent will not know these things. It will produce code that works but does not match the project's conventions. Over time, this convention drift degrades the codebase.

Verification weakness. Generalized agent systems typically rely on the agent itself to determine whether its output is correct. This is structurally unsound. An agent that wrote the code is the worst judge of whether the code is correct. It has the same blind spots on review that it had during generation. Bespoke pipelines solve this by separating the executor from the verifier and by implementing project-specific verification logic that encodes what "correct" means for this particular codebase.

Anti-pattern blindness. Every mature codebase has anti-patterns — things that have been tried, failed, and learned from. A generalized agent does not know about these failures. It will cheerfully reproduce patterns that the team has already discovered do not work. A bespoke pipeline encodes these lessons as explicit constraints: "Do not create separate handler files for each route. Use one generic handler with a config map." "Do not use CSS selectors for test locators. Use data-testid attributes." "Do not mock the database in integration tests — the mock diverged from production last quarter and we missed a migration bug."

Decomposition mismatch. Generalized agents decompose work based on general heuristics. They do not know the architectural boundaries of your system. They do not know which modules are tightly coupled. They do not know which areas are high-risk. They do not know that the payments service and the notification service share a data model that must be changed atomically. A bespoke pipeline encodes this decomposition knowledge explicitly, through manifests and dependency graphs that reflect the actual structure of the system.

Role confusion. A generalized agent is everything at once: planner, implementer, reviewer, debugger. This violates the principle of separation of concerns. A planner that can also execute will start implementing before the plan is reviewed. A reviewer that can also edit will fix problems instead of reporting them. A bespoke pipeline assigns distinct roles with distinct permissions and distinct instructions, preventing the kind of role confusion that degrades output quality.

These failures are not model failures. They are architectural failures. Better models will produce more fluent code, more sophisticated reasoning, and more accurate pattern matching. But they will not solve the fundamental problem of generalization: a system that knows nothing specific about your project cannot perform as well as a system that knows everything specific about your project.

2.3 The Spectrum of Pipeline Specificity

Not all project-specific knowledge is equally important. In our experience, pipeline components fall along a spectrum from fully reusable to fully bespoke.

Table 1. Pipeline component specificity spectrum

| Component | Reusability | Why |
| --- | --- | --- |
| Agent invocation layer | Fully reusable | Spawning a subprocess is generic infrastructure |
| Worktree isolation | Fully reusable | Git worktrees work the same everywhere |
| Session artifact management | Mostly reusable | Directory layout may vary, but the pattern is universal |
| Parallel execution engine | Mostly reusable | Concurrency control is generic; what to parallelize is project-specific |
| Manifest schema | Partially reusable | Structure is reusable; decomposition logic is project-specific |
| Role definitions | Partially reusable | Role names transfer; tool permissions need tuning per project |
| Review loop structure | Partially reusable | Loop mechanics are reusable; review criteria are project-specific |
| Health check suite | Partially reusable | "Run type check, run tests" is universal; which tests and what thresholds are project-specific |
| Verification logic | Mostly bespoke | What "correct" means is defined by the project's standards |
| Prompt templates | Mostly bespoke | Must encode project-specific conventions, anti-patterns, and rules |
| Decomposition strategy | Mostly bespoke | Depends on architectural boundaries and risk profile |
| External system integration | Fully bespoke | Ticket systems, dashboards, and CI configurations are unique |

This table is not a suggestion to build everything from scratch. It is a map. The top of the table is the starting kit. The bottom is where the work is. Teams that focus all their energy on the reusable infrastructure and neglect the bespoke components end up with a pipeline that runs smoothly but produces mediocre output. Teams that focus on encoding project-specific knowledge into prompts, verification logic, and decomposition strategy end up with a pipeline that produces work their best engineers would actually merge.

3. The Anatomy of a Bespoke Pipeline

A bespoke pipeline is not a monolithic program. It is a composition of structural components, each of which can be configured, extended, or replaced depending on the project's requirements. This section describes the components that make up a pipeline and the design decisions that determine how each component is customized.

3.1 Structural Overview

Every pipeline we have built follows a similar high-level shape:

INTAKE → PLAN → RESEARCH → EXECUTE → VERIFY → REVIEW → REVISE → SHIP

Within this shape, the specifics vary enormously. A TDD pipeline iterates through RED-GREEN-REFACTOR cycles in batches. A maintenance pipeline pulls a commit, classifies its impact, and updates affected tests. An audit pipeline scans a test suite for flakiness, duplication, and weak assertions. A feature pipeline researches a product requirement, generates coverage across multiple test layers, and prepares a structured pull request.

The shape is not the pipeline. The pipeline is what happens inside each stage — the prompts, the verification logic, the decomposition rules, the role permissions, and the project-specific constraints that determine what "good output" looks like.

Figure 1. Pipeline shape with bespoke customization points

flowchart LR
    A[Intake] --> B[Plan]
    B --> C[Research]
    C --> D[Execute]
    D --> E[Verify]
    E --> F[Review]
    F --> G{Pass?}
    G -- Yes --> H[Ship]
    G -- No --> I[Revise]
    I --> D

    style A fill:#2d3748,color:#fff
    style B fill:#2d3748,color:#fff
    style C fill:#2d3748,color:#fff
    style D fill:#4a5568,color:#fff
    style E fill:#4a5568,color:#fff
    style F fill:#4a5568,color:#fff
    style G fill:#1a202c,color:#fff
    style H fill:#2d3748,color:#fff
    style I fill:#4a5568,color:#fff
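
In code, this shape reduces to a small driver loop. A minimal sketch, assuming each stage is an async function that receives and returns a shared context object; the StageContext fields and the three-round bound are illustrative, not prescribed:

interface StageContext {
  sessionDir: string;
  feature: string;
  feedback?: string; // set by the review stage when revision is needed
}

type Stage = (ctx: StageContext) => Promise<StageContext>;

async function runPipeline(
  stages: { linear: Stage[]; verify: Stage; review: Stage; revise: Stage; ship: Stage },
  ctx: StageContext,
  maxRounds = 3,
): Promise<void> {
  // Intake, plan, research, and execute run once, in order
  for (const stage of stages.linear) ctx = await stage(ctx);

  // Verify -> review -> revise loops until the reviewer stops emitting feedback
  for (let round = 0; round < maxRounds; round++) {
    ctx = await stages.verify(ctx);
    ctx = await stages.review(ctx);
    if (!ctx.feedback) break;       // approved: exit the revision loop
    ctx = await stages.revise(ctx); // forward the review output to a fix agent
  }
  if (ctx.feedback) throw new Error("Review did not converge; escalate to a human");

  await stages.ship(ctx);
}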

3.2 The Agent Invocation Layer

This is the most fundamental component. It spawns a coding agent as a subprocess, passes it a prompt and a set of allowed tools, waits for completion, and handles timeouts and failures. This component is fully reusable across projects.

import { spawn } from "node:child_process";

function runClaude(
  prompt: string,
  allowedTools: string[],
  timeoutMs = 15 * 60 * 1000,
): Promise<void> {
  return new Promise((resolve, reject) => {
    const child = spawn("claude", [
      "-p", prompt,
      "--allowedTools", allowedTools.join(" "),
    ], {
      stdio: ["ignore", "inherit", "inherit"],
      env: { ...process.env },
    });

    let timedOut = false;
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGTERM");
      setTimeout(() => child.kill("SIGKILL"), 5_000);
    }, timeoutMs);

    // Spawn failures (e.g. the CLI is not on PATH) emit "error", not "close"
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });

    child.on("close", (code) => {
      clearTimeout(timer);
      if (timedOut) {
        reject(new Error(`Agent timed out after ${timeoutMs}ms`));
      } else if (code === 0) {
        resolve();
      } else {
        reject(new Error(`Agent exited with code ${code}`));
      }
    });
  });
}

The invocation layer is infrastructure. It does not determine the quality of the pipeline's output. What determines quality is what you pass into the prompt parameter and what tools you list in allowedTools. Those decisions are where the bespoke design begins.
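
In practice the invocation layer also carries retry handling; the maturity checklist in Appendix C assumes timeouts and retries live here. A minimal sketch wrapping runClaude, with the attempt count and linear backoff as illustrative defaults:

async function runClaudeWithRetry(
  prompt: string,
  allowedTools: string[],
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await runClaude(prompt, allowedTools);
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err; // out of retries: surface the failure
      await new Promise((r) => setTimeout(r, attempt * 10_000)); // linear backoff
    }
  }
}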

3.3 Role-Based Tool Permissions

Different stages of a pipeline need different capabilities. A planner should not be editing files. A reviewer should not be writing production code. A verifier should not be modifying tests. Role-based tool restrictions enforce this structurally.

type AgentRole = "planner" | "researcher" | "executor" | "reviewer" | "verifier";

const ROLE_TOOLS: Record<AgentRole, string[]> = {
  planner:    ["Read", "Glob", "Grep", "Write", "Bash"],
  researcher: ["Read", "Glob", "Grep", "Write", "Bash"],
  executor:   ["Read", "Glob", "Grep", "Write", "Edit", "Bash"],
  reviewer:   ["Read", "Glob", "Grep", "Write", "Bash"],
  verifier:   ["Read", "Glob", "Grep", "Bash"],
};

The role names and tool mappings are partially reusable. Most projects need something like a planner, an executor, and a reviewer. But the specific tool sets need adjustment. In one project, the reviewer may need Edit access because the review workflow includes inline annotations. In another, the researcher may need Bash access to query a database. In a third, the planner may be restricted to Read-only access because the team wants a plan they can inspect without worrying about premature changes.

These are design decisions, not implementation details. They encode the team's philosophy about how work should be decomposed and controlled.
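
Wiring the roles into the invocation layer is then a single line per stage. A hedged sketch; runStage is a hypothetical helper, not part of the code above:

// Pair each stage's prompt with its role's tool set. The reviewer can write
// its report file but has no Edit access, so it structurally cannot "fix"
// the code it is reviewing.
async function runStage(role: AgentRole, prompt: string): Promise<void> {
  await runClaude(prompt, ROLE_TOOLS[role]);
}

// e.g. await runStage("reviewer", reviewPrompt);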

3.4 Worktree Isolation

Every pipeline run should happen in its own git worktree. This keeps the pipeline's work isolated from whatever else is happening in the main repo and from other concurrent pipeline runs.

import { execSync } from "node:child_process";
import { existsSync, symlinkSync } from "node:fs";
import { basename, join, resolve } from "node:path";

function createWorktree(opts: {
  slug: string;
  baseBranch: string;
}): { branchName: string; worktreePath: string } {
  const timestamp = new Date().toISOString().slice(0, 10);
  const branchName = `pipeline/${timestamp}-${opts.slug}`;
  const repoDir = resolve(".");
  const worktreePath = resolve(
    repoDir, "..", `${basename(repoDir)}--pipeline-${opts.slug}`
  );

  execSync(
    `git worktree add -b "${branchName}" "${worktreePath}" "${opts.baseBranch}"`,
    { stdio: "inherit" },
  );

  // Symlink shared resources (node_modules, env files) to avoid redundant installs
  for (const f of ["node_modules", ".env", ".env.local"]) {
    const src = join(repoDir, f);
    const dst = join(worktreePath, f);
    if (existsSync(src) && !existsSync(dst)) {
      symlinkSync(src, dst);
    }
  }

  return { branchName, worktreePath };
}

Worktree isolation is fully reusable infrastructure. The only project-specific decisions are the branch naming convention and the shared resource symlink policy.
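
Cleanup is the mirror image. A minimal sketch; whether to delete the branch after merge is a project decision:

function removeWorktree(worktreePath: string, branchName?: string): void {
  // --force discards any uncommitted state left behind by a failed run
  execSync(`git worktree remove --force "${worktreePath}"`, { stdio: "inherit" });
  if (branchName) {
    execSync(`git branch -D "${branchName}"`, { stdio: "inherit" });
  }
}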

3.5 Session Artifacts

Every pipeline run produces a structured artifact directory. This directory holds the manifest, prompts, research notes, review results, and any other intermediate output. Without artifacts, a pipeline run is a black box.

.dev-loop/
  current -> 2026-03-07T01-40-32
  2026-03-07T01-40-32/
    manifest.json
    planner-prompt.md
    review-round-1.md
    replan-1.md
    steps/
      step-1-research.md
      step-1-result.md
      step-1-review-1.md

The layout is reusable. What goes into each file is project-specific.
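
Creating a session and repointing the current symlink takes only a few lines. A sketch of the pattern, assuming the .dev-loop layout shown above:

import { mkdirSync, rmSync, symlinkSync } from "node:fs";
import { join } from "node:path";

function createSession(root = ".dev-loop"): string {
  const id = new Date().toISOString().slice(0, 19).replace(/[:.]/g, "-");
  const sessionDir = join(root, id);
  mkdirSync(join(sessionDir, "steps"), { recursive: true });

  // Repoint `current` so tooling always finds the latest run
  const currentLink = join(root, "current");
  rmSync(currentLink, { force: true });
  symlinkSync(id, currentLink);

  return sessionDir;
}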

3.6 The Manifest

The manifest is the pipeline's memory. It describes the planned work, tracks what has been done, and tells the orchestrator what to do next.

interface ManifestStep {
  id: string;
  title: string;
  context: string;
  status: "pending" | "in_progress" | "done" | "skipped";
  dependsOn?: string[];
  testFiles?: string[];
}

interface Manifest {
  feature: string;
  steps: ManifestStep[];
  nextId: number;
}

The manifest schema is partially reusable. Every project needs steps with statuses and dependencies. But the granularity of steps, the dependency structure, and the decomposition logic are entirely project-specific. In a monolithic frontend application, each step might correspond to a React component and its associated tests. In a microservice architecture, each step might correspond to a service boundary. In a data pipeline, each step might correspond to a transformation stage. The manifest reflects the architecture of the system, not the architecture of the pipeline.
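
The orchestrator's scheduling question (which steps can run now?) falls directly out of this schema. A sketch under the dependency semantics described above; everything it returns can be handed to the parallel execution engine at once:

// A step is runnable when it is pending and every dependency is done or skipped
function nextRunnableSteps(manifest: Manifest): ManifestStep[] {
  const finished = new Set(
    manifest.steps
      .filter((s) => s.status === "done" || s.status === "skipped")
      .map((s) => s.id),
  );
  return manifest.steps.filter(
    (s) =>
      s.status === "pending" &&
      (s.dependsOn ?? []).every((dep) => finished.has(dep)),
  );
}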

3.7 Pipeline-Owned Verification

This is the single most important design principle. The pipeline itself must verify the agent's work. You do not trust the agent to tell you the tests pass. You run the tests yourself and check the exit code.

import { execSync } from "node:child_process";

function runTestExpectingPass(testFile: string): { passed: boolean } {
  try {
    // execSync throws on a non-zero exit code, so a throw means the test failed
    execSync(`npx vitest run "${testFile}" 2>&1`, {
      encoding: "utf-8",
      timeout: 60_000,
    });
    return { passed: true };
  } catch {
    return { passed: false };
  }
}

This is where bespoke design matters most. The verification logic defines what "correct" means. For a TDD pipeline, it means the test fails in the RED phase and passes in the GREEN phase. For a maintenance pipeline, it means the full suite passes after the change. For an audit pipeline, it means the report accurately reflects the state of the codebase. For a migration pipeline, it means the type checker is clean and the application boots.
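
The RED-phase counterpart inverts the expectation and, just as importantly, checks that the test failed for the right reason: a failing assertion, not a broken import. A sketch; the error-pattern list is illustrative, not exhaustive:

function runTestExpectingFail(testFile: string): { ok: boolean; reason?: string } {
  try {
    execSync(`npx vitest run "${testFile}" 2>&1`, {
      encoding: "utf-8",
      timeout: 60_000,
    });
    return { ok: false, reason: "Test passed; RED phase requires a failing test" };
  } catch (err) {
    const output = String((err as { stdout?: string }).stdout ?? "");
    // A load failure means the test never ran its assertion
    if (/Cannot find module|SyntaxError|ReferenceError/.test(output)) {
      return { ok: false, reason: "Test failed to load, not failed its assertion" };
    }
    return { ok: true };
  }
}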

Generic verification — "did the tests pass?" — is necessary but not sufficient. Bespoke verification adds project-specific checks: stub test detection, duplication scanning, convention compliance, and structural validation. These checks encode the project's standards in a way that a generalized agent cannot.

import { readFileSync } from "node:fs";

function detectStubTests(testFile: string): string[] {
  const content = readFileSync(testFile, "utf-8");
  const violations: string[] = [];

  if (/expect\(typeof\s+\w+\)\.toBe\(["']function["']\)/.test(content)) {
    violations.push("typeof/function assertion — tests existence, not behavior");
  }

  if (/expect\([^)]+\)\.toBeDefined\(\)/.test(content)) {
    violations.push("toBeDefined assertion — trivially passes for any export");
  }

  return violations;
}

These detection rules are bespoke. They come from real failures on real projects. One team discovered that the agent was writing existence-only tests that passed trivially. Another discovered that the agent was duplicating Page Object Models. Another found that the agent was using raw CSS selectors instead of the project's data-testid convention. Each discovery became a verification rule in the pipeline.

3.8 Prompt Construction: Where Bespoke Design Lives

The prompts are the pipeline's most important bespoke component. If the structural components are the chassis, the prompts are the engine. A pipeline with good infrastructure and bad prompts produces bad output reliably. A pipeline with rough infrastructure and excellent prompts produces good output inconsistently.

A bespoke prompt is not a one-paragraph instruction. It is a structured document assembled from multiple layers:

  1. Role declaration. What is this agent? What is its scope? What should it not do?
  2. Feature context. What is the overall task? What is this specific step?
  3. Manifest state. What has been done? What is pending? This prevents re-work.
  4. Prior step summaries. What did earlier steps produce? What patterns were established?
  5. Project-specific directives. Naming conventions, anti-patterns, framework rules.
  6. Concrete output instructions. Where to write, what format, what file path.

function buildExecutePrompt(
  step: ManifestStep,
  manifest: Manifest,
  sessionDir: string,
): string {
  const priorWork = summarizeCompletedSteps(sessionDir, manifest);
  const projectRules = loadProjectDirectives();

  return `You are a TDD executor working on the ${manifest.feature} feature.
You implement ONE behavior using red-green-refactor.

## Current step
**Step ${step.id}: ${step.title}**
${step.context}

## Completed steps
${priorWork}

## Full manifest
${formatManifest(manifest)}

${projectRules}

## Your task
1. Write a failing test that asserts the behavior described above.
2. Run the test. Confirm it fails for the RIGHT reason.
3. Write the minimum implementation to make the test pass.
4. Run the test. Confirm it passes.
5. Run the full test suite. Confirm no regressions.
6. Commit your changes with a descriptive message.

Do NOT implement anything beyond what this single step requires.
Do NOT modify any file that is not directly related to this step.
Do NOT refactor code you were not asked to refactor.`;
}

The function loadProjectDirectives() is where the deepest bespoke work lives. This function loads the project's conventions, anti-patterns, and standards and injects them into every prompt. The content of those directives is what separates a pipeline that produces mergeable code from a pipeline that produces plausible code.
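
A minimal sketch of what loadProjectDirectives() might do: concatenate a directory of directive files so every prompt carries the full set. The directives/ path and the .md layout are illustrative:

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Files might include conventions.md, anti-patterns.md, boundaries.md
function loadProjectDirectives(dir = "directives"): string {
  return readdirSync(dir)
    .filter((f) => f.endsWith(".md"))
    .sort() // deterministic order keeps prompts stable across runs
    .map((f) => readFileSync(join(dir, f), "utf-8"))
    .join("\n\n");
}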

3.9 Project-Specific Directives

Project directives are the encoded knowledge of the team. They represent everything the team has learned about how to write good code for this specific project. A few illustrative examples:

Convention directives:

- Functions in convex/ use camelCase. Test files use describe/it blocks.
- Component tests mock useMutation from convex/react.
- API route handlers follow the pattern: validate → authorize → execute → respond.
- Seed data helpers live in tests/seeds/ and export factory functions.
- Page Object Models live in tests/pages/ and use the pattern: one class per page,
  methods return Promises, selectors use data-testid attributes.

Anti-pattern directives:

FAILURE: The agent created 6 separate handler files that were 80% identical.
CORRECT: One generic handler with 6 configuration objects.

FAILURE: The agent used getByText("Submit") which breaks under i18n.
CORRECT: Use getByTestId("submit-button") for interactive elements.

FAILURE: The agent mocked the database in integration tests. The mock passed
but the production migration failed because the mock schema diverged.
CORRECT: Integration tests must hit a real database, not mocks.

Boundary directives:

- Do NOT create new utility files. If you need a utility, add it to lib/utils.ts.
- Do NOT add new dependencies without explicit approval in the manifest.
- Do NOT modify the auth middleware. It is maintained by the security team.
- Do NOT use any selector strategy other than data-testid for E2E tests.

These directives are the most valuable part of the pipeline. They are also the hardest to write, because they require deep knowledge of the project. A team cannot shortcut this work by using a generalized directive set. The directives must come from the project's actual history of successes and failures.

This is one of the fundamental reasons bespoke pipelines outperform generalized agents. The generalized agent has none of this knowledge. It must infer conventions from the codebase (imperfectly), guess at anti-patterns (incorrectly), and discover boundaries (by violating them). The bespoke pipeline starts with all of this knowledge and enforces it structurally.

4. Why Bespoke Pipelines Produce Exponential Results

The performance gap between bespoke pipelines and generalized agent systems is not linear. It is exponential. This section explains why.

4.1 Compounding Knowledge

Every bespoke pipeline run teaches the team something. A failed output reveals a missing directive. A weak test reveals a missing verification rule. A convention violation reveals a missing anti-pattern example. Each lesson is encoded into the pipeline — into the prompts, the verification logic, or the decomposition strategy.

This means the pipeline gets better over time. Not gradually better. Compoundingly better. Each improvement prevents a class of future failures. After 10 pipeline runs, the directive set is noticeably stronger. After 50 runs, the pipeline encodes dozens of lessons that a generalized agent would need to rediscover on every single task. After 100 runs, the pipeline knows the codebase's failure modes, conventions, and architectural boundaries with a depth that no generalized system can match.

A generalized agent, by contrast, starts from zero every time. It may retain some context within a session, but it does not accumulate project-specific knowledge across sessions in the way that a bespoke pipeline does. The pipeline's directives, verification rules, and manifest templates are persistent. They compound.

4.2 Eliminated Rediscovery

A generalized agent must spend significant time at the beginning of every task reading the codebase, inferring conventions, and building a mental model of the system. This rediscovery work is pure overhead. It adds time, consumes context window, and introduces error.

A bespoke pipeline eliminates this overhead entirely. The conventions are injected into the prompt. The architectural boundaries are encoded in the manifest template. The anti-patterns are listed explicitly. The agent starts working immediately, with full context, instead of spending the first 20% of its context window figuring out how the project works.

In our measurements, this rediscovery overhead accounts for 15-30% of a generalized agent's total work on any given task. For a bespoke pipeline, it is effectively zero. Over hundreds of tasks, this difference compounds into an enormous productivity gap.

4.3 Higher First-Pass Quality

Because a bespoke pipeline's prompts encode project-specific conventions and anti-patterns, the first-pass output is much closer to the project's standard. This means fewer review rounds, fewer revision cycles, and less human intervention.

In our client engagements, generalized agent systems typically require 2-4 rounds of human feedback before producing mergeable output. Bespoke pipelines typically require 0-1 rounds. The difference is not that bespoke pipelines produce perfect output. It is that they produce output that is already aligned with the project's conventions, so the corrections are minor rather than structural.

4.4 Parallelism Across Informed Agents

When a bespoke pipeline runs multiple agents in parallel, each agent inherits the full set of project-specific directives. This means parallel work is consistent. Five agents working on five independent tasks will all follow the same conventions, avoid the same anti-patterns, and produce output that is stylistically coherent.

When generalized agents run in parallel, each agent independently discovers (or fails to discover) the project's conventions. The result is inconsistent output that requires significant post-hoc harmonization. In our experience, the harmonization cost often negates the throughput gain from parallelism.

4.5 Structural Verification

A bespoke pipeline's verification logic is not generic "did the tests pass?" It is project-specific "does the output meet our standards?" This means the pipeline catches problems that a generalized agent would miss entirely: stub tests, convention violations, duplication, missing test IDs, incorrect seed data patterns, and improper selector strategies.

Each verification rule eliminates a class of defects from the pipeline's output. Over time, the verification layer becomes a comprehensive quality gate that enforces the team's standards mechanically. This is qualitatively different from relying on an agent to self-assess its own output.

4.6 The Compounding Formula

These five effects — compounding knowledge, eliminated rediscovery, higher first-pass quality, informed parallelism, and structural verification — do not simply add together. They multiply.

A pipeline with 50 encoded anti-patterns (compounding knowledge) produces output that requires minimal revision (higher first-pass quality) and can be verified automatically (structural verification), enabling parallel execution across multiple instances (informed parallelism) without rediscovery overhead (eliminated rediscovery). Each mechanism makes the others more effective.

This is why the performance curve bends exponentially rather than linearly. It is not one optimization. It is a system of reinforcing optimizations whose interaction produces more than any individual optimization would produce alone.

5. Designing a Bespoke Pipeline

This section provides a practical framework for designing a bespoke pipeline. It does not prescribe a specific pipeline. It describes the decisions that must be made and the knowledge that must be gathered to design one.

5.1 Start with the Codebase Audit

Before designing a pipeline, you must understand the codebase it will operate on. This audit covers several dimensions:

Architecture. What is the overall structure? Monolith, monorepo, microservices? What frameworks are in use? Where are the module boundaries? Which modules are tightly coupled?

Conventions. How are files named? How are functions structured? What patterns does the codebase use for routing, data access, authentication, and error handling? Are these conventions documented or must they be inferred?

Test infrastructure. What test frameworks are in use? Where do tests live? How are they organized? What seed data patterns exist? How are test environments provisioned?

Known failure modes. What has gone wrong in the past? What anti-patterns have been identified? What areas of the codebase are fragile? What kinds of changes tend to cause regressions?

Organizational constraints. What can the team modify? What is off-limits? What requires approval from another team? What are the deployment and merge processes?

This audit produces the raw material for the pipeline's bespoke components: the project directives, the decomposition strategy, the verification rules, and the role permissions.

5.2 Define the Pipeline Shape

Based on the audit, define the pipeline's stages and the flow between them. Common shapes include:

TDD Pipeline:

PLAN → [RED → GREEN → REFACTOR] × batch → REVIEW → [REPLAN?] → VERIFY

Test Generation Pipeline:

RESEARCH → PLAN → [GENERATE → VERIFY] × batch → QUALITY AUDIT → FINAL VERIFY

Maintenance Pipeline:

INTAKE (commit/PR) → IMPACT ANALYSIS → RUN SUITE → CLASSIFY → FIX → GAP ANALYSIS → GENERATE → VERIFY

Audit Pipeline:

SCAN → CLASSIFY → DUPLICATION CHECK → FLAKE ANALYSIS → REPORT

Feature Pipeline:

RESEARCH → PRODUCT DOCS → RISK ASSESSMENT → TEST LAYER DECISION → TESTABILITY CHANGES → BATCH GENERATION → VERIFY → REVIEW

The shape should match the work. A pipeline that tries to do everything (generate tests, fix bugs, refactor code, review PRs, update documentation) will do all of them poorly. Build focused pipelines for specific activities.

5.3 Write the Project Directives

This is the most important step and the most time-intensive. Project directives should cover:

  1. Architecture description. A concise overview of the system that the agent can reference: tech stack, module boundaries, data flow, auth model.

  2. Naming conventions. File names, function names, variable names, test names, branch names. Be explicit. Examples are worth more than rules.

  3. Testing conventions. Test structure, assertion patterns, fixture management, selector strategy, Page Object Model conventions.

  4. Anti-pattern catalog. Real examples of past failures with the correct alternative. Each entry should include what the agent did wrong, why it was wrong, and what it should do instead.

  5. Boundary rules. Files and modules the agent must not modify. Dependencies it must not add. Patterns it must not introduce.

  6. Output format rules. Where to write results, what format to use, how to structure commits and pull requests.

The directive set is a living document. It will be sparse at first and grow rapidly as the pipeline runs. Every bad output is an opportunity to add a directive that prevents that class of failure in the future.

5.4 Design the Verification Layer

The verification layer defines what "correct" means for this project. It should include:

Functional verification. Run the tests. Check the exit codes. For TDD pipelines, confirm that RED tests fail and GREEN tests pass. For maintenance pipelines, confirm that the full suite passes after changes.

Convention verification. Scan the output for convention violations: wrong naming, wrong file locations, wrong test structure, wrong selector strategy.

Quality verification. Detect stub tests, trivial assertions, and weak patterns. Check for duplication against the existing codebase.

Structural verification. Confirm that the output matches the expected shape: correct number of files, correct module boundaries, correct import patterns.

Each verification check should produce a clear pass/fail result with a message that can be fed back to the agent if the check fails. This creates a feedback loop: the agent's failure becomes input to a revision stage, which produces corrected output that is verified again.
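
One way to make "clear pass/fail with a message" concrete is a uniform check interface. A sketch under the assumptions of this section; the names are illustrative:

interface VerificationResult {
  check: string;
  passed: boolean;
  message?: string; // fed back to the revision stage on failure
}

type VerificationCheck = (worktreePath: string) => VerificationResult;

function runVerification(
  checks: VerificationCheck[],
  worktreePath: string,
): { passed: boolean; feedback: string } {
  const failures = checks
    .map((check) => check(worktreePath))
    .filter((r) => !r.passed);
  return {
    passed: failures.length === 0,
    feedback: failures.map((f) => `${f.check}: ${f.message}`).join("\n"),
  };
}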

5.5 Design the Feedback and Revision Loop

A single pass is rarely sufficient for important work. The pipeline should include a review stage that audits the agent's output and a revision stage that incorporates feedback.

Review agent inspects the work
  → APPROVED: move to verification
  → FEEDBACK: pass feedback to a fix agent
      → Fix agent makes changes
      → Re-review (up to N rounds)

The review criteria are bespoke. They should reflect the project's standards:

  • Does the code follow project conventions?
  • Is there duplication with existing code?
  • Are tests meaningful (not stubs)?
  • Are boundary rules respected?
  • Are anti-patterns avoided?

Three review rounds is a practical maximum. If the output has not converged after three rounds, something is structurally wrong — either the directives are insufficient or the task decomposition is too coarse.
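
A sketch of the bounded loop, assuming reviewAgent and fixAgent wrap role-restricted agent invocations and the reviewer returns null on approval:

async function reviewLoop(
  reviewAgent: () => Promise<string | null>, // feedback text, or null if approved
  fixAgent: (feedback: string) => Promise<void>,
  maxRounds = 3,
): Promise<boolean> {
  for (let round = 1; round <= maxRounds; round++) {
    const feedback = await reviewAgent();
    if (feedback === null) return true; // approved: move to verification
    await fixAgent(feedback);           // forward the feedback to a fix agent
  }
  return false; // did not converge; fix the directives or the decomposition
}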

5.6 Design the Decomposition Strategy

Large tasks must be decomposed into steps that can be executed, verified, and reviewed independently. The decomposition strategy determines how work is divided.

This is deeply project-specific. In one codebase, the natural decomposition unit might be "one React component and its tests." In another, it might be "one API endpoint and its contract test." In another, it might be "one database migration and its integration tests."

Good decomposition has three properties:

  1. Each step is independently verifiable. You can run the tests for step 3 without needing to complete step 4 first.
  2. Each step is small enough for a single agent session. If a step requires more than ~50 lines of new code, it is probably too large.
  3. Dependencies between steps are explicit. If step 5 depends on step 3, that dependency is recorded in the manifest.

Bad decomposition creates steps that are too large, too coupled, or too vague. "Implement the user management system" is too large. "Implement the user creation API endpoint, the validation logic, the database migration, and the E2E test" is too coupled. "Improve the user experience" is too vague.
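
Concretely, the over-coupled example above decomposes into independently verifiable steps with explicit dependencies. An illustrative manifest, using the schema from Section 3.6:

const manifest: Manifest = {
  feature: "user-creation",
  nextId: 5,
  steps: [
    { id: "1", title: "Database migration: users table", context: "...", status: "pending" },
    { id: "2", title: "Validation logic for user payloads", context: "...", status: "pending" },
    { id: "3", title: "POST /users endpoint", context: "...", status: "pending",
      dependsOn: ["1", "2"] },
    { id: "4", title: "E2E test: user creation flow", context: "...", status: "pending",
      dependsOn: ["3"], testFiles: ["tests/e2e/user-creation.spec.ts"] },
  ],
};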

6. The Organizational Requirement

Building a bespoke pipeline requires more than technical skill. It requires organizational commitment. This section describes the organizational conditions that make bespoke pipeline development practical.

6.1 The Pipeline Is a Product

A bespoke pipeline is not a script you write once and run forever. It is a product that evolves with the codebase. It requires ongoing investment in directive refinement, verification rule updates, and decomposition strategy adjustments. Teams that treat the pipeline as a one-time setup cost will see its effectiveness degrade as the codebase evolves away from the pipeline's encoded assumptions.

In our experience, the pipeline requires approximately 10-15% of ongoing engineering time devoted to maintenance and improvement. This investment is modest relative to the productivity gains, but it is not zero. Teams that do not make this investment eventually find that their pipeline produces output that no longer matches the current state of the codebase.

6.2 The Pipeline Builder Is a Senior Engineer

Building a bespoke pipeline is senior engineering work. It requires deep understanding of the codebase, strong opinions about quality standards, and the ability to translate tacit engineering knowledge into explicit directives and verification rules.

This is not work that can be delegated to a junior engineer or outsourced to a vendor who does not know the codebase. The person building the pipeline must be the person who knows the codebase well enough to answer questions like: "What are the three most common mistakes an agent makes in this codebase?" and "What does a good pull request look like for this project?"

6.3 Access and Permissions

Bespoke pipeline development requires the same access and permissions described in our methodology paper. The pipeline builder needs:

  • Full access to all relevant repositories
  • Ability to run the application locally
  • Ability to provision ephemeral environments
  • Access to logs at all three layers (application, test framework, agent)
  • Permission to make testability changes to the application code
  • A high-capability model configuration with sufficient throughput

Without these prerequisites, the pipeline cannot be tested, debugged, or validated. Building a bespoke pipeline in a restricted environment produces a pipeline that works in the restricted environment — which is to say, it produces mediocre results.

6.4 The Learning Curve

The first bespoke pipeline for a new project takes significantly longer to build than subsequent pipelines. This is expected. The first pipeline is when the team discovers the project's conventions, identifies its anti-patterns, designs its verification rules, and writes its initial directive set. That discovery work has a high upfront cost.

The second pipeline is faster because it can reuse the structural components from the first pipeline and the directive set that was developed during the first pipeline's runs. The third pipeline is faster still. By the fourth or fifth pipeline, the team has a robust set of reusable components and a mature directive set that new pipelines can inherit.

This learning curve is one of the strongest arguments for bespoke pipeline development. The knowledge accumulated during pipeline construction is an asset that compounds over time. A generalized agent system does not accumulate this knowledge. It starts from the same baseline every time.

7. Common Pipeline Shapes and Their Bespoke Requirements

Different engineering activities require different pipeline shapes, and each shape has different bespoke requirements. This section describes five common shapes and the customization points that matter most for each.

7.1 The TDD Pipeline

Shape:

PLAN → [RED → GREEN → REFACTOR] × batch → REVIEW → [REPLAN?] → VERIFY

Bespoke requirements:

  • The planner must understand the feature well enough to decompose it into independently testable behaviors
  • The RED phase verification must confirm that the test fails for the right reason, not just that it fails
  • The GREEN phase verification must confirm that the implementation is minimal, not over-engineered
  • The review criteria must include project-specific code style, test quality, and convention compliance
  • The refactor phase must be bounded — the agent should improve the code it just wrote, not refactor the entire module

7.2 The Maintenance Pipeline

Shape:

INTAKE → IMPACT ANALYSIS → RUN SUITE → CLASSIFY → FIX → GAP ANALYSIS → GENERATE → VERIFY

Bespoke requirements:

  • The impact analysis stage must understand the project's module boundaries and dependency graph
  • The classification logic must distinguish between test failures caused by the change, pre-existing flakiness, and environment issues
  • The fix stage must know the project's conventions for updating tests (not just making them pass, but making them correct)
  • The gap analysis must know the project's risk model — which areas need coverage and which do not

7.3 The Audit Pipeline

Shape:

SCAN → CLASSIFY → DUPLICATION CHECK → FLAKE ANALYSIS → REPORT

Bespoke requirements:

  • The scan stage must know where tests live and how they are organized in this project
  • The classification criteria must reflect the project's quality standards
  • The duplication check must account for the project's legitimate patterns (some repetition is intentional)
  • The flake analysis must distinguish between flakiness sources: timing, shared state, selectors, or environment

7.4 The Feature Pipeline

Shape:

RESEARCH → PRODUCT DOCS → RISK ASSESSMENT → TEST LAYER DECISION → TESTABILITY CHANGES → BATCH GENERATION → VERIFY → REVIEW

Bespoke requirements:

  • The research stage must know where to look for feature context (tickets, PRs, design docs, product specs)
  • The risk assessment must reflect the project's specific risk model
  • The test layer decision must be informed by the project's existing test strategy (what is already covered at each layer)
  • The testability changes must follow the project's conventions for test IDs, seed data, and mocking

7.5 The Migration Pipeline

Shape:

ANALYZE → PLAN → TRANSFORM → VERIFY → RECONCILE → VERIFY → REVIEW

Bespoke requirements:

  • The analysis stage must understand both the source and target patterns
  • The transformation logic must preserve semantic behavior while changing structural patterns
  • The verification must confirm both that the new patterns are correct and that no existing behavior was broken
  • The reconciliation stage must handle edge cases where the transformation did not apply cleanly

8. Templates Versus Bespoke: What Is Actually Reusable

We do not argue that every pipeline must be built from scratch. We argue that the most important parts of every pipeline are the parts that cannot be templated. This section clarifies what can and cannot be reused.

8.1 What Can Be Templated

The following components can be packaged as starter templates and reused across projects with minimal modification:

  • Agent invocation layer (subprocess spawning, timeout handling, retry logic)
  • Worktree isolation (branch creation, symlink management, cleanup)
  • Session artifact management (directory creation, symlink to current session, file layout)
  • Parallel execution engine (worker pool, concurrency limiting, result collection)
  • Manifest schema (step definition, status tracking, dependency resolution)
  • Review loop structure (multi-round review with feedback forwarding)
  • Health check framework (type checking, full suite execution, application boot verification)
  • External tracking integration (structured updates to external systems)

These templates provide perhaps 30% of the total pipeline. They are the chassis. They are necessary but not sufficient.

8.2 What Cannot Be Templated

The following components must be written for each project:

  • Project directives. The conventions, anti-patterns, and boundary rules that define what "good code" looks like for this project. These cannot be generalized because they are the definition of project-specific quality.

  • Verification rules. The specific checks that determine whether output meets the project's standards. Generic verification ("do the tests pass?") is insufficient. Project-specific verification ("are the tests meaningful? Do they follow our patterns? Are they testing the right thing?") is required.

  • Decomposition strategy. How work is divided into steps depends on the project's architecture, module boundaries, and risk profile. A decomposition strategy that works for a React SPA does not work for a microservice architecture.

  • Prompt content. The instructions given to each agent must include the project's specific context, conventions, and constraints. A prompt template can provide structure, but the content must be project-specific.

  • Integration points. How the pipeline connects to the project's CI/CD system, ticket tracker, test platform, and review process. These integrations are unique to each organization.

8.3 The 30/70 Rule

In our experience, approximately 30% of a bespoke pipeline is reusable infrastructure and approximately 70% is project-specific customization. Teams that invest most of their effort in the 30% (building robust infrastructure) and neglect the 70% (writing good directives, verification rules, and decomposition strategies) end up with a pipeline that runs reliably but produces mediocre output.

The highest-performing teams do the opposite. They use minimal infrastructure — sometimes rough, sometimes held together with simple scripts — and invest heavily in the 70%: the directives, the anti-pattern catalog, the verification rules, and the decomposition knowledge. Their pipelines may not be the most elegant TypeScript programs, but they produce the best engineering output.

This is the central practical insight of this paper. The quality of a pipeline is determined by how well it encodes project-specific knowledge, not by how sophisticated its infrastructure is.

9. The Evolution of a Bespoke Pipeline

A bespoke pipeline is not static. It evolves through a predictable lifecycle that mirrors the team's growing understanding of the codebase and the pipeline's capabilities.

9.1 Phase 1: Discovery (Weeks 1-2)

The pipeline is simple — perhaps just an agent invocation layer with basic worktree isolation and a few project directives. Most of the work is discovering what directives the pipeline needs. Every run produces failures that reveal missing instructions, unknown conventions, or unanticipated anti-patterns.

The directive set grows rapidly during this phase. The team may add 5-10 new directives per week as the pipeline encounters the codebase's idiosyncrasies.

9.2 Phase 2: Stabilization (Weeks 3-6)

The pipeline's output quality improves noticeably. The most common failure modes have been addressed by directives. Verification rules catch the remaining issues before they reach human review. The team begins to trust the pipeline's output enough to reduce manual review overhead.

New directives are still being added, but at a slower rate. The focus shifts from "what is the pipeline getting wrong?" to "how can the pipeline handle more complex tasks?"

9.3 Phase 3: Maturity (Weeks 7-12)

The pipeline produces output that is consistently close to the team's quality standards. Human review is focused on subtle issues rather than obvious problems. The directive set is comprehensive enough that most tasks complete with zero or one review rounds.

At this point, the pipeline begins to be more productive than manual engineering for the tasks it was designed to perform. The crossover point — where the pipeline produces better output faster than a human engineer working alone — typically occurs somewhere in this phase.

9.4 Phase 4: Specialization (Ongoing)

The team builds additional pipelines for different activities: maintenance, audit, feature coverage, migration. Each new pipeline inherits the directive set from existing pipelines and adds its own activity-specific customizations.

The pipeline ecosystem becomes a significant engineering asset. It encodes the team's knowledge, standards, and failure lessons in a form that is executable, versionable, and compounding.

10. Limitations and Open Questions

10.1 The Upfront Investment Is Real

Building a bespoke pipeline takes time. The first pipeline for a new project requires 2-4 weeks of sustained effort from a senior engineer. Teams that need results immediately may not have the patience for this investment, even though the long-term return is substantial.

10.2 The Pipeline Builder Must Be Excellent

A pipeline is only as good as the directives it contains, and the directives are only as good as the engineer who writes them. Teams that do not have a senior engineer with deep codebase knowledge and strong opinions about quality will struggle to build effective pipelines.

10.3 Rapidly Changing Codebases Require More Maintenance

If the codebase is changing so rapidly that conventions shift weekly, the pipeline's directives may fall behind. The maintenance cost scales with the rate of change in the codebase.

10.4 Not Every Task Warrants a Pipeline

Small, one-off tasks — fixing a typo, answering a quick question, making a single targeted change — do not benefit from pipeline infrastructure. The overhead of pipeline setup, artifact creation, and verification is not justified for work that can be completed in five minutes with a direct agent session.

10.5 We Have Limited Data on Non-Web Codebases

Our experience is concentrated in web applications with React frontends and API backends. We expect the principles to apply to other domains — embedded systems, mobile applications, data pipelines — but the specific pipeline designs would need to adapt significantly. We do not yet have strong evidence for how bespoke pipelines perform in those contexts.

10.6 Model Capabilities May Change the Equation

As models become more capable, the gap between generalized agents and bespoke pipelines may narrow. Models with better long-term memory, stronger convention inference, and more accurate pattern matching may require less project-specific instruction. However, we do not expect the gap to close entirely. The fundamental advantage of a bespoke pipeline is that it starts with project-specific knowledge rather than inferring it. Better models will make inference faster and more accurate, but encoding knowledge directly will always be faster and more reliable than inferring it.

11. Conclusion

The pursuit of the god factory — one generalized agent system that solves all software engineering problems — is a seductive vision that does not match the reality of professional software development. Real codebases are specific. They have specific conventions, specific failure modes, specific architectural boundaries, and specific quality standards. A system that ignores this specificity in favor of generality will always underperform a system that embraces it.

Bespoke agentic pipelines are the practical alternative. They combine reusable structural components — agent invocation, worktree isolation, manifest-driven decomposition, role-based permissions, review loops — with deeply project-specific customization: directives that encode conventions, verification rules that enforce standards, decomposition strategies that respect architectural boundaries, and anti-pattern catalogs that prevent known failure modes.

The results are not marginal. In our experience across more than 30 client engagements, bespoke pipelines consistently produce 10-20x more useful output than generalized agent systems applied to the same codebase. The gains come not from better models but from better pipeline design — design that encodes the knowledge that a generalized system must discover, imperfectly, every time.

The analogy to manufacturing is apt. Ford builds different vehicles in different factories not because Ford lacks the talent to design a universal factory, but because specialized factories produce better vehicles. The same logic applies to software engineering. A pipeline designed for your codebase will always outperform a generalized agent that was designed for every codebase.

The practical implication is straightforward. If you are investing in AI-assisted software engineering, invest in pipeline design. Build the infrastructure once. Write the directives for your project. Encode your anti-patterns. Design your verification rules. Refine your decomposition strategy. The pipeline will be rough at first. It will improve with every run. And within weeks, it will produce better engineering output than any generalized system can.

The pipeline becomes bespoke quickly. That is not a failure. That is the methodology working as designed.


Appendix A: Pipeline Design Template

Use this template when designing a new bespoke pipeline.

Pipeline Name:
Purpose:
Target Codebase:
Trigger: (manual, commit-driven, ticket-driven, scheduled)
Inputs: (ticket ID, commit hash, feature description, etc.)
Repositories Touched:
Required Permissions:

Agent Roles:
  - Role 1: (name, purpose, allowed tools)
  - Role 2: (name, purpose, allowed tools)
  - ...

Stages:
  1. (stage name, agent role, input, output)
  2. (stage name, agent role, input, output)
  3. ...

Verification Steps:
  - (what is checked, how it is checked, what happens on failure)

Project Directives Location:
  - (path to directive files)

Decomposition Strategy:
  - (how work is divided into steps)
  - (what constitutes one step)
  - (maximum step size)
  - (dependency rules)

Artifacts Produced:
  - (manifests, research notes, review results, reports)

External Systems Updated:
  - (tickets, test platforms, dashboards, PRs)

Failure / Retry Strategy:
  - (what is retried, how many times, what triggers escalation)

Manual Review Checkpoint:
  - (where does a human inspect the output)

Branching Strategy:
  - (worktree, branch naming, commit conventions, merge path)

Appendix B: Directive Writing Checklist

Use this checklist when writing project directives for a new pipeline.

  • Architecture overview (tech stack, module boundaries, data flow)
  • File naming conventions (with examples)
  • Function naming conventions (with examples)
  • Test file naming and organization conventions
  • Test structure conventions (describe/it patterns, assertion style)
  • Selector strategy (when to use roles, labels, text, test IDs)
  • Seed data patterns (factory functions, fixture files, cleanup strategy)
  • Page Object Model conventions (if applicable)
  • Authentication and authorization patterns in test environments
  • Environment configuration patterns
  • Import and dependency rules
  • At least 5 anti-pattern entries with concrete FAILURE/CORRECT examples
  • Boundary rules (files not to modify, patterns not to introduce)
  • Output format rules (where to write, how to structure commits)
  • PR description template

Appendix C: Pipeline Maturity Assessment

Use this to evaluate the maturity of an existing bespoke pipeline.

Infrastructure (reusable components)

  • Agent invocation with timeout and retry handling
  • Worktree isolation for every run
  • Session artifact directory with consistent layout
  • Manifest-based decomposition with status tracking
  • Role-based tool permissions enforced per stage
  • Review and revision loop (up to 3 rounds)
  • Parallel execution for independent tasks
  • Health checks (type check, full suite, application boot)
  • External system updates (tickets, dashboards)

Bespoke components (project-specific)

  • Architecture directives written and current
  • Convention directives with concrete examples (10+ entries)
  • Anti-pattern catalog with FAILURE/CORRECT examples (5+ entries)
  • Boundary rules documented and enforced
  • Verification rules beyond "do tests pass" (3+ custom checks)
  • Decomposition strategy documented and tested
  • Prompt templates include all six layers (role, feature, manifest, prior work, directives, output instructions)
  • Directives updated within the last 2 weeks

Scoring

  • Infrastructure: 7+ checked = solid foundation
  • Bespoke: 6+ checked = mature pipeline producing high-quality output
  • Bespoke: 3-5 checked = developing pipeline, expect inconsistent output
  • Bespoke: 0-2 checked = the pipeline is infrastructure without substance — prioritize directive and verification work

Appendix D: God Factory vs. Bespoke Pipeline Comparison

| Dimension | God Factory (Generalized Agent) | Bespoke Pipeline |
| --- | --- | --- |
| Project knowledge | Inferred per session | Encoded in directives |
| Convention compliance | Best-effort guessing | Structurally enforced |
| Anti-pattern avoidance | Unknown to the agent | Explicitly listed and prevented |
| Verification | Self-assessed by the agent | Independently verified by the pipeline |
| Decomposition | Generic heuristics | Project-specific architectural boundaries |
| Parallelism quality | Inconsistent (each agent rediscovers conventions) | Consistent (all agents inherit directives) |
| Learning over time | None (starts from zero each session) | Compounding (directives grow with every run) |
| First-pass quality | Variable; often requires 2-4 review rounds | Consistent; typically 0-1 review rounds |
| Setup cost | Low (install and go) | Moderate (2-4 weeks for first pipeline) |
| Long-term ROI | Linear with model improvements | Exponential with directive maturity |
| Ceiling | Limited by what the model can infer | Limited by what the team can encode |

Appendix E: Common Anti-Pattern Template

Use this format when adding entries to your project's anti-pattern catalog.

### Anti-Pattern: [descriptive name]

**FAILURE:** [What the agent did wrong. Be specific. Include the actual bad output
if possible — file names, code patterns, structural decisions.]

**Why it's wrong:** [Why this is a problem in THIS project specifically.
Reference the specific constraint, convention, or past incident that makes
this pattern unacceptable.]

**CORRECT:** [What the agent should do instead. Be equally specific.
Show the correct pattern with enough detail that the agent can follow it.]

**Detection:** [How the verification layer can catch this automatically,
if applicable. Reference the specific check or regex.]

Example:

### Anti-Pattern: Duplicate Handler Files

FAILURE: The agent created 6 separate handler files (handleCreate.ts,
handleUpdate.ts, handleDelete.ts, handleList.ts, handleGet.ts, handleBatch.ts)
that were 80% identical, differing only in the HTTP method and validation schema.

Why it's wrong: This project uses a generic handler pattern with config objects.
Duplicate handlers create a maintenance burden — when the error handling format
changes, all 6 files must be updated instead of one.

CORRECT: One generic handler (lib/api/handler.ts) with a config map:
  const routes = { create: { method: 'POST', schema: createSchema }, ... }

Detection: Glob for files matching **/handle*.ts in the same directory.
If count > 2 in one module, flag for review.

Coming soon