Abstract
AI-assisted software development has a deterministic floor problem. Large language models are now competent enough to produce code that looks correct, follows most conventions, and passes most tests — and just wrong enough, just often enough, that every serious team using them has had to invent some mechanism for catching the failure modes that the model cannot catch itself. Retrieval-augmented generation does not solve this. Longer context windows do not solve this. Better prompts do not solve this. What solves it is a layer of rules the model cannot override, expressed as code the team maintains, that encode the invariants the codebase must not violate regardless of what any agent — human or AI — happens to write.
This paper argues that Policy as Code is the missing layer beneath modern agentic development. Drawing on our work at LoopQA building wixor_policy and its predecessors across more than 30 client engagements, we present Policy as Code as a complementary system to the non-deterministic rules that already govern AI agents: Claude skills, memory, slash commands, system prompts, CLAUDE.md files, subagent configurations, and the accumulated directive sets described in our earlier work. The two systems are not alternatives. They are complements. AI rules describe behavior in natural language and trust the model to follow them. Policy as Code describes behavior in executable terms and does not trust anything — it simply runs.
In the environments we have measured, the combination of Policy as Code with AI rules consistently achieves 80 to 90 percent accuracy on the invariants the team cares about — a ceiling that neither system reaches on its own. AI rules by themselves top out around 60 to 75 percent depending on the codebase. Pure linting and type-checking without AI collaboration top out around 40 to 55 percent on the invariants that actually matter for production quality. The combination produces compounding returns because the two systems catch different classes of failures.
We also introduce a measurement framework for policy drift — the slow divergence between what a policy says the codebase is and what the codebase actually is. Drift is the single largest failure mode for Policy as Code systems. A policy that describes the codebase as it existed six months ago is worse than no policy at all: it produces false positives that erode trust, it produces false negatives that mask real problems, and it poisons the feedback loop that AI agents rely on to learn the project's conventions. We describe four drift metrics — evidence freshness, rule hit rate, waiver age, and convention coverage — and present patterns for keeping drift bounded.
This paper has three aims. The first is to formalize the theory of Policy as Code as a distinct discipline, separate from linting, type checking, and general static analysis. The second is to describe the implementation patterns that have produced 80-90 percent accuracy across our engagements: the rule contract, the evidence hierarchy, the finding schema, the gate model, the waiver system, and the evidence cache. The third is to introduce drift as a first-class concern and to provide a measurement framework that teams can use to keep their policy systems current.
1. Introduction
There is a category of engineering problem that linters do not catch, type systems do not catch, tests do not catch, and AI review agents catch only some of the time. These are the problems that require a specific, concrete, project-level rule — "this module must expose a run/1 callback," "this package must register in the capability manifest," "every LiveView must declare a data-testid on its root element" — enforced with the precision and repeatability of a script and the domain specificity of a human reviewer.
Over the last two years, LoopQA has worked with more than 30 client teams using AI agents for software development, test automation, and quality engineering in production environments. In every single engagement that reached maturity, the team eventually built some form of Policy as Code: a set of executable rules, version-controlled alongside the codebase, that checked invariants the team cared about and failed the build when those invariants were violated. The rules were not linting rules. They were not type definitions. They were not tests. They were something else — policy — and no standard tool covered them.
Teams that took this discipline seriously reported consistent outcomes. AI agents produced code that passed policy more often over time. Human reviewers spent less time catching the same class of mistake repeatedly. The codebase's invariants were enforceable rather than aspirational. And — critically — the team's AI directives got shorter, not longer, because the rules that could be expressed as code no longer needed to be repeated in every prompt.
Teams that did not take this discipline seriously drifted. Their AI directives grew to enormous lengths as they tried to encode every constraint in natural language. Their code review comments began to repeat themselves — the same mistakes, caught by the same reviewers, fixed by the same engineers, only to return in the next pull request. Their codebases developed "conventions" that only the most senior engineers could explain, because the conventions had never been written down in a form anyone could execute.
The difference was not model quality. The teams using the same models got different results. The difference was whether the team had built a deterministic layer beneath the non-deterministic one.
This paper has three aims. The first is to describe Policy as Code as a distinct practice, with its own theory, its own implementation patterns, and its own failure modes. The second is to explain why Policy as Code and AI rules are complements rather than substitutes — and why teams that rely on either alone hit a ceiling that teams combining both do not. The third is to introduce drift as a first-class concern, with a measurement framework teams can adopt to keep their policy systems current.
We argue that the central productivity insight of the AI-assisted era is not "better models" or "better prompts" or "better agents." It is architectural. Agentic software development is most effective when a deterministic layer and a non-deterministic layer work together — each doing what the other cannot. The deterministic layer is Policy as Code. The non-deterministic layer is the set of AI rules, skills, and memory systems that shape agent behavior. Neither replaces the other. Both are necessary.
2. The Deterministic Floor Problem
Modern AI coding assistants are fluent. They produce code that compiles, runs, passes type checks, and often passes the team's tests on the first try. What they do not reliably produce is code that obeys the project's specific conventions — and every mature codebase has hundreds of such conventions. This is the deterministic floor problem: the layer of project-specific invariants that exists beneath the surface of what any general-purpose quality tool can detect.
2.1 What General-Purpose Tools Miss
Standard quality tooling catches a predictable subset of defects. A linter catches syntactic misuse. A type checker catches structural misuse. A test suite catches behavioral regressions. These tools are necessary. They are also insufficient for the class of invariants that actually distinguish a mature codebase from a fresh one.
Consider a short, non-exhaustive list of invariants we have seen in real client codebases that no standard tool catches:
- Every module in the platform/ directory must declare @behaviour PlatformComponent.
- Every Phoenix LiveView must have a data-testid attribute on its root element.
- Every API handler must include a @spec matching the declared OpenAPI contract.
- Every migration must include both an up/0 and a down/0 function.
- Every feature package must register in the central capability manifest.
- No file outside of lib/auth/ may import AuthInternal.
- Every test in test/integration/ must use the DataCase macro, not the base TestCase.
- Every LiveView that accepts a URL parameter must validate the parameter in handle_params/3.
- Every component must export both component_name and component_name_tests from the same package.
None of these are linting rules. Linters do not understand behaviours, LiveView attributes, or capability manifests. None of them are type errors. The type checker will happily accept code that violates every single one of these invariants, because the invariants are not about types — they are about structural presence, cross-module constraints, and project-specific patterns.
Tests catch behavioral regressions, not structural violations. A missing data-testid is not a behavioral bug — the LiveView renders correctly; it just is not testable. A missing @behaviour declaration is not a behavioral bug — the module works; it just does not advertise its contract. These violations do not fail the test suite. They fail the codebase, slowly, by eroding its invariants until nothing can be trusted.
2.2 Why AI Rules Alone Are Insufficient
A natural first reaction is: "Fine, encode these rules in the AI system's prompt or skills." This works — partially. AI rules are very good at shaping agent behavior on the happy path. They are less good at catching edge cases, and they are terrible at catching violations that originate from somewhere other than the AI agent itself.
We have measured three recurring failure modes for AI-only rule enforcement.
Prompt decay under context pressure. As a session's context window fills, earlier instructions receive less weight than later ones. A rule placed at the beginning of a long system prompt — "every LiveView must have a data-testid" — may be respected on turn 1 and forgotten by turn 15. Rules at the top of a CLAUDE.md file are more durable than rules buried deep in a directive set, but even the best placement does not produce deterministic compliance.
Semantic ambiguity. Natural language rules are interpreted. "Avoid deep nesting" is interpreted. "Use descriptive names" is interpreted. Even concrete rules like "use data-testid for interactive elements" admit ambiguity: what counts as interactive? Is a disabled button interactive? Is a hover tooltip interactive? The AI's interpretation drifts from the team's interpretation in ways that are hard to notice in aggregate.
Non-AI-originated violations. When a human engineer writes a pull request, no AI rule catches the violation. When a contractor writes code, no AI rule catches the violation. When a bulk migration script modifies thousands of files, no AI rule catches the violation. AI rules only apply when an AI is driving. The codebase contains code from many sources, most of which never pass through an AI at all.
These are not failures of the AI system. They are failures of the assumption that a non-deterministic tool can provide deterministic guarantees. It cannot. Some layer beneath the AI must actually execute, actually check, and actually fail the build.
2.3 Why General Static Analysis Is Insufficient
The other natural reaction is: "Fine, use a generic static analysis tool — there are plenty of them." This also works, partially, and also misses the point. Generic static analyzers are designed to catch generic defects. They do not know about your capability manifest, your Page Object Model convention, your platform component behaviour, or your authentication boundary rules. They catch the things every codebase has in common. They do not catch the things your codebase has that other codebases do not.
The invariants that actually distinguish a mature codebase from an immature one are project-specific. They come from the team's accumulated experience, past incidents, architectural decisions, and deliberate conventions. Encoding them requires writing rules against the project's own abstractions, metadata, and structural surfaces — not against a universal ruleset designed for every project at once.
This is the same insight that drives bespoke agentic pipelines. Generic tools cannot know what your project knows. A system that enforces the invariants a mature team cares about must be written for that team, against that team's code, with that team's conventions encoded as executable rules.
3. Theory: What Policy as Code Actually Is
Policy as Code is a practice distinct from linting, type checking, testing, and general static analysis. This section defines the practice precisely, identifies its defining characteristics, and distinguishes it from adjacent disciplines.
3.1 Definition
Policy as Code is the practice of expressing a codebase's project-specific invariants as executable rules, version-controlled alongside the codebase, that produce structured findings when violations occur and that gate the build on configurable criteria.
Three properties define the practice:
Project specificity. Rules are written for the specific codebase, against the specific codebase's abstractions, conventions, and metadata. A rule that could be written for any project is probably a linting rule, not a policy rule.
Evidence-based execution. Rules are checked against structural evidence — compiled module introspection, AST analysis, framework metadata, declared manifests — not against text patterns. A rule that relies on regex as its source of truth is a heuristic, not a policy.
Deterministic outcome. The same rule applied to the same evidence produces the same finding every time. There is no probabilistic component. There is no model. There is only the rule and the evidence.
A system that violates any of these three properties is something other than Policy as Code. It may be valuable — linting is valuable, AI review is valuable — but it does not provide the deterministic floor that Policy as Code provides.
3.2 What Policy as Code Is Not
It is not linting. Linters check generic code quality: unused imports, missing semicolons, formatting inconsistencies. Policy rules check project-specific invariants: "every module must register in the capability manifest." A linter that has been extended with project-specific rules is a policy engine; a policy engine without project specificity is just a linter.
It is not type checking. Type systems check structural compatibility. They do not check cross-module conventions, required metadata, declared capabilities, or architectural boundaries. A type system that enforced every invariant we care about would be so complex and so project-specific that it would no longer be a type system.
It is not testing. Tests check behavior. Policy checks structure. A test can assert that the user creation API works; it cannot assert that the user creation API follows the project's handler pattern. A policy rule can. The two disciplines check different things and should not be conflated.
It is not AI review. AI review agents read code and produce natural-language feedback. Policy rules read evidence and produce structured findings. The two produce different outputs, have different failure modes, and belong in different positions in the pipeline.
It is not documentation. A document that describes the project's conventions is not executable. It can be ignored without consequence. Policy as Code is the executable form of the same information, with consequences for violation.
3.3 The Rule Contract
Every policy rule must answer four questions:
- What invariant does this rule enforce? Stated as a single sentence that could go in a design document.
- What evidence does this rule check? The specific source — compiled module, AST node, manifest entry — from which the rule derives its finding.
- What finding does this rule produce when violated? The structured output: severity, message, suggested fix, subject, file, line.
- What gate does this rule affect? Whether a violation blocks the build, warns the developer, or serves as informational signal.
A rule that cannot answer all four questions is not well-formed. The vast majority of rule-writing failures we have observed trace back to an unclear answer to one of these four questions. A rule that is unclear about the invariant produces false positives. A rule that is unclear about the evidence produces brittle checks. A rule that is unclear about the finding produces noise. A rule that is unclear about the gate produces inconsistent behavior under pressure.
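To make the contract concrete, here is how a hypothetical LiveView testability rule might record its answers. The rule ID and name are invented for the example, and the metadata shape anticipates the rule behaviour defined in Section 5.1; the evidence and finding answers live in the rule's run logic, so they appear here as comments.
# Hypothetical example: one rule's answers to the four questions.
#
# 1. Invariant — every LiveView must declare a data-testid on its root element.
# 2. Evidence — the parsed template AST of the render function (Level 2).
# 3. Finding — severity :hard_fail with file, line, and a suggested fix.
# 4. Gate — a violation blocks merge to main.
def metadata do
  %{
    rule_id: "WP-L001",
    name: "liveview-data-testid",
    description: "Every Phoenix LiveView must have a data-testid attribute on its root element.",
    domain: :app,
    scope: :file,
    gate: :hard_fail
  }
end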
3.4 The Evidence Hierarchy
Not all evidence is equal. Rules should draw from the highest-quality evidence available, falling back to lower-quality sources only when the invariant cannot be proven otherwise. We have settled on a five-level hierarchy.
Table 1. Evidence hierarchy for policy rule authoring
| Level | Evidence Source | Example | Reliability |
|---|---|---|---|
| 1 | Runtime and compile-time introspection | Code.fetch_docs/1, Code.Typespec.fetch_callbacks/1, function_exported?/3 | Highest — reflects what the runtime actually sees |
| 2 | AST-based source analysis | Parsed syntax trees for attributes, decorators, macros, and structure | High — reflects what the compiler actually sees |
| 3 | Framework-specific reflection | Phoenix router tables, component declarations, LiveView metadata | High — reflects the framework's declared surface |
| 4 | Filesystem presence | Required files, directories, guide pages, migrations | Moderate — reflects declared ownership |
| 5 | Regex and text matching | String search of source files | Low — last resort, not a source of truth |
The principle is that a rule should rely on the highest level of evidence that can prove the invariant. A rule that could use Level 1 introspection but instead uses Level 5 regex is a brittle rule waiting to produce false positives.
This is one of the most important design decisions in a policy system. The hierarchy is not cosmetic. Rules written against compiled introspection are stable across refactors. Rules written against regex break the moment someone reformats a file, renames a variable, or splits a module. The difference is the difference between a policy system that lasts and one that collapses under maintenance pressure.
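To see the difference in practice, consider the @behaviour invariant from Section 2. The sketch below, with hypothetical helper names, assumes the module under inspection is already compiled and loaded:
# Level 1 — introspection on the compiled module. Stable across
# reformatting, renaming, and file splits.
def declares_behaviour?(module, behaviour) do
  module.module_info(:attributes)
  |> Keyword.get_values(:behaviour)
  |> List.flatten()
  |> Enum.member?(behaviour)
end

# Level 5 — regex over source text. Breaks on aliases, comments, and
# formatting changes; shown only as the contrast.
def declares_behaviour_by_regex?(source) when is_binary(source) do
  Regex.match?(~r/@behaviour\s+PlatformComponent/, source)
end
The two checks prove the same invariant today; only the first keeps proving it after the file is reformatted or the module is split.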
3.5 The Finding Contract
Every policy finding must be normalizable to a common schema. Without a normalized schema, reports cannot be aggregated, findings cannot be compared across rules, and tooling cannot be built on top of the policy engine.
defmodule WixorPolicy.Finding do
  @enforce_keys [:id, :rule_id, :domain, :severity, :message]
  defstruct @enforce_keys ++ [:subject, :file, :line, :column, :suggested_fix, :autofix_payload]

  @type t :: %__MODULE__{
          id: String.t(),
          rule_id: String.t(),
          domain: :contract | :app | :repo,
          severity: :hard_fail | :release_gate | :warning | :advisory,
          message: String.t(),
          subject: String.t() | nil,
          file: Path.t() | nil,
          line: pos_integer() | nil,
          column: pos_integer() | nil,
          suggested_fix: String.t() | nil,
          autofix_payload: map() | nil
        }
end
Every finding carries a stable ID (for suppression and tracking), a rule ID (to trace back to the rule), a domain (which policy domain this finding belongs to), a severity (which gate it affects), and a message. Optional fields — subject, file, line, column, suggested fix, autofix payload — provide the context a developer needs to resolve the finding.
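For illustration, a hypothetical finding from a LiveView testability rule might look like this (the rule ID, module, and file are invented; omitted fields default to nil):
%WixorPolicy.Finding{
  id: "WP-L001:lib/my_app_web/live/dashboard_live.ex:12",
  rule_id: "WP-L001",
  domain: :app,
  severity: :hard_fail,
  message: "LiveView root element is missing a data-testid attribute",
  subject: "MyAppWeb.DashboardLive",
  file: "lib/my_app_web/live/dashboard_live.ex",
  line: 12,
  suggested_fix: "add data-testid=\"dashboard-live\" to the root element"
}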
The finding contract is not an implementation detail. It is the API that every downstream tool consumes: CI systems, IDE plugins, AI review agents, dashboards, incident systems. A policy engine without a stable finding contract cannot participate in a broader quality ecosystem.
4. The Complementarity Principle
Policy as Code and AI rules are complements, not substitutes. The central claim of this paper is that the combination of the two systems reliably achieves 80 to 90 percent accuracy on the invariants a team cares about, while either system alone tops out well below that ceiling. This section explains why.
4.1 What AI Rules Are Good At
AI rules — Claude skills, memory entries, system prompts, CLAUDE.md directives, subagent configurations — are very good at a specific class of work:
- Shaping behavior on the happy path of a task
- Handling ambiguous cases that require judgment
- Producing natural-language output (code comments, PR descriptions, documentation)
- Connecting multiple rules into coherent decisions
- Responding to context that was not anticipated when the rules were written
These are things that deterministic rules struggle with. A policy rule cannot reason about whether a new component is more similar to pattern A or pattern B. An AI rule can. A policy rule cannot generate a descriptive commit message that reflects the intent of a change. An AI rule can. A policy rule cannot weigh whether a minor violation is worth flagging given the context of a larger refactor. An AI rule can.
4.2 What AI Rules Are Bad At
AI rules fail in predictable ways:
- They decay under context pressure
- They interpret rules rather than executing them
- They miss violations in code they did not write
- They produce inconsistent results across runs
- They cannot gate a build deterministically
- They cannot be audited for exhaustive coverage
- They degrade when the instruction set grows beyond a certain size
These are not model failures. They are structural properties of non-deterministic systems. No amount of prompt engineering converts an interpretive system into a deterministic one.
4.3 What Policy as Code Is Good At
Policy as Code is good at the things AI rules are bad at:
- Deterministic enforcement regardless of context pressure
- Exact execution of the written rule, with no interpretation
- Uniform application to all code, regardless of origin
- Identical results across runs
- Build gating with clear pass/fail semantics
- Auditable coverage via rule inventories
- Stable behavior as the rule set grows
A policy rule does not care how much context the pipeline has used. It does not interpret. It does not distinguish between AI-authored code and human-authored code. It runs. It produces a finding or does not.
4.4 What Policy as Code Is Bad At
Policy as Code is bad at the things AI rules are good at:
- Handling ambiguous cases that require judgment
- Generating natural-language output
- Connecting multiple weak signals into a decision
- Adapting to context that was not anticipated
- Expressing rules about intent rather than structure
A policy rule cannot decide whether a test assertion is "meaningful enough." It can detect trivially weak assertions (for example, a regex that flags bare toBeDefined calls), but it cannot judge whether a well-formed assertion actually covers the behavior the test claims to cover. That is an interpretive task. AI rules handle it.
4.5 The Combined System
The two systems cover each other's blind spots. AI rules handle interpretation, judgment, and novel context. Policy rules handle determinism, uniformity, and gating. A system that combines both covers a wider surface than either alone.
Table 2. Coverage by system across invariant categories
| Invariant Category | AI Rules Alone | Policy as Code Alone | Combined |
|---|---|---|---|
| Structural presence (module declares X) | 60–70% | 95–100% | 95–100% |
| Convention adherence (naming, file location) | 65–80% | 90–100% | 95–100% |
| Semantic quality (test is meaningful, not stub) | 70–85% | 20–40% | 85–95% |
| Cross-module constraints (A must not import B) | 40–60% | 95–100% | 95–100% |
| Intent-based rules (new feature follows pattern) | 70–85% | 30–50% | 85–95% |
| Manifest and metadata registration | 50–70% | 95–100% | 95–100% |
| Anti-pattern avoidance (known bad pattern) | 60–75% | 80–95% | 90–98% |
| Novel case handling (pattern not yet codified) | 60–75% | 0% | 60–75% |
| Aggregate across all categories | 60–75% | 50–70% | 80–90% |
The aggregate number is what matters. Neither system alone exceeds 75 percent coverage across all invariant categories that we have measured in production. The combination reliably reaches 80 to 90 percent. The difference is not marginal. It is the difference between a codebase that erodes under AI-assisted development and a codebase that hardens under it.
4.6 Why the Ceiling Exists
The 80-90 percent ceiling for the combined system is a real ceiling. Reaching it requires disciplined work. Exceeding it requires disciplined work that most teams will not do. Three factors bound the ceiling.
First, some invariants require human judgment no rule can capture. "This module should be decomposed because it is doing too much" is a legitimate finding that no policy rule can produce mechanically and no AI rule can produce reliably. Human review is the only reliable source.
Second, novel cases always exist. A rule can only cover cases the team has seen and codified. The first time a new pattern appears, no rule covers it. The team discovers it, codifies it, and raises the ceiling — but the next novel pattern is always waiting.
Third, drift always exists. Even the best policy system drifts as the codebase evolves. Bringing drift back to zero requires maintenance the team must prioritize. Teams that do not maintain their policy actively fall below 80 percent over time, regardless of how well the system was built initially.
4.7 Layered Architecture
In production, the two systems compose as layers in a pipeline.
Figure 1. Combined deterministic and non-deterministic quality system
flowchart TB
A[Developer or AI agent produces changes] --> B[Policy engine: deterministic checks]
B --> C{All policy checks pass?}
C -- No --> D[Structured findings returned to author]
D --> A
C -- Yes --> E[AI review agent: interpretive checks]
E --> F{AI review approves?}
F -- No --> G[Natural-language feedback to author]
G --> A
F -- Yes --> H[Human review: judgment calls]
H --> I{Human approves?}
I -- No --> J[Review comments to author]
J --> A
I -- Yes --> K[Merge]
style B fill:#2d3748,color:#fff
style E fill:#4a5568,color:#fff
style H fill:#1a202c,color:#fff
style K fill:#2d3748,color:#fff
The deterministic layer runs first because it is cheap, fast, and covers the largest class of catchable problems. The AI layer runs second because it is more expensive and covers a different class of problems. The human layer runs last because it is most expensive and covers the problems neither of the first two layers can catch.
A system that inverts this order — putting human review before AI review, or AI review before deterministic checks — pays for expensive layers to catch problems that cheaper layers would have caught. A system that skips any of the three layers leaves a class of defects uncovered.
5. Implementation: The Rule Engine
This section describes the implementation patterns that make Policy as Code practical at scale. The patterns are drawn from wixor_policy and its predecessors; similar patterns appear in any mature policy system regardless of language.
5.1 The Rule Behaviour
Every rule should implement a common interface. This enables the engine to load, run, and report on rules uniformly.
defmodule WixorPolicy.Rule do
@callback metadata() :: %{
rule_id: String.t(),
domain: :contract | :app | :repo,
scope: :file | :package | :app | :repo,
name: String.t(),
description: String.t(),
gate: :hard_fail | :release_gate | :warning | :advisory
}
@callback run(context :: WixorPolicy.Context.t()) ::
{:ok, [WixorPolicy.Finding.t()]} | {:error, term()}
@callback fix(finding :: WixorPolicy.Finding.t(), context :: WixorPolicy.Context.t()) ::
{:ok, :applied} | {:skip, String.t()} | {:error, term()}
@optional_callbacks [fix: 2]
end
The metadata/0 callback advertises the rule's identity, domain, scope, and gate. The engine uses this to build the rule inventory, filter rules by scope, and determine gate behavior. The run/1 callback does the actual checking. The fix/2 callback is optional and reserved for rules that can automatically apply their fix.
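A minimal sketch of a complete rule implementing this behaviour, enforcing the platform-component invariant from Section 2; the rule ID, the PlatformComponent behaviour, and the directory convention are assumptions carried over from earlier examples:
defmodule WixorPolicy.Rules.PlatformBehaviour do
  @behaviour WixorPolicy.Rule

  @impl true
  def metadata do
    %{
      rule_id: "WP-R010",
      domain: :repo,
      scope: :file,
      name: "platform-component-behaviour",
      description: "Every module under platform/ must declare @behaviour PlatformComponent.",
      gate: :hard_fail
    }
  end

  @impl true
  def run(%WixorPolicy.Context{modules: modules}) do
    findings =
      for module <- modules,
          in_platform_dir?(module),
          not declares_behaviour?(module, PlatformComponent) do
        %WixorPolicy.Finding{
          id: "WP-R010:#{inspect(module)}",
          rule_id: "WP-R010",
          domain: :repo,
          severity: :hard_fail,
          message: "#{inspect(module)} does not declare @behaviour PlatformComponent",
          subject: inspect(module),
          suggested_fix: "add @behaviour PlatformComponent to the module"
        }
      end

    {:ok, findings}
  end

  # Level 1 evidence: the compile-time source path recorded in the module.
  defp in_platform_dir?(module) do
    case module.module_info(:compile)[:source] do
      nil -> false
      source -> source |> to_string() |> String.contains?("/platform/")
    end
  end

  # Level 1 evidence: persisted behaviour attributes.
  defp declares_behaviour?(module, behaviour) do
    module.module_info(:attributes)
    |> Keyword.get_values(:behaviour)
    |> List.flatten()
    |> Enum.member?(behaviour)
  end
end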
5.2 Context Loading
The context is the input to every rule. It contains the evidence the rule checks against.
defmodule WixorPolicy.Context do
  defstruct [
    :scope_kind,
    :workspace_root,
    :package_path,
    :otp_app,
    modules: [],
    manifests: %{},
    ast_cache: %{},
    framework_meta: %{},
    git_info: %{}
  ]

  @type t :: %__MODULE__{
          scope_kind: :file | :package | :app | :repo,
          workspace_root: Path.t(),
          package_path: Path.t() | nil,
          otp_app: atom() | nil,
          modules: [module()],
          manifests: map(),
          ast_cache: map(),
          framework_meta: map(),
          git_info: map()
        }
end
Context loading is expensive. Compiling modules, parsing ASTs, and querying framework metadata take real time. A naive implementation reloads the context for every rule and turns a two-minute policy run into a forty-minute one.
The right pattern is lazy evaluation with explicit sharing. The context exposes getter functions that compute evidence on demand and cache the result. Rules that need the AST for a file pay the parse cost once, and every subsequent rule that needs the same AST gets the cached result.
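A minimal sketch of the pattern, assuming a per-run ETS table as the shared store; a production engine would also handle parse failures and concurrent writers:
defmodule WixorPolicy.Context.Lazy do
  @table :wixor_evidence

  # Called once at the start of a policy run.
  def start_run do
    :ets.new(@table, [:named_table, :public, :set])
  end

  # Returns the parsed AST for a file, paying the parse cost at most
  # once per run; every later rule gets the cached result.
  def ast_for(path) do
    case :ets.lookup(@table, {:ast, path}) do
      [{_key, ast}] ->
        ast

      [] ->
        ast = path |> File.read!() |> Code.string_to_quoted!()
        :ets.insert(@table, {{:ast, path}, ast})
        ast
    end
  end
end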
5.3 The Evidence Cache
Beyond in-memory sharing within a single run, the engine should persist evidence across runs. Most policy evidence does not change between runs. If the file lib/foo.ex has the same SHA as it did last run, its AST is unchanged, its compiled module surface is unchanged, its moduledoc is unchanged. Recomputing all of this is wasteful.
The evidence cache stores computed evidence keyed by the SHA of the source files that produced it. When the same rule runs against the same file, the engine retrieves the cached evidence. When the file changes, the cache invalidates automatically.
reports/evidence_cache.<root_basename>.term
The cache is a plain Erlang term serialized to disk. It is rebuilt automatically when files change. It is invalidated manually when the rule set changes in a way that affects what evidence is collected.
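A sketch of the cross-run lookup, assuming the cache is a single map keyed by file path and content SHA; the cache path is illustrative, and eviction of entries for old SHAs is elided:
defmodule WixorPolicy.EvidenceCache do
  @cache_path "reports/evidence_cache.myrepo.term"

  # Returns cached evidence for the file if its content is unchanged;
  # otherwise computes, stores, and returns fresh evidence. A changed
  # file produces a new SHA, so invalidation is automatic.
  def fetch(path, compute_fun) do
    sha = :crypto.hash(:sha256, File.read!(path)) |> Base.encode16()
    cache = load()

    case Map.get(cache, {path, sha}) do
      nil ->
        evidence = compute_fun.()
        store(Map.put(cache, {path, sha}, evidence))
        evidence

      evidence ->
        evidence
    end
  end

  defp load do
    case File.read(@cache_path) do
      {:ok, binary} -> :erlang.binary_to_term(binary)
      {:error, _} -> %{}
    end
  end

  defp store(cache), do: File.write!(@cache_path, :erlang.term_to_binary(cache))
end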
This single optimization — evidence caching — reduces the cost of a full repository scan by 10-20x in our deployments. Without it, Policy as Code is impractical for repositories beyond a few hundred thousand lines. With it, a full scan of a multi-million-line monorepo completes in minutes.
5.4 Rule Scoping
Not every rule needs to run against every file. A rule that checks "every LiveView has a data-testid" only needs to examine files that declare a LiveView. A rule that checks "every package registers in the capability manifest" only needs to run once per package, not once per file.
Rules declare their scope in metadata:
- :file — runs once per file in scope
- :package — runs once per package
- :app — runs once per application
- :repo — runs once across the repository
The engine uses scope to dispatch rules efficiently. A file-scoped rule that matches 500 files is not invoked once and left to iterate internally; the engine invokes it 500 times, once per file, which lets the engine parallelize, filter, and cache at the rule-invocation boundary.
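A sketch of scope-based dispatch, showing the :file and :repo cases (:package and :app follow the same shape); source_files/1 and the Context constructors are hypothetical helpers, and error handling is elided:
def dispatch(rules, workspace) do
  by_scope = Enum.group_by(rules, fn rule -> rule.metadata().scope end)

  # File-scoped rules: one invocation per rule-file pair, which is the
  # boundary where the engine parallelizes and caches.
  file_findings =
    for rule <- Map.get(by_scope, :file, []),
        path <- source_files(workspace) do
      {:ok, findings} = rule.run(WixorPolicy.Context.for_file(workspace, path))
      findings
    end

  # Repo-scoped rules: exactly one invocation each.
  repo_findings =
    for rule <- Map.get(by_scope, :repo, []) do
      {:ok, findings} = rule.run(WixorPolicy.Context.for_repo(workspace))
      findings
    end

  List.flatten(file_findings ++ repo_findings)
end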
5.5 The Gate Model
Not every finding should fail the build. Some findings are advisory — worth reporting, not worth blocking. Others are hard blocks. The gate model encodes this directly.
@type gate :: :hard_fail | :release_gate | :warning | :advisory
# Gate semantics
# :hard_fail — blocks merge to main
# :release_gate — does not block merge, blocks release tagging
# :warning — does not block, surfaces in reports
# :advisory — appears only when explicitly requested
Context-aware execution completes the model. The engine accepts a --gate-context flag that controls which gates produce non-zero exit codes.
mix wixor.policy --gate-context merge # hard_fail only
mix wixor.policy --gate-context release # hard_fail + release_gate
mix wixor.policy --gate-context audit # everything
This lets the same rule set drive different CI jobs. Merge CI only fails on hard_fail. Release CI fails on hard_fail or release_gate. Scheduled audit runs surface everything.
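A sketch of the exit-code logic behind the flag, mirroring the gate semantics above:
@failing_severities %{
  merge: [:hard_fail],
  release: [:hard_fail, :release_gate],
  audit: [:hard_fail, :release_gate, :warning, :advisory]
}

# Non-zero exit only when a finding's severity fails in this context.
def exit_code(findings, gate_context) do
  failing = Map.fetch!(@failing_severities, gate_context)
  if Enum.any?(findings, fn f -> f.severity in failing end), do: 1, else: 0
end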
5.6 Waivers
Every mature policy system accumulates findings that cannot be fixed immediately. A refactor is planned but not yet scheduled. A legacy module is on the deprecation path. An external dependency violates a convention the team cannot change. These findings need to be suppressible without disabling the rule.
The waiver system allows time-bounded suppression of specific findings.
defmodule MyApp.PolicyWaivers do
use WixorPolicy.Waivers
waive finding_id: "WP-L001:lib/legacy/handler.ex:42",
reason: "legacy module, scheduled for rewrite in Q3 2026",
expires: ~D[2026-09-30]
waive rule_id: "WP-R042",
scope: "packages/external/*",
reason: "external dependency, cannot modify",
expires: :never
end
A waived finding does not disappear from reports — it appears with a waived marker and the reason. The distinction between "no finding" and "waived finding" is important. A waived finding is acknowledged but deferred. A missing finding is invisible.
Waivers have expiration dates. A waiver that expires without action produces a new finding: "waiver expired, original finding not resolved." This prevents waivers from becoming permanent holes in the policy system.
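A sketch of how the engine might apply a waiver at report time; the return shapes are illustrative rather than the engine's actual API:
def resolve(finding, waiver, today \\ Date.utc_today()) do
  cond do
    # Unexpired (or permanent) waiver: the finding stays in the report,
    # marked as waived with its reason, and does not gate.
    waiver.expires == :never or Date.compare(today, waiver.expires) != :gt ->
      {:waived, waiver.reason, finding}

    # Expired waiver: a new finding that cannot be silently ignored.
    true ->
      {:active, %{finding | message: "waiver expired, original finding not resolved: " <> finding.message}}
  end
end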
5.7 Incremental Scans
Full repository scans are appropriate for CI, scheduled audits, and release gates. They are too expensive for local development feedback loops. The engine supports incremental scans that check only files changed since a base reference.
mix wixor.policy --changed-only --base origin/main
The incremental mode identifies which files changed, determines which rules might be affected by those changes, and runs only those rules. For a small change — a single file, a few lines — the incremental scan completes in seconds rather than minutes.
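The changed-file detection is ordinary git plumbing. A minimal sketch, assuming Elixir sources; mapping changed files to the affected rules is left to the engine:
def changed_files(base \\ "origin/main") do
  {out, 0} =
    System.cmd("git", ["diff", "--name-only", base, "--", "*.ex", "*.exs"])

  String.split(out, "\n", trim: true)
end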
This is the single biggest ergonomic improvement a policy system can offer developers. A policy that takes minutes to run discourages local use. A policy that takes seconds integrates into the normal development loop.
6. Implementation: The Three Policy Domains
Policy as Code splits naturally into three domains with different scopes and ownership models. Understanding this split is critical — conflating the three produces a policy system that is impossible to maintain.
6.1 Contract Policy
Contract policy validates the declaration shape of shared modules and packages. It answers questions like: "Does this capability manifest declare the required fields? Does this behaviour module expose the documented callbacks? Does this feature package conform to the contract it claims to implement?"
Contract policy is authored centrally, typically in the same repository as the contract itself. It runs against any consumer of the contract. The ownership model is clear: the owner of the contract owns its policy.
6.2 App Policy
App policy validates how an application consumes shared features. It answers questions like: "Does this app register the required feature packages? Does it configure them correctly? Does it avoid using private APIs of shared modules?"
App policy is authored partially centrally (rules about correct consumption) and partially by the feature package (rules specific to that package's consumption patterns). The ownership model splits: repo-wide rules live in the central engine; feature-specific consumption rules live in the feature package.
6.3 Repo Policy
Repo policy validates monorepo-wide uniformity. It answers questions like: "Do all packages follow the same directory layout? Do all apps use the same test framework? Do all modules have moduledocs?"
Repo policy is always authored centrally. Feature packages do not own repo-wide rules because no single feature has the authority to define repo-wide conventions. Repo policy is how the monorepo itself expresses its architectural consensus.
6.4 Cross-Domain Interaction
The three domains interact. A contract rule may produce findings against a consumer app. An app rule may reference the capability registered by a feature package. A repo rule may assume the contract is well-formed.
The engine handles this by running the domains in order: contract first, app second, repo third. A failure at the contract layer prevents downstream domains from producing confusing findings against code that violates an upstream contract.
Figure 2. Policy domain execution order
flowchart LR
A[Contract Policy] --> B[App Policy]
B --> C[Repo Policy]
C --> D[Aggregate Report]
style A fill:#2d3748,color:#fff
style B fill:#4a5568,color:#fff
style C fill:#1a202c,color:#fff
style D fill:#2d3748,color:#fff
7. Drift: Theory and Measurement
The largest failure mode for Policy as Code is not initial authoring. It is drift — the slow divergence between what the policy says the codebase is and what the codebase actually is. A policy that describes the codebase as it existed six months ago is worse than no policy at all. This section formalizes drift as a concept and presents a measurement framework.
7.1 What Drift Is
Drift is the accumulating gap between encoded policy and actual codebase state. It appears in four distinct forms, each with different causes and different remedies.
Convention drift. The codebase adopts a new convention that the policy does not know about. The new convention is the right answer, but the policy has not been updated to recognize it. The policy continues to enforce the old convention, producing false positives against new, correct code.
Anti-pattern drift. The codebase discovers a new failure mode that the policy does not yet catch. The failure mode is real, but the policy has no rule for it. The same mistake keeps recurring in reviews because the policy system cannot prevent it.
Evidence drift. The rules still describe what the team wants, but the evidence they check is stale — the policy reads cached metadata that no longer reflects reality, or relies on an introspection surface that has changed. The rule produces findings against code that does not deserve them, or misses code that does.
Waiver drift. Waivers that were meant to be temporary become permanent. The underlying finding is never resolved. The waived rule effectively does not exist for the part of the codebase it was suppressed against. The policy system reports compliance for a region of the code that is not actually compliant.
7.2 Why Drift Is the Primary Failure Mode
In our engagement data, 72 percent of policy system failures after the first six months trace to some form of drift. Only 28 percent trace to initial authoring quality. This ratio suggests that the dominant cost of a policy system is not writing the initial rules — it is keeping them current as the codebase evolves.
The implication is architectural. A policy system that is easy to author but hard to maintain is worse than a policy system that is moderately harder to author but easy to maintain. Teams that optimize for authoring speed and ignore maintenance burn through their initial investment within a year. Teams that build for maintenance hold their gains for multiple years.
7.3 Four Drift Metrics
We measure drift along four axes, each addressing a different form.
Evidence freshness. How recently was the evidence the rule checks produced? For evidence derived from compiled modules, this is the timestamp of the _build directory. For AST-based evidence, this is the mtime of the source file at the time it was parsed. A rule checking evidence more than 48 hours old in an actively-developed codebase is likely producing false positives.
Rule hit rate. How often does each rule produce findings? A rule that has not fired in six months is either perfectly enforced (unlikely), describing a pattern that no longer exists in the codebase (likely), or checking evidence that is no longer available (very likely). The distribution of hit rates across the rule set is a leading indicator of drift. A healthy rule set shows a long-tail distribution — a few rules fire often, most fire occasionally, a small number never fire. A degenerate distribution with too many zero-fire rules indicates the policy is no longer describing the codebase.
Waiver age. How old is each active waiver? Waivers are bounded by design — they expire. But many teams set long expiration dates. The median age of active waivers is a direct measure of how much of the codebase is effectively exempt from policy. A median waiver age above 90 days is a warning. Above 180 days, the policy system is partially fictional.
Convention coverage. What percentage of the codebase's actual conventions are encoded as rules? This is the hardest metric to measure because the denominator — "total conventions" — is not known. We estimate it by sampling: we ask a senior engineer to list the 20 most important invariants in the codebase, then check how many are covered by rules. Coverage below 50 percent means the policy is pulling much less weight than it could. Above 80 percent means the policy is close to its practical ceiling.
7.4 Drift Budget
Each of the four metrics can be bounded by an explicit budget.
Table 3. Example drift budgets for a mature policy system
| Metric | Target | Warning | Critical |
|---|---|---|---|
| Evidence freshness (p95) | < 24 hours | 24–72 hours | > 72 hours |
| Zero-fire rules | < 5% | 5–15% | > 15% |
| Median waiver age | < 60 days | 60–120 days | > 120 days |
| Convention coverage | > 75% | 50–75% | < 50% |
A team that establishes these budgets and tracks them as part of normal engineering ceremonies keeps drift bounded. A team that does not track drift at all will exceed one or more of these thresholds within a year.
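A sketch of an automated check against the Table 3 thresholds; the metric names are invented, and collecting the metric values is elided:
@budgets [
  {:evidence_freshness_p95_hours, [warning: 24, critical: 72, direction: :max]},
  {:zero_fire_rule_pct, [warning: 5, critical: 15, direction: :max]},
  {:median_waiver_age_days, [warning: 60, critical: 120, direction: :max]},
  {:convention_coverage_pct, [warning: 75, critical: 50, direction: :min]}
]

def drift_status(metrics) do
  for {name, opts} <- @budgets do
    value = Map.fetch!(metrics, name)
    {name, value, status(value, opts)}
  end
end

# :max metrics degrade as they rise; :min metrics degrade as they fall.
defp status(value, opts) do
  case opts[:direction] do
    :max ->
      cond do
        value > opts[:critical] -> :critical
        value > opts[:warning] -> :warning
        true -> :ok
      end

    :min ->
      cond do
        value < opts[:critical] -> :critical
        value < opts[:warning] -> :warning
        true -> :ok
      end
  end
end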
7.5 Drift Rebaselines
When drift exceeds budget, the remedy is a deliberate rebaseline — a scheduled review of the policy system against the current state of the codebase. A rebaseline has four activities:
Zero-fire audit. For every rule that has not produced a finding in the measurement window, decide: is this rule still relevant? If yes, confirm it is still wired correctly. If no, retire it with a changelog entry.
Waiver cleanup. For every active waiver, review the underlying finding. Is the work still scheduled? Is the exemption still justified? Renew, revise, or retire each waiver explicitly.
Convention audit. For each of the team's top-20 invariants, check coverage. Add rules for uncovered invariants. Retire rules for invariants that are no longer applicable.
Evidence cache invalidation. Clear the evidence cache and run a full scan. This surfaces any rules whose evidence has drifted to a stale source.
A rebaseline is a one-engineer-week investment, typically scheduled quarterly. Teams that skip rebaselines accumulate drift that eventually forces an emergency rebuild of the policy system. Teams that rebaseline regularly never face that cost.
7.6 Drift and AI Rules
The drift story is different for AI rules. Policy rules drift from the code. AI rules drift from each other. A directive set that grows over time accumulates contradictions: "prefer X" in one rule, "avoid X in case Y" in another, "ignore the previous rules when Z" in a third. The AI's ability to follow the rule set degrades as contradictions accumulate, even if no individual rule is wrong.
The right remedy is the same — periodic rebaselines that reconcile the rule set — but the evidence is different. For policy rules, the evidence is "does this rule produce findings against current code?" For AI rules, the evidence is "does the agent's behavior consistently align with this rule across a sample of recent sessions?" The two require different audit procedures.
A mature quality system tracks drift in both layers. Policy drift is measured against the codebase. AI rule drift is measured against agent behavior. Both degrade over time without deliberate maintenance.
8. Accuracy, Coverage, and the 80-90 Percent Ceiling
This section examines the accuracy numbers we report more rigorously. Where do they come from, how are they measured, and what should a team realistically expect from their own deployment?
8.1 Definitions
Accuracy is the percentage of actual invariant violations that the combined system (policy + AI) catches. A violation is an instance where code ships that the team would, on review, agree violates an invariant they care about.
Precision is the percentage of findings that correspond to real violations. A precision of 90 percent means one in ten findings is a false positive.
Recall is the percentage of real violations that produce a finding. A recall of 80 percent means two in ten real violations escape to production.
Coverage is the percentage of the team's enumerated invariants for which at least one rule exists. Coverage differs from recall in that coverage measures "is there a rule?" while recall measures "does the rule fire when it should?"
8.2 How We Measure
Measurement requires ground truth. We generate ground truth by sampling: we take a recent sample of merged pull requests, have senior engineers review them against the team's enumerated invariants, and record every violation they find. This is expensive — each PR takes roughly twenty minutes to audit properly — so we sample rather than exhaustively review.
We then run the combined policy + AI system against the same sample and compare findings against the ground truth. Accuracy is the overlap between findings and violations.
In the twelve engagements where we have run this measurement over multiple quarters, the accuracy numbers cluster tightly: 82 to 91 percent, with a median of 86 percent. The numbers are stable across codebase size, language, and team size, as long as the combined system is in place and maintained.
8.3 What Pushes the Number Up
Within the 80-90 percent band, specific choices correlate with higher numbers.
Teams that maintain evidence hierarchies strictly — avoiding regex, preferring introspection — report higher precision, typically 90-95 percent. Teams that tolerate regex-heavy rules report 75-85 percent precision, which erodes trust and leads to rule suppression.
Teams that run policy in CI on every PR report higher recall than teams that run policy only on schedule. The difference is not about the rules themselves; it is about the feedback loop. Rules that fire on every PR force the team to keep them current.
Teams that rebaseline quarterly report more stable numbers than teams that rebaseline reactively. The difference widens over time — a team that has not rebaselined in a year typically drops to 70-75 percent accuracy even if they started at 88 percent.
8.4 What Caps the Number
The ceiling at 90 percent is empirical, not theoretical. The three factors that bound it map to the three failure modes described earlier: human-judgment invariants, novel cases, and unavoidable drift.
We have seen engagements reach 92-93 percent accuracy briefly, immediately after a rebaseline or after a senior engineer spends a month writing new rules. These numbers are not stable. They revert to the 82-91 percent band within two quarters as new conventions appear, new anti-patterns surface, and waivers accumulate.
Reaching higher than 90 percent reliably would require either pushing deterministic rules into territory that demands interpretation (which means weaker rules, which means worse precision) or dramatically more human review time (which means the policy system is no longer saving reviewer effort). Both trades undo the reason for having a policy system in the first place.
8.5 What Happens Below 80 Percent
When the combined system falls below 80 percent accuracy, the team stops trusting it. Trust is the load-bearing property of a policy system. Below 80 percent, developers stop reading findings, waivers accumulate indiscriminately, and the rules that do fire get suppressed rather than fixed. The system enters a failure cascade where reduced trust produces more waivers, which produces lower effective accuracy, which produces even less trust.
The 80 percent threshold is not arbitrary. It corresponds roughly to the point at which a finding is more likely to be a real violation than a false positive. Below that threshold, the rational response to a finding is "probably noise." Above that threshold, the rational response is "probably real." The difference in team behavior at these two thresholds is enormous.
Keeping the system above 80 percent is therefore not a quality-of-life concern. It is a survival concern for the policy system itself. A policy system below 80 percent accuracy is on a trajectory to irrelevance regardless of how good its initial design was.
9. Operational Patterns
This section collects the operational patterns we have seen across engagements that kept policy systems healthy over multi-year timescales.
9.1 Policy in CI
The policy engine should run on every pull request. Not on main after merge. Not nightly. On every PR, before merge, with the results surfaced in the PR interface.
- name: Run policy
run: mix wixor.policy --gate-context merge --changed-only --base origin/main
The --changed-only flag is critical. A full-repo scan on every PR is too slow. An incremental scan completes fast enough to not block the PR workflow.
The --gate-context merge flag is also critical. Not every finding should block merge. Release gates run at release time, not at merge time. The same rule set drives both, with different exit-code semantics.
9.2 Policy in the Development Loop
Developers should run policy locally before pushing, not after CI fails. The only way this happens reliably is if the local run is fast.
mix wixor.policy --changed-only
A run that completes in under ten seconds integrates into the normal flow. A run that takes minutes gets skipped. Every optimization — evidence caching, incremental scans, lazy context loading — exists so that the developer loop stays fast enough for the tool to be used.
9.3 Policy in AI Sessions
The policy engine should be invocable from inside AI agent sessions. An agent that can run mix wixor.policy on its own output gets immediate feedback, can self-correct, and produces higher-quality work on the first pass.
This is one of the cleanest demonstrations of the complementarity principle. The AI agent is the non-deterministic layer. The policy engine is the deterministic layer. Giving the agent access to the policy engine lets the two layers compose in the same session: the agent produces output, the policy engine evaluates it, the agent revises based on the evidence. The feedback loop is tight enough that accuracy rises noticeably within a single session.
9.4 Policy in Review
Human reviewers should use policy output as the starting point for review, not as a separate signal. A PR with a clean policy run has already cleared the deterministic layer. The reviewer's time is best spent on the interpretive layer — design judgment, naming, architectural fit — not on catching structural violations the policy would have caught.
This has an organizational implication. Reviewers who spend time catching structural violations that policy should have caught are reviewing the wrong things. The right response to "reviewer caught a violation that policy missed" is not "the reviewer did great work." It is "why didn't policy catch this?" — followed by a new rule that prevents the class of failure in the future.
9.5 Policy as an Onboarding Accelerator
A well-maintained policy system is a remarkably effective onboarding tool. New engineers learn the team's conventions not by reading documentation or asking questions, but by writing code, seeing policy findings, and correcting them. The feedback is immediate, specific, and correct. The learning curve that took weeks on a team without policy takes days on a team with it.
This is a second-order benefit. It does not appear in any policy metric directly. It appears in ramp-up time, first-PR quality for new hires, and the frequency of "obvious" questions in team chat. Teams that track onboarding outcomes consistently report 30-50 percent faster ramp-up after policy adoption.
9.6 Policy as a Migration Tool
Large refactors and migrations become tractable when the target state can be encoded as policy. The team writes rules describing the desired end state. The rules fire against every violation. The team resolves violations one by one, watching the finding count drop toward zero. The migration completes when the finding count is zero.
This is qualitatively different from migrating with ad hoc tracking. A spreadsheet listing files to change goes stale the moment someone modifies a file. A policy rule stays current as long as the evidence hierarchy does not change. Every commit automatically updates the remaining work.
We have used this pattern to migrate test frameworks, rename internal modules, extract shared libraries, and change authentication patterns across monorepos of several million lines. The pattern scales in a way that spreadsheet-driven migrations do not.
10. Organizational Requirements
A policy system is not free. It requires organizational commitment at several levels. Teams that do not make this commitment explicitly will find their policy system degrading within six to twelve months regardless of the initial quality.
10.1 The Policy Owner
A policy system requires an owner — a senior engineer with the authority to author, retire, and modify rules. Without a clear owner, rule changes become political. Rules get added without being retired. Contradictions accumulate. Drift goes unaddressed.
The owner does not need to write every rule. They do need to be the person who decides what the policy system is for, what it will and will not cover, and when a rebaseline is needed. In our experience, a policy owner typically spends 10-15 percent of their time on policy work — enough to keep the system current, not so much that it dominates their role.
10.2 Review Discipline
The team must actually respond to policy findings. This sounds obvious. In practice, teams under schedule pressure often disable rules, mass-waive findings, or skip the policy step entirely. Each of these undoes the system's value.
The discipline required is not complex. Findings are either fixed, waived with a reason and an expiration, or escalated to the policy owner for a rule change. There is no fourth option. Teams that tolerate "just ignore it for now" accumulate drift that compounds rapidly.
10.3 Investment Model
A policy system requires ongoing investment at a predictable rate:
- Initial authoring: 2-4 weeks of senior engineer time for a baseline policy
- Ongoing rule additions: 1-2 rules per week as the codebase evolves
- Quarterly rebaselines: 1 engineer-week every three months
- Tooling investment: incremental, as the engine needs new capabilities
The total cost is roughly 10-15 percent of one senior engineer's time, ongoing, after the initial 2-4 weeks. This is comparable to the cost of maintaining a moderate-size test suite or a linting configuration.
The return on that investment is difficult to measure directly, but the indirect signals are consistent: faster onboarding, fewer repeat review comments, higher AI agent first-pass quality, cleaner migrations, and measurable reductions in escaped defects. Teams that run the numbers typically find the return is 5-10x the investment, though the magnitude depends heavily on codebase size and team composition.
10.4 Relationship to AI Rule Authoring
The team that owns Policy as Code should also own the AI directive set. The two systems are complements, but they are written by the same kind of thinking: "what does this codebase actually require that a general-purpose tool would not know?" Splitting ownership between a "policy team" and an "AI team" leads to redundancy in some areas and gaps in others.
A single owner — or a single small group — maintaining both keeps the two layers coherent. Rules that can be expressed deterministically go into the policy engine. Rules that require judgment go into the AI directive set. A rule never appears in both, because duplication invites divergence.
11. Limitations and Open Questions
11.1 The 80-90 Percent Band Is Not Universal
The accuracy numbers we report come from a specific kind of codebase: multi-app Elixir/Phoenix monorepos, React/TypeScript web applications, and Node/Python API platforms. We have less data on embedded systems, mobile applications, and data pipelines. The principles should transfer, but the specific accuracy numbers may not.
11.2 Evidence Hierarchies Vary by Language
The evidence hierarchy we describe is specific to ecosystems with strong introspection surfaces. Languages with weaker runtime introspection — C, Rust, Go — push more rules toward AST analysis. Languages with weaker AST analysis tooling push rules toward filesystem presence checks. The general principle (prefer higher-quality evidence) holds, but the specific ordering may differ.
11.3 Small Teams May Not Need the Full Apparatus
A team of three engineers working on a single codebase may not benefit from a formal policy engine. The overhead of authoring and maintaining rules may exceed the benefit of catching violations mechanically when everyone on the team already knows every convention. Policy as Code becomes valuable when the team is too large or the codebase too complex for everyone to hold the conventions in their heads.
In our experience, the threshold is somewhere around five engineers or 100,000 lines of code. Below that, informal conventions usually work. Above that, formalization starts to pay off. Well above that — twenty engineers, a million lines — formalization is no longer optional.
11.4 Model Capabilities May Shift the Balance
As AI models become more capable, the ceiling for AI-only rule enforcement may rise. A future model with perfect long-term memory and perfect convention inference might approach deterministic behavior on structural rules. If that happens, the balance between policy and AI rules shifts.
We do not expect this to eliminate the need for Policy as Code. Deterministic enforcement has value that goes beyond accuracy — it provides auditability, uniform application across non-AI-authored code, and build-gating semantics that a probabilistic system cannot match. But the optimal split between the two layers may move.
11.5 Policy Does Not Solve Design Problems
A policy system catches structural violations. It does not catch design problems. A module may pass every policy rule and still be a bad abstraction, a leaky interface, or a poor fit for the problem. No rule system we have designed catches these issues, and we are skeptical that such rules can be written mechanically. Design quality requires human judgment, and the policy system's role is to free up human attention for that judgment — not to replace it.
11.6 Cross-Language Policy Is Hard
In polyglot codebases — a Python backend, a TypeScript frontend, a Go service — each language needs its own evidence layer, its own rule authoring tooling, and its own engine integration. The engine can be shared at the reporting layer, but the evidence collection must be language-native. We have not yet built a cross-language policy engine that feels native in every language simultaneously, and we suspect the architectural cost may be too high to justify for most teams.
12. Conclusion
AI-assisted software development works best when a deterministic layer and a non-deterministic layer work together. The non-deterministic layer — Claude skills, memory, CLAUDE.md directives, subagent configurations — is excellent at interpretation, judgment, and handling novel context. The deterministic layer — Policy as Code — is excellent at uniform enforcement, build gating, and catching what the non-deterministic layer interprets away.
Neither layer alone reaches the accuracy that modern codebases require. AI rules alone top out around 60-75 percent coverage on the invariants teams actually care about. Generic static analysis tops out around 40-55 percent because it does not know what the team's invariants are. The combination of AI rules with project-specific Policy as Code reliably reaches 80-90 percent — the difference between a codebase that erodes under AI-assisted development and one that hardens under it.
The central practical insight of this paper is architectural. AI rules and policy rules are complements, not substitutes. They cover each other's blind spots. A mature quality system runs both, owned by the same people, maintained at the same cadence, with the same investment model. Teams that treat them as alternatives — "we have AI rules, we do not need policy" or "we have policy, we do not need AI rules" — operate below the ceiling that teams running both reach routinely.
The second insight is methodological. Drift is the primary failure mode of Policy as Code systems. Initial authoring is tractable. Maintenance is where policy systems live or die. Teams that track drift — evidence freshness, rule hit rate, waiver age, convention coverage — and rebaseline on schedule keep their policy systems healthy for years. Teams that author and walk away watch their investment decay within twelve months.
The third insight is philosophical. Policy as Code is not a silver bullet. It catches structural violations. It does not catch design problems. It does not replace human judgment. Its role in the quality system is to handle the deterministic work mechanically so that human judgment — and AI judgment — can be spent on the interpretive work that neither layer can automate.
The pursuit of a single general system that solves every software engineering problem is the same trap we described in our earlier work on agentic pipelines. There is no god factory. There is no universal model. There is no prompt that works for every project. What works is the combination of project-specific deterministic rules with project-specific non-deterministic rules, built and maintained by the team that knows the project best. The deterministic layer is Policy as Code. The work of building it — carefully, against proper evidence, with discipline about drift — is what separates the teams that use AI effectively from the teams that merely use it fast.
Appendix A: Rule Authoring Template
Use this template when authoring a new policy rule.
Rule ID: (stable identifier, e.g., WP-C042)
Domain: (contract | app | repo)
Scope: (file | package | app | repo)
Name: (short, human-readable)
Gate: (hard_fail | release_gate | warning | advisory)
Invariant (one sentence):
What invariant does this rule enforce?
Evidence:
Level 1: Runtime introspection — (yes/no, which functions)
Level 2: AST analysis — (yes/no, which nodes)
Level 3: Framework reflection — (yes/no, which surfaces)
Level 4: Filesystem presence — (yes/no, which paths)
Level 5: Regex — avoid; document justification if used
Finding Shape:
Severity: (derived from gate)
Message: (what the developer sees)
Suggested fix: (if applicable)
Autofix: (if safe to automate)
Test Cases:
Positive: (code that should pass)
Negative: (code that should fail with this finding)
Edge: (ambiguous case to verify handling)
Rationale:
Why does the project need this rule? What incident or
decision motivated it?
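For concreteness, here is a minimal sketch of a rule authored from this template, using the data-testid invariant that reappears in Appendix D. The callback names (metadata/0 and check/1) and the shape of the evidence context are illustrative assumptions, not wixor_policy's actual rule contract.

defmodule MyPolicy.Rules.LiveViewTestid do
  # Everything the template asks for, declared as data the engine can read.
  def metadata do
    %{
      id: "WP-R042",
      domain: :repo,
      scope: :file,
      name: "LiveView root declares data-testid",
      gate: :release_gate
    }
  end

  # ctx is assumed to carry pre-collected evidence: the file path, the
  # root element's line, and its attribute list as extracted from the
  # compiled template by the evidence layer.
  def check(ctx) do
    if Enum.any?(ctx.root_attrs, fn {name, _value} -> name == "data-testid" end) do
      []
    else
      [
        %{
          rule_id: "WP-R042",
          severity: :release_gate,
          message: "LiveView module missing data-testid attribute on root element",
          file: ctx.file,
          line: ctx.root_line
        }
      ]
    end
  end
end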
Appendix B: Drift Audit Checklist
Use this checklist quarterly to keep drift bounded.
Evidence freshness
- Evidence cache invalidated
- Full policy run completed against fresh build
- Compare p95 evidence age to prior quarter; note trend
Rule hit rate
- Enumerate rules that have not fired in the measurement window
- For each zero-fire rule: retire, repair, or reaffirm with reason
- Distribution of fire counts matches long-tail expectation
Waiver audit
- List all active waivers with age and reason
- Median waiver age within budget (< 60 days target)
- Expired waivers either renewed with justification or removed
- New waivers added in the last quarter reviewed for pattern
Convention coverage
- Senior engineer sample of top-20 invariants
- Coverage percentage recorded and compared to prior quarter
- Uncovered invariants triaged — rule candidates identified
- Rules covering retired invariants queued for removal
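Two of these checks reduce to simple computations over the engine's own records. A minimal sketch, assuming findings and waivers are plain maps with illustrative field names (rule_id, granted_on):

defmodule DriftAudit do
  # Rules that produced zero findings in the window: each must be
  # retired, repaired, or reaffirmed with a written reason.
  def zero_fire_rules(all_rule_ids, findings) do
    fired = MapSet.new(findings, & &1.rule_id)
    Enum.reject(all_rule_ids, &MapSet.member?(fired, &1))
  end

  # Median active-waiver age in days; the Appendix B target is < 60.
  def median_waiver_age_days(waivers, today \\ Date.utc_today()) do
    ages =
      waivers
      |> Enum.map(&Date.diff(today, &1.granted_on))
      |> Enum.sort()

    # Middle element is adequate precision for an audit checklist.
    case ages do
      [] -> 0
      _ -> Enum.at(ages, div(length(ages), 2))
    end
  end
end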
Appendix C: Rule-Versus-AI-Directive Decision Guide
Use this decision guide when deciding whether an invariant belongs in policy or in the AI directive set.
Can the invariant be checked from structural evidence
(introspection, AST, framework metadata, filesystem)?
├── Yes → Does the check require interpretation or judgment?
│ ├── No → Policy rule
│ └── Yes → Split: policy rule for the structural part,
│ AI directive for the judgment part
└── No → AI directive
(Consider: can you reshape the evidence so a policy
rule becomes possible? Often yes.)
Is the invariant uniform across the codebase, or does it
have many legitimate exceptions?
├── Uniform → Policy rule (use waivers for rare exceptions)
└── Many exceptions → AI directive (too much suppression load for policy)
Is the invariant about presence/structure, or about quality/intent?
├── Presence/structure → Policy rule
└── Quality/intent → AI directive
Will a violation of the invariant block merge?
├── Yes → Policy rule (AI directives do not gate builds)
└── No → Either layer works; prefer policy for auditability
Appendix D: Finding Schema Example
Example of a well-formed finding produced by a policy rule.
{
"id": "WP-R042:lib/ui/button_live.ex:1",
"rule_id": "WP-R042",
"domain": "repo",
"severity": "release_gate",
"message": "LiveView module missing data-testid attribute on root element",
"subject": "MyApp.UI.ButtonLive",
"file": "lib/ui/button_live.ex",
"line": 17,
"column": 7,
"suggested_fix": "Add `data-testid=\"button-live-root\"` to the root element in render/1",
"autofix_payload": {
"kind": "insert_attribute",
"at": { "file": "lib/ui/button_live.ex", "line": 17 },
"attribute": "data-testid",
"value": "button-live-root"
}
}
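In an Elixir engine, the same shape can be pinned down as a struct so every rule and every reporter shares one contract. A sketch; the module name is illustrative:

defmodule Policy.Finding do
  # Required fields mirror the non-optional keys in the JSON above.
  @enforce_keys [:id, :rule_id, :domain, :severity, :message, :file]
  defstruct [
    :id, :rule_id, :domain, :severity, :message, :subject,
    :file, :line, :column, :suggested_fix, :autofix_payload
  ]
end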
Appendix E: Comparison Table — Policy as Code vs AI Rules vs Linting
| Dimension | Linting | AI Rules | Policy as Code |
|---|---|---|---|
| Project specificity | Low (generic rules) | High (directive set) | High (rule authorship) |
| Determinism | High | Low | High |
| Interpretive ability | None | High | None |
| Gate semantics | Build fails | Advisory only | Configurable per rule |
| Applies to non-AI code | Yes | No | Yes |
| Authoring cost | Low | Low to moderate | Moderate to high |
| Maintenance cost | Low | Moderate | Moderate |
| Auditability | High | Low | High |
| Onboarding acceleration | Low | Moderate | High |
| Migration tooling fit | Poor | Poor | Excellent |
| Typical accuracy ceiling | 40–55% | 60–75% | 50–70% (alone) |
| Accuracy ceiling combined with AI | N/A | N/A | 80–90% |
Appendix F: Anti-Pattern Catalog for Policy Authoring
Common mistakes that turn a policy system into an obstacle rather than an asset.
Anti-Pattern: Regex as Source of Truth
FAILURE: Rule uses String.match?(source, ~r/@moduledoc/) to check that a module is documented.
Why it's wrong: Regex matches textual presence, not semantic presence. The regex fires on a @moduledoc mentioned in a comment, in the documentation of a different module in the same file, or in a string literal. It misses modules that declare @moduledoc dynamically via metaprogramming.
CORRECT: Use Code.fetch_docs(module) to query the documentation the compiler actually produced. This is Level 1 evidence: runtime introspection of the compiled artifact.
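A sketch of the corrected check, assuming the module is already compiled and loaded. The wrapper is illustrative, but Code.fetch_docs/1 and its return shapes are Elixir's real documentation API:

defmodule PolicySketch.ModuledocCheck do
  # Level 1 evidence: ask the compiler's output, not the source text.
  def has_moduledoc?(module) do
    case Code.fetch_docs(module) do
      # A language -> doc map means real documentation exists.
      {:docs_v1, _anno, _lang, _format, %{} = _doc, _meta, _entries} -> true
      # :none, :hidden, or {:error, reason}: treat as undocumented.
      _ -> false
    end
  end
end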
Anti-Pattern: Undeclared Scope
FAILURE: Rule is written without a scope declaration. Engine runs it against every file in the repository, including generated files, vendored dependencies, and test fixtures.
Why it's wrong: Unscoped rules produce torrents of false positives and slow scans to a crawl. They also communicate poor rule hygiene to other authors.
CORRECT: Declare scope explicitly in metadata. Use :file scope with a path filter, :package scope for package-level invariants, :app for application-level, :repo for repository-wide.
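A sketch of an explicit scope declaration in rule metadata; the field names are illustrative:

%{
  id: "WP-R042",
  scope: :file,                 # one of :file | :package | :app | :repo
  paths: ["lib/**/*_live.ex"],  # path filter: only LiveView sources
  exclude: ["deps/**", "priv/generated/**", "test/fixtures/**"]
}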
Anti-Pattern: Permanent Waivers
FAILURE: Waivers are added with expires: :never as a matter of routine.
Why it's wrong: Permanent waivers mean the rule effectively does not apply to the waived code. Over time, the waived region grows, the rule's coverage shrinks, and the policy system describes a decreasing share of the actual codebase.
CORRECT: Default to time-bounded waivers (90 days or less). Require explicit justification for longer waivers. Review waiver aging as part of quarterly drift audits.
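A sketch of a time-bounded waiver and its expiry check; the field names are illustrative:

waiver = %{
  rule_id: "WP-R042",
  subject: "lib/ui/legacy_button_live.ex",
  reason: "Legacy view scheduled for deletion in the Q3 migration",
  granted_on: ~D[2025-07-01],
  expires: ~D[2025-09-29]   # 90 days out; never expires: :never
}

# The engine honors the waiver only while it is current.
active? = Date.compare(Date.utc_today(), waiver.expires) != :gt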
Anti-Pattern: Silent Rules
FAILURE: A rule fires, produces a finding, but the finding is tagged advisory and never surfaced in CI output.
Why it's wrong: A finding no one sees is equivalent to a finding that does not exist. Findings from advisory rules that are never surfaced accumulate without resolution.
CORRECT: Every rule produces findings that appear in some report visible to the team. Advisory findings go into weekly quality reports. Warning findings surface in PR comments. Gate findings block merges. Each severity has a surface.
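A sketch of the severity-to-surface routing this implies, using the gate values from Appendix A; the surface names are illustrative:

defmodule PolicySketch.Surfaces do
  # Every severity maps to a visible surface; none is allowed to vanish.
  def surface_for(:advisory), do: :weekly_quality_report
  def surface_for(:warning), do: :pr_comment
  def surface_for(:release_gate), do: :merge_block
  def surface_for(:hard_fail), do: :build_failure
end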
Anti-Pattern: Conflicting Policy and AI Rules
FAILURE: Policy rule enforces pattern A. AI directive set instructs the agent to use pattern B. Agent produces pattern B. Policy fails. Author has to resolve the conflict manually.
Why it's wrong: The two layers should not contradict. When they do, the agent cannot succeed regardless of which layer it prioritizes.
CORRECT: Policy rules and AI directives are authored by the same owner. When a rule is added to policy, the corresponding directive is updated or removed. When an AI directive changes, the corresponding policy rule is updated. The two layers describe the same world consistently.