
This Shit is Hard: How AI keeps our code on standard

Maxime Gréau, Principal Software Engineer

How a fleet of AI agents keeps hundreds of modules on standard — and auto-merges the safe changes.

Keeping a large monorepo on-standard doesn't scale the way you'd hope. The guidelines exist — error-handling conventions, approved libraries, how a GitHub Actions workflow gets pinned — and our engineers know them well. But every standard still has to be kept current as it evolves and applied across hundreds of modules on every change, and at that size, no team can keep pace by hand: the documentation lags the code, a convention drifts, a workflow slips out of policy in a corner nobody looked at this week.

So rather than spend engineering time re-applying the same standards by hand, we built a system to carry that work for us. Each standard lives as a machine-readable artifact, and a fleet of agents built on DriftlessAF — our agentic reconciliation framework — continuously keeps the codebase converging on those standards. 

Across our repositories over the past 8 weeks, that system opened more than 4,700 standards-fix pull requests and auto-merged 75% of those it judged low-risk, with no human in the loop. This post shares how it works and why the thing that makes it safe isn't trusting the AI more, but never letting it merge alone: every auto-merge requires the AI's judgment and deterministic checks to agree.

Write the standard once. Apply it everywhere.

The shift is small to state and large in effect: a standard isn't a paragraph in a doc, it's an artifact an agent can load. You author it once, and it works in two directions. It reviews new code as it comes in, checking each incoming change against the standard before it merges. And it sweeps the existing codebase, opening a fix wherever a module has drifted out of line. One authored artifact, applied to every change landing and every module already there, continuously.

Figure 1: One authored standard, reconciled across every module in parallel. Most pass; the system flags the one that drifted.

A fleet of single-purpose agents: reviewer, skillfixer, autofixer, loganalyzer, judge

We run many narrow agents, each with exactly one job and one definition of done. A reviewer finds violations against the loaded standards, and a skillfixer resolves them. When a change breaks CI, a separate autofixer repairs it, leaning on a LogAnalyzer that distills noisy build logs into a root cause. A judge grades that work offline so we can tell when a standard or a prompt regresses. And the confidence bot scores how safe a finished pull request is to merge, which is the signal that lets low-risk changes land without a human. Reconcilers wire the agents together.

Part 1 — Skills: A standard that runs

The artifact we encode each standard as is a Skill. It's how an agent knows what "correct" means before it reviews or fixes any code.

What a Skill is, and why it isn't just a prompt

A Skill is a structured instruction set that an agent loads on demand. It follows the open agentskills.io specification, and the shape is deliberately simple: a name, a description that acts as the trigger, a body of direct instructions with concrete right-and-wrong examples, and optional reference files for the long tail of detail. Here's the gist of one of ours, go-standards:

---
name: go-standards
description: Enforces Go conventions for this repository. Use when writing,
  reviewing, or fixing Go code — syntax preferences, library choices,
  error handling, and package design.
---

# Go Standards

## Error wrapping

Always wrap errors with context using %w. Never discard the original error.

Correct:
    return fmt.Errorf("read config %q: %w", path, err)

Wrong:
    return errors.New("failed to read config")

**Audit:** grep for `errors.New` in code paths that already hold an `err`.
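To see why the rule insists on %w, here is a minimal sketch, not taken from the Skill itself. The `load` helper and the `errMissing` sentinel are hypothetical stand-ins for a real I/O failure. It shows that a wrapped error still lets a caller detect the original cause with `errors.Is`, while a freshly created error does not.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissing is a stand-in for a real I/O failure such as a missing file.
var errMissing = errors.New("file does not exist")

// load is a hypothetical reader that always fails, to keep the demo small.
func load(path string) error { return errMissing }

// wrapped follows the rule: add context and keep the cause with %w.
func wrapped(path string) error {
	if err := load(path); err != nil {
		return fmt.Errorf("read config %q: %w", path, err)
	}
	return nil
}

// discarded breaks the rule: it holds an err but returns a brand-new one.
func discarded(path string) error {
	if err := load(path); err != nil {
		return errors.New("failed to read config")
	}
	return nil
}

// causeSurvives reports whether a caller can still detect the original error.
func causeSurvives(err error) bool { return errors.Is(err, errMissing) }

func main() {
	fmt.Println(causeSurvives(wrapped("app.yaml")))   // cause kept
	fmt.Println(causeSurvives(discarded("app.yaml"))) // cause lost
}
```

The `discarded` function is exactly the shape the audit line looks for: an `errors.New` in a code path that already holds an `err`.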

What makes this more than a clever prompt is how it loads. Skills use progressive disclosure: the agent sees only each skill's name and description up front (about a hundred tokens each) and reads the full instructions only when a skill is relevant, and the deeper reference files only if needed. So it pulls in the Go conventions when it's looking at Go, the GitHub Actions hardening rules when it's looking at a workflow, the security-review patterns when it's looking at auth code, and nothing else.
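The first stage of progressive disclosure can be sketched as a reader that takes only the name and description from a SKILL.md and stops before the body. This is an illustration under stated assumptions, not the loader our agents use: `SkillSummary` and `parseSummary` are hypothetical names, and the parser handles only the two frontmatter keys shown above rather than general YAML.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// SkillSummary is the small part of a Skill that is visible up front.
type SkillSummary struct {
	Name        string
	Description string
}

// parseSummary reads the frontmatter between the two "---" lines and returns
// as soon as it closes, so the instruction body is never read at this stage.
func parseSummary(skillMD string) (SkillSummary, error) {
	var s SkillSummary
	sc := bufio.NewScanner(strings.NewReader(skillMD))
	if !sc.Scan() || strings.TrimSpace(sc.Text()) != "---" {
		return s, fmt.Errorf("missing frontmatter opening delimiter")
	}
	lastKey := ""
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "---" {
			if s.Name == "" || s.Description == "" {
				return s, fmt.Errorf("frontmatter needs name and description")
			}
			return s, nil
		}
		// Indented lines continue the previous value (a folded description).
		if strings.HasPrefix(line, " ") {
			if lastKey == "description" {
				s.Description += " " + strings.TrimSpace(line)
			}
			continue
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		lastKey = strings.TrimSpace(key)
		switch lastKey {
		case "name":
			s.Name = strings.TrimSpace(value)
		case "description":
			s.Description = strings.TrimSpace(value)
		}
	}
	return s, fmt.Errorf("missing frontmatter closing delimiter")
}

// sampleSkill is a shortened, illustrative SKILL.md.
const sampleSkill = `---
name: go-standards
description: Enforces Go conventions for this repository. Use when writing,
  reviewing, or fixing Go code.
---

# Go Standards

Always wrap errors with context using %w.
`

func main() {
	s, err := parseSummary(sampleSkill)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %s\n", s.Name, s.Description)
}
```

Only the summary goes into the agent's context at the start. Reading the body and the reference files would be separate, later steps, taken only when the description matches the task at hand.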

And go-standards is one Skill among many — gha-standards for how workflows get pinned and permissioned, go-security-review for the patterns a security engineer looks for. Different kinds of "correct" — style, CI safety, security posture — same shape of artifact, applied by the same agents. Standardizing a new domain means writing a new Skill, not building a new tool.

Skills are code: versioned, reviewed, and eval-tested

We treat a Skill with the same rigor as the application code it governs. A standard that is automatically applied across hundreds of modules has to be right, so a change to a Skill goes through the same pull request, review, and CI as any other change. There's even a Skill for that: skill-authoring encodes what a good Skill looks like, and the reviewer uses it to check every new SKILL.md the way it checks Go code.

That CI does two things:

| Check | What it verifies | Catches |
| --- | --- | --- |
| Spec compliance | The mechanical contract: valid frontmatter, a description with a real trigger clause, a sane directory structure, a body under the line limit. | A Skill that won't load or trigger. |
| Golden evals | That the Skill works: given files with known violations, the reviewer agent armed with this Skill flags exactly the right lines, and stays quiet on clean code. | A Skill that triggers but reviews badly. |

The grading in that second row is performed by an agent, our Judge, which scores the reviewer's output against the expected result.

That second row matters most. An eval is a test case for behavior: a file with known problems, paired with the review a correct agent should produce. Evals are what let us trust an agent at all, because agents are non-deterministic. The same Skill can produce slightly different reviews from one run to the next. That's also why a handful of cases isn't enough: in a four-case suite, a single flaky failure drops the Skill below its pass threshold, and the score can't tell a real regression from an unlucky run. We keep ten or more cases per Skill, so the threshold reflects whether the Skill actually works rather than how one run happened to go.
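The arithmetic behind that choice is easy to check. The sketch below assumes a pass threshold of 80%, which is a hypothetical figure chosen for illustration; the real threshold is not stated here. One flaky failure sinks a four-case suite but not a ten-case one.

```go
package main

import "fmt"

// assumedThreshold is a hypothetical pass bar, chosen only for illustration.
const assumedThreshold = 0.80

// passes reports whether a suite clears the bar given how many cases failed.
func passes(total, failed int) bool {
	if total <= 0 {
		return false // an empty suite proves nothing
	}
	rate := float64(total-failed) / float64(total)
	return rate >= assumedThreshold
}

func main() {
	for _, total := range []int{4, 10} {
		fmt.Printf("%d cases, 1 flaky failure: pass=%v\n", total, passes(total, 1))
	}
}
```

With four cases a single failure leaves 3/4 = 75%, below the assumed bar. With ten cases the same failure leaves 9/10 = 90%, so only a second or third failure, which is much more likely to be a real regression, pulls the Skill under.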

Skills live in a single source-of-truth repository, and when a change lands there, an automated workflow pushes it to every other repo and bot that uses it. What runs in production is exactly what we reviewed and tested, never a stale copy.

Applying skills continuously — the reviewer and fixer agents

A Skill only does something once an agent applies it, and the bot that applies ours is skillup — a reconciler that watches the codebase and, when a relevant file changes, runs two agents in turn. The reviewer agent loads the Skills that match the changed files and reports findings anchored to specific lines, posted back as a GitHub Check Run. The skillfixer agent goes a step further: it produces the conforming change and opens a pull request.

It's a shift in what reviewing means. Instead of applying the same judgment by hand on PR after PR, you author a policy once that captures the intent behind those reviews, and the system applies it everywhere. It's the only way this kind of work scales.


Part 2 — The agent loop: landing updates at scale

Part 1 made one authored standard trustworthy enough to apply everywhere. This is the second half of the problem: applying and landing those changes at the repo's speed. Two things have to hold: the machinery that opens the PRs has to be safe to run forever, and the thing that merges them has to be safe to trust.

Keeping the codebase converging, on every change

DriftlessAF gives us reconcilers, and that's the engine we point at the standards problem. A reconciler is level-based: it doesn't fire once on an event and hope it worked — it keeps asking whether the codebase matches the desired state and closes the gap when it doesn't. Applied to standards, the question isn't "did someone violate go-standards in this PR?", it's "does every module still match go-standards?". This question is asked continuously, with every change.

That's what makes the system durable instead of a one-shot script. If a fix run hits a flaky model call or a timeout, nothing is lost; the next pass reconciles the same gap. If someone reintroduces a convention we'd already cleaned up, the next pass catches it the same way it did the first time. The standards keep moving, and the code keeps changing, and the reconciler just keeps converging the two.
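A level-based pass can be sketched in a few lines. This is a toy model under stated assumptions, not DriftlessAF's API: `Module`, `reconcile`, and the flaky fixer are hypothetical names. The point it illustrates is that each pass looks at current state rather than at events, so a failed fix is simply picked up again on the next pass.

```go
package main

import (
	"errors"
	"fmt"
)

// Module is a hypothetical unit the reconciler watches.
type Module struct {
	Name     string
	Conforms bool // observed state: does it match the standard right now?
}

// reconcile makes one pass over current state and tries to close every gap.
// It returns how many gaps are still open; those are left for the next pass.
func reconcile(mods []*Module, fix func(*Module) error) int {
	remaining := 0
	for _, m := range mods {
		if m.Conforms {
			continue // already at the desired state
		}
		if err := fix(m); err != nil {
			remaining++ // nothing is lost; the gap is still visible next time
			continue
		}
		m.Conforms = true
	}
	return remaining
}

// newFlakyFix returns a fixer that fails on its first attempt per module,
// standing in for a model timeout.
func newFlakyFix() func(*Module) error {
	seen := map[string]bool{}
	return func(m *Module) error {
		if !seen[m.Name] {
			seen[m.Name] = true
			return errors.New("model call timed out")
		}
		return nil
	}
}

// passesUntilConverged runs passes until no gap remains, or returns -1.
func passesUntilConverged(mods []*Module, fix func(*Module) error, maxPasses int) int {
	for pass := 1; pass <= maxPasses; pass++ {
		if reconcile(mods, fix) == 0 {
			return pass
		}
	}
	return -1
}

func demoModules() []*Module {
	return []*Module{
		{Name: "api", Conforms: true},
		{Name: "billing", Conforms: false},
		{Name: "auth", Conforms: false},
	}
}

func main() {
	fmt.Println("converged after passes:", passesUntilConverged(demoModules(), newFlakyFix(), 5))
}
```

Because the pass is idempotent, running it again on a converged codebase does nothing, and running it after someone reintroduces drift closes the new gap the same way.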

We get this across the whole fleet because it's the same machinery underneath. skillup (the reviewer and skillfixer that enforce standards), autofix (the fixer that repairs broken CI, which leans on a LogAnalyzer agent to read the logs), and the confidence bot are all reconcilers built on the same foundation, so durability, retries, and concurrency control are solved once and reused. Each one composes one or more agents behind a single service. One pattern, many appliers.

Figure 2: The agent loop. Reviewer and skillfixer act on every change; a stack of deterministic CI checks gates it, with an autofix sub-loop that repairs and re-runs on failure. Only a deterministic gate decides what auto-merges.

When CI breaks, the fixer agent repairs it — and only what's broken

Reconcilers open many PRs, and some of them break CI. That's where the autofixer — the fixer agent in our autofix bot — comes in. It starts from the failing checks and the changed files and investigates from there, the way a human would.

Its first move is usually to look at the logs. But CI logs are thousands of lines of build output, test chatter, and compiler noise, and the actual failure is a few lines somewhere in the middle. So the fixer doesn't read them directly. It calls the LogAnalyzer agent as a tool, and the LogAnalyzer's entire job is to turn that wall of text into one or two structured root causes: an error type, a message, a file and line, the failing command, and a snippet of context with an explanation of why it matters. Most of the time, a simple rule does the work: the errors right before a non-zero exit are usually the cause. The goal is to surface the one or two failures that actually matter.

{
  "summary": "Build failed: reference to a function that was renamed.",
  "failures": [
    {
      "type": "build",
      "severity": "error",
      "error_message": "undefined: validateConfig",
      "location": { "file_path": "config/loader.go", "line": 42 },
      "failing_command": "go build ./...",
      "context": [
        {
          "content": "config/loader.go:42:9: undefined: validateConfig",
          "why_relevant": "Last error before `go build` exited non-zero; validateConfig was renamed to validate."
        }
      ]
    }
  ]
}
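The rule of thumb about errors just before a non-zero exit can be sketched deterministically. The code below is an illustration, not the LogAnalyzer itself, which is an agent. It assumes a log that contains an `exit code N` line and Go compiler diagnostics of the form `file.go:line:col: message`. The struct fields mirror a subset of the JSON above; the function and pattern names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Location and Failure mirror a subset of the structured output shown above.
type Location struct {
	FilePath string `json:"file_path"`
	Line     int    `json:"line"`
}

type Failure struct {
	Type         string   `json:"type"`
	ErrorMessage string   `json:"error_message"`
	Location     Location `json:"location"`
}

var (
	// exitRe assumes the CI runner prints "exit code N" when a step ends.
	exitRe = regexp.MustCompile(`exit code (\d+)`)
	// diagRe matches Go compiler diagnostics: path.go:line:col: message.
	diagRe = regexp.MustCompile(`^(\S+\.go):(\d+):\d+: (.+)$`)
)

// rootCauses walks backwards from the first non-zero exit and keeps the
// nearest compiler diagnostics, at most limit of them.
func rootCauses(log string, limit int) []Failure {
	lines := strings.Split(log, "\n")
	exitAt := -1
	for i, l := range lines {
		if m := exitRe.FindStringSubmatch(l); m != nil && m[1] != "0" {
			exitAt = i
			break
		}
	}
	var out []Failure
	for i := exitAt - 1; i >= 0 && len(out) < limit; i-- {
		m := diagRe.FindStringSubmatch(strings.TrimSpace(lines[i]))
		if m == nil {
			continue // build chatter, not a diagnostic
		}
		n, err := strconv.Atoi(m[2])
		if err != nil {
			continue // line number too large to be real; skip it
		}
		out = append(out, Failure{
			Type:         "build",
			ErrorMessage: m[3],
			Location:     Location{FilePath: m[1], Line: n},
		})
	}
	return out
}

// sampleLog is an invented log in the assumed format.
const sampleLog = `ok   example.com/app/util 0.012s
# example.com/app/config
config/loader.go:42:9: undefined: validateConfig
note: finished step "build"
Error: Process completed with exit code 1.`

func main() {
	out, err := json.MarshalIndent(rootCauses(sampleLog, 2), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A heuristic like this covers the common case cheaply. The explanation of why a line matters, as in the `why_relevant` field above, is the part that needs a model.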

Once it has identified the root cause, the fixer reads the relevant files, determines the fix, and implements the change. It fixes only what's broken. It doesn't refactor working code, doesn't "improve" things it notices along the way, and preserves the existing style.

The fixer returns the complete file contents, a commit message in Conventional Commits format, and a short post-mortem covering what helped, what got in the way, and what was missing. We feed that post-mortem back into the prompts, so a class of failure the agent struggles with today is one it handles better next month.

The heart — the confidence bot grades merge-safety

Between them, these agents generate far more correct PRs than a person can review and merge by hand. To land that volume, we need a reliable way to tell which changes are safe enough to merge without a human reading every line.

So we built the confidence bot, a reconciler whose only job is to answer one question: how safe is this PR to merge with minimal human review? The bot labels each PR's confidence as high, medium, or low. The label is built from several independent checks, so none of them, least of all the LLM, can push a risky change to high on its own.

The score combines a handful of cheap, deterministic signals about the change — how large it is, how many files it touches, how risky the paths are — with one LLM signal that reads the diff and judges intent, coherence, and proportionality against the standards from Part 1. The deterministic signals carry most of the weight. The model is the heaviest single input, but it's still under half the total. Enough to move a grade up or down, but never enough to decide it alone.

Two choices make the grade safe to gate on. Some paths can never score high. A change touching sensitive surfaces like production infrastructure or auth is capped and keeps going to human review, no matter what the model thinks. And the bot is conservative: when it's unsure, it scores lower, because a wrongly-confident high is far more expensive than a wrongly-cautious low. Every grade is auditable, too. Each PR gets a breakdown of which signals moved it, so the score is never a black box.
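The shape of that scoring can be sketched as a weighted sum with a cap. Everything numeric here is a hypothetical choice made for illustration: the weights, the tier cut-offs, and the signal names are assumptions, not the bot's real configuration. The sketch preserves only the properties described above: the LLM is the largest single input but under half the total, and sensitive paths can never reach high.

```go
package main

import "fmt"

// Signals are normalized to [0, 1], where 1 means "looks safe".
type Signals struct {
	Size                 float64 // deterministic: how small the change is
	FileCount            float64 // deterministic: how few files it touches
	PathRisk             float64 // deterministic: how benign the paths are
	LLM                  float64 // model judgment of the diff
	TouchesSensitivePath bool    // e.g. production infrastructure or auth
}

// Hypothetical weights and cut-offs. The LLM weight is the largest single
// input but stays under half of the total.
const (
	wLLM, wSize, wFiles, wPath = 0.40, 0.20, 0.20, 0.20
	highCut, mediumCut         = 0.80, 0.50
)

// grade combines the signals into a tier and applies the sensitive-path cap.
func grade(s Signals) string {
	score := wLLM*s.LLM + wSize*s.Size + wFiles*s.FileCount + wPath*s.PathRisk
	tier := "low"
	switch {
	case score >= highCut:
		tier = "high"
	case score >= mediumCut:
		tier = "medium"
	}
	if s.TouchesSensitivePath && tier == "high" {
		tier = "medium" // capped: these changes keep going to human review
	}
	return tier
}

func main() {
	fmt.Println(grade(Signals{LLM: 1}))
	fmt.Println(grade(Signals{Size: 1, FileCount: 1, PathRisk: 1, LLM: 1}))
	fmt.Println(grade(Signals{Size: 1, FileCount: 1, PathRisk: 1, LLM: 1, TouchesSensitivePath: true}))
}
```

With these assumed weights, a model that is fully confident about a change the deterministic signals dislike still lands in the lowest tier, which is the property that matters. The per-signal breakdown that makes each grade auditable is not modeled here.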

The loop calibrates itself — reverts teach the system

A static scorer drifts. So the confidence bot keeps a memory — a vector index (RAG) of past PRs and, crucially, what actually happened to them: merged, closed, or reverted. Before it scores a new PR, it retrieves the most similar past PRs. It feeds them to the LLM reviewer as calibration anchors, giving the model concrete precedent for how the repo has actually treated changes like this one.

Reverts matter most here. When a new change resembles one that was previously reverted, the reviewer sees that (and the recorded reason) and grades the new PR lower. The negative example does its job.

This is where humans close the loop. When an engineer reverts an auto-merged PR and explains the revert, the bot links the revert back to the original and stores that reason alongside the negative outcome. An explained revert carries the most signal: the reviewer sees both that a similar change was reverted and why. A revert without a reason still registers as a negative, but the explanation is what future grading actually learns from.
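The retrieval step can be sketched as a nearest-neighbor lookup over stored PR records. This is a toy illustration, not the bot's index: the `PastPR` type, the two-dimensional embeddings, and the sample records are all invented, and a real system would use a vector database and embeddings from a model.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// PastPR is a hypothetical memory record: what the change looked like and
// what eventually happened to it.
type PastPR struct {
	Title     string
	Outcome   string    // "merged", "closed", or "reverted"
	Reason    string    // revert explanation, empty if none was given
	Embedding []float64 // assumed to be precomputed by some embedding model
}

// cosine returns the cosine similarity of two vectors, or 0 if they cannot
// be compared.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearest returns the k past PRs most similar to the query, best match first.
func nearest(query []float64, history []PastPR, k int) []PastPR {
	ranked := append([]PastPR(nil), history...) // copy; leave history untouched
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(query, ranked[i].Embedding) > cosine(query, ranked[j].Embedding)
	})
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

// anchor renders one precedent as a line that could be placed in a prompt.
func anchor(p PastPR) string {
	if p.Reason != "" {
		return fmt.Sprintf("%s: %s (%s)", p.Outcome, p.Title, p.Reason)
	}
	return fmt.Sprintf("%s: %s", p.Outcome, p.Title)
}

// sampleHistory holds invented records with made-up embeddings.
func sampleHistory() []PastPR {
	return []PastPR{
		{Title: "wrap errors in config loader", Outcome: "merged", Embedding: []float64{1, 0}},
		{Title: "wrap errors in auth client", Outcome: "reverted", Reason: "changed a matched error string", Embedding: []float64{0.9, 0.1}},
		{Title: "pin workflow actions", Outcome: "closed", Embedding: []float64{0, 1}},
	}
}

func main() {
	for _, p := range nearest([]float64{0.8, 0.2}, sampleHistory(), 2) {
		fmt.Println(anchor(p))
	}
}
```

In this toy data the closest precedent is the reverted one, and its recorded reason travels with it into the prompt. That is the mechanism by which an explained revert carries more signal than an unexplained one.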

The only actor that merges is deterministic

Here's the line we never cross: the confidence score is advisory. The bot that does the merging is a separate, deterministic approver bot — no model, no judgment, just policy. And it's gated on two things that both have to be true:

  1. A high grade bound to the exact commit SHA. This is the subtle one. The tier label is sticky: it stays on the PR across new pushes until the bot re-grades, so a label alone can let a grade computed for an old, good commit vouch for a new, broken one. Instead, the grade is also published as a check run on the head commit. A check run is scoped to the SHA it was created for; a new commit has no grade until the bot re-grades it. A stale grade physically cannot admit a broken commit.

  2. Every required CI check is green on that same commit.

Only when both hold does the approver merge.
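The two conditions reduce to a small pure function. The sketch below is an illustration of the policy, not the approver bot's code: `CheckRun` is a simplified stand-in for a check run record, the check name `confidence/high` is hypothetical, and runs are assumed to be listed oldest first.

```go
package main

import "fmt"

// CheckRun is a simplified stand-in: a named result attached to one commit.
type CheckRun struct {
	Name       string
	HeadSHA    string
	Conclusion string // e.g. "success" or "failure"
}

// gradeCheck is a hypothetical name for the check run that publishes a
// high confidence grade.
const gradeCheck = "confidence/high"

// canMerge is the whole policy: a high grade and every required check must
// have succeeded on exactly this head commit.
func canMerge(headSHA string, runs []CheckRun, required []string) bool {
	passed := map[string]bool{}
	for _, r := range runs {
		if r.HeadSHA != headSHA {
			continue // results for other commits never count
		}
		passed[r.Name] = r.Conclusion == "success" // later runs overwrite earlier ones
	}
	if !passed[gradeCheck] {
		return false // no grade for this SHA, so a stale grade cannot vouch
	}
	for _, name := range required {
		if !passed[name] {
			return false
		}
	}
	return true
}

// sampleRuns models a PR graded at commit aaa111 and then pushed to bbb222.
func sampleRuns() []CheckRun {
	return []CheckRun{
		{Name: gradeCheck, HeadSHA: "aaa111", Conclusion: "success"},
		{Name: "build", HeadSHA: "aaa111", Conclusion: "success"},
		{Name: "test", HeadSHA: "aaa111", Conclusion: "success"},
		{Name: "build", HeadSHA: "bbb222", Conclusion: "success"},
		{Name: "test", HeadSHA: "bbb222", Conclusion: "success"},
	}
}

func main() {
	required := []string{"build", "test"}
	fmt.Println(canMerge("aaa111", sampleRuns(), required))
	fmt.Println(canMerge("bbb222", sampleRuns(), required))
}
```

The second commit has green CI but no grade of its own, so it is not eligible until it is graded again. There is no label lookup anywhere in the function, which is what keeps a sticky label from mattering.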

A standard is authored and tested once, the agents apply it and grade the result, and a deterministic policy does the merging. The AI does the judging; the merge itself is a check on a commit SHA and green CI. That's what lets the whole loop run without a human gate.

What it looks like in production: the numbers

Over the last 8 weeks, across our repositories, skillup opened 4,746 standards-fix pull requests. The confidence bot, which we rolled out partway through this window, has graded 3,044 of them so far; only the low-risk ones are eligible to auto-merge. Of the 1,117 it judged low-risk, 840 (75.2%) were merged with no human in the loop.

We expect that 75% to keep climbing, from both ends. The confidence bot gets better the longer it runs: it learns from its own grading history and from what happens to the PRs it graded. Reverts carry the most signal. A merge that sticks only weakly confirms a grade was right, but a reverted auto-merge is concrete proof that a change it called safe wasn't. The bot remembers those cases and grades similar changes more carefully next time. Better calibration means more of the low-risk PRs are safe to merge on their own, and the near-term goal is to auto-merge 100% of them. At the same time, we're adding more deterministic checks to CI. Each new check that can vouch for a class of change moves more PRs into the low-risk bucket in the first place. Along both axes, the share of changes the loop can safely land keeps growing.


Beyond the monorepo and standards

We built this for our control-plane monorepo, but nothing in it is specific to mono. The same reconcilers and agents run against our data-plane repositories, too, holding them to the same standards and merging the safe changes the same way. One authored standard reaches every repo the fleet watches, not just one.

And standards are a small slice of what the fleet does. The three Skills in this post — go-standards, gha-standards, go-security-review — are part of nearly 100 Skills the agents draw on; most have nothing to do with code style. They encode packaging across a dozen languages, CVE remediation, image building and testing, and more. The reviewer, fixer, LogAnalyzer, judge, and confidence scorer are general DriftlessAF building blocks; across the Chainguard Factory, the same pieces do all of these jobs. The loop in this post is the pattern, not the limit.

Acknowledgments

None of this is the work of one person. Thanks to all the great Chainguard engineering talent: Taylor Bloom, Joe Borg, Jonathan Lange, James Rawlings, Matt Moore, and others, for collaborating on these agents — building and deploying them, and the data pipelines behind them that let us measure what they're doing. Keeping our codebase on standard across many repos is a team effort across engineering, and it keeps getting better because of the people behind it.

Reach out if you are interested in learning more about how our agents work and how Chainguard’s solutions can help you.
