
For decades, software development has run on a critical, yet unwritten, assumption: we trust the people writing the code.
We trust that changes are intentional, scoped, and understood. We trust that when something breaks, someone in the loop knows where to look and how to fix it. That trust has quietly underpinned everything from code review practices to release pipelines.
In a recent conversation with Axian Lead Solutions Architect Gabe Harris on release controls for AI-assisted changes in production, that assumption came into focus in a new way. Not because it was ever formally defined. But because it is now being tested.
While AI-assisted development doesn’t invalidate the assumption of implicit trust outright, it does bring its limitations into view.
Today, roughly 30% of daily merged code is AI-authored (up from 22% in Q4 of 2025), and developers using AI tools are merging about 60% more pull requests than those who don’t. More code is moving through systems faster than ever before, and continued acceleration seems like a safe assumption for the foreseeable future. So, the pace has changed. The inputs have changed. However, the shared understanding behind those changes has not kept up.
As with the magically animated broom in The Sorcerer’s Apprentice, the fundamental question in play here is not whether generative AI tools will carry water for us. Rather, it’s how we earn the wizard’s cap and safely maintain control over powerful tools that continue to evolve.
The Invisible Contract Behind Every Release
Every release process has a technical architecture. It also has a social one.
For most modern software development, those two systems have depended on the same basic sequence: humans write code, humans review code, and humans understand the intended scope of what changed. This does not mean every change is perfect or that every engineer sees every downstream consequence. It simply means the process assumes bounded intent.
Experienced engineers make thousands of small decisions while building a feature. They decide which files to touch, which patterns to preserve, which dependencies to leave alone, and which adjacent problems are outside the task at hand. Most of these judgments never appear in a ticket or pull request description. They live in the engineer’s understanding of the system.
That is the invisible contract behind a release: the person changing the system has some practical sense of where change should begin, where it should end, and what it should not disturb.
As Gabe puts it, “Everybody’s trying to figure out… how do we build a system where we can trust the output in the same way we were able to trust it previously?”
AI-assisted development forces that question into the open. The old model did not enforce trust with much precision. It assumed trust because the author, reviewer, and release process all operated inside a shared human framework.
The Good, the Bad, and the Unbounded
AI doesn’t write bad code. It writes unbounded code. As Gabe puts it, “AI-generated code is not bad. It’s average.” In many cases, the output is perfectly serviceable. It compiles. It follows recognizable patterns. It often aligns with established best practices. If anything, it reflects the statistical center of how software tends to be written.
The difference shows up in how that code behaves within a system.
Human-written code is typically scoped by intent. An experienced engineer approaches a task with an implicit understanding of boundaries: what needs to change, what should remain untouched, and what belongs to another part of the system. The human engineer’s judgment is shaped by context, ownership, and tradeoffs that rarely appear in documentation but heavily influence the outcome.
AI-generated code operates differently.
Given sufficient context, it tends to complete patterns wherever it detects them. It does not inherently recognize implicit organizational boundaries or informal ownership rules. If a solution appears to involve adjacent modules, it may modify them. If a pattern exists in one part of the system but not another, it may extend that pattern globally. The results are often changes that are technically consistent but operationally unexpected, if not altogether problematic.
Side effects begin to emerge.
A change intended for one feature can ripple into authentication logic, data handling, or payment workflows. Not because the code is incorrect, but because the system was interpreted more broadly than intended by an agent unattuned to the subtleties of tacit human context.
The problem, then, isn’t correctness. It’s a scope explosion.
We’ve Seen This Failure Mode Before
The public record of AI-assisted code failures, and public awareness of it, is still emerging, but the underlying risk pattern is not new. It is already well documented.
We have seen what happens when systems behave in ways their operators don’t fully understand or can’t fully control. In 2012, Knight Capital lost $440 million in under an hour when a seemingly benign deployment issue triggered unintended behavior across its trading systems.
Zillow’s pricing models operated as designed, yet quietly drove decisions that contributed to the shutdown of its Zillow Offers business in 2021, after losses approached $881 million.
A configuration change famously took down Facebook’s global services in 2021, not because the system broke in an obvious way but because its interconnected behavior exceeded the team’s ability to manage it in real time.
Each of these incidents is different on the surface.
- One is a deployment failure.
- One is a model-driven business decision system.
- One is infrastructure misconfiguration.
But they share a common structure: the system continued operating beyond the bounds its operators expected, and the mechanisms to detect, contain, or reverse that behavior were insufficient.
Knight Capital lacked effective rollback controls. Zillow lacked sufficient visibility into how model outputs translated into business risk at scale. Facebook’s internal systems failed in a way that made recovery more difficult precisely when control was most needed.
These weren’t failures of bad code. They were failures of control.
AI-assisted development introduces an analogous problem, only with the gas pedal floored.
You Can’t Test What You Don’t Expect
Discussing the implications of the predicament, Gabe frames the problem directly: “How do you build gates… against unknown unknowns?”
This question gets to the core limitation of traditional release and quality practices. Most testing frameworks, however sophisticated, are built around known failure modes. Teams define expected inputs, expected behaviors, and known edge cases. They simulate what they can anticipate and they build confidence from coverage across those scenarios.
This model works as long as the system behaves within understood boundaries.
AI-assisted development introduces a different class of risk. Systems begin to exhibit emergent behavior, where outcomes arise from interactions that were not explicitly designed or reviewed. Because generation is non-deterministic, outputs can vary across runs even with similar inputs, and they do so in environments that have historically relied on predictability. And because AI-generated changes can span multiple parts of a codebase, the resulting interactions are often system-wide rather than localized.
These are not bugs in the traditional sense. They are consequences of complexity operating beyond full human comprehension.
This shift is already visible at the organizational level. Recent research shows pull requests created with AI tools contain roughly 1.7 times more issues than those written by humans, dramatically increasing the likelihood that defects slip through review and into production.
The implication is straightforward. Risk is no longer confined to rare edge cases that can be enumerated and tested away. It’s embedded in how the system behaves as a whole.
Not All Code Is Created Equal Anymore
So how do we address this systemic shift?
“To start,” says Gabe, “you need to create a categorization system for how that code came into being.”
This shift is more than administrative. It reflects a structural change in how software is produced and evaluated. In an exclusively human development model, all PRs can be treated as broadly equivalent because they originate from a similar process. That is no longer the case.
Today, code enters the system through multiple paths. Some changes are fully human-authored. Others are created with AI assistance, where a developer guides and edits the generated output. And increasingly, some are produced in largely agentic workflows, where systems generate, modify, and propose changes with limited direct intervention or human oversight.
These differences are not cosmetic. Each tier carries a different risk profile.
Human-authored code reflects bounded intent, shaped by experience and contextual judgment. Assisted code blends that intent with pattern-driven generation, introducing subtle uncertainty. Agentic code can operate across a wider scope, increasing the likelihood of unintended interactions.
Those differences demand different controls. Review thresholds, testing depth, and approval requirements can no longer be uniform. They must reflect how the code was produced and how far its effects might reach.
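One way to picture tiered controls is a small lookup from provenance to review policy. The sketch below is illustrative only: the tier names follow the three paths described above, while `ReviewPolicy`, `policy_for`, and every threshold are hypothetical placeholders a team would tune to its own risk appetite.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    HUMAN = "human"        # written directly by an engineer
    ASSISTED = "assisted"  # AI-generated, then guided and edited by a developer
    AGENTIC = "agentic"    # generated and proposed with limited human intervention


@dataclass(frozen=True)
class ReviewPolicy:
    min_reviewers: int
    require_domain_owner: bool
    require_staged_rollout: bool


# Hypothetical thresholds, not recommendations.
POLICIES = {
    Provenance.HUMAN: ReviewPolicy(1, False, False),
    Provenance.ASSISTED: ReviewPolicy(1, True, False),
    Provenance.AGENTIC: ReviewPolicy(2, True, True),
}


def policy_for(provenance: Provenance) -> ReviewPolicy:
    """Look up the review controls for a change based on how it was produced."""
    return POLICIES[provenance]
```

The point of the table is less the specific numbers than the fact that provenance becomes an explicit input to the release process instead of an unrecorded detail.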
The New Role of the Pull Request
Under the new regime, pull requests become the first place where teams can reassert control over how changes enter the system.
That starts with adjusting review expectations. Agentic or heavily AI-assisted changes may require multiple reviewers, not just for validation but for interpretation. In sensitive areas like payments, authentication, or data handling, reviews must become targeted, pulling in domain owners who understand the downstream impact of even small changes.
The checks themselves also evolve. Traditional reviews focus on correctness and style. AI-assisted workflows require additional scrutiny for patterns that signal overreach, unintended modifications, or system-wide effects that would not typically be introduced by a human developer working within a defined scope. Before assessing quality, reviews must first answer the question: “Do we understand what this change touches?”
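That question can be partly automated before a human ever opens the diff. As a rough sketch, assuming the author declares the intended scope as path patterns, the check below lists files that fall outside that scope and names any sensitive areas the change touches. The directory layout, `SENSITIVE_AREAS`, and `review_scope` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for sensitive areas; adjust to the repository layout.
SENSITIVE_AREAS = {
    "payments": "src/payments/*",
    "auth": "src/auth/*",
    "data": "src/data/*",
}


def review_scope(changed_files, declared_scope):
    """Compare what a change touches against the scope the author declared."""
    # Files that match none of the declared scope patterns.
    out_of_scope = [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in declared_scope)
    ]
    # Sensitive areas with at least one changed file, in alphabetical order.
    sensitive = sorted(
        area for area, pattern in SENSITIVE_AREAS.items()
        if any(fnmatchcase(path, pattern) for path in changed_files)
    )
    return {"out_of_scope": out_of_scope, "sensitive_areas": sensitive}
```

A change scoped to `src/search/*` that also edits `src/auth/session.py` would come back with that file listed as out of scope and `auth` flagged, which is a reasonable trigger for pulling in a domain owner.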
Shipping Code vs. Containing Risk
As AI-assisted development accelerates output, the release strategy must shift from proving confidence to limiting exposure. Teams seeing major productivity gains from AI-enabled development are not simply producing more code. They are pushing more change into systems that may not have evolved at the same pace.
That makes staged release models more important. Canary releases, phased rollouts, and limited audience exposure give teams a way to observe behavior before the change reaches everyone. The point is less to prove releases are safe than it is to make sure any failure starts small enough to understand and contain.
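A phased rollout can be as simple as assigning each user a stable bucket and widening the exposed range over time. The sketch below shows one common approach, hash-based bucketing; the function names and the 0 to 99 bucket range are assumptions made for illustration, not a reference to any particular feature-flag product.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_exposed(user_id: str, feature: str, percent: int) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, feature) < percent
```

Because the bucket is derived from the user and feature rather than drawn at random, a user exposed at 5% stays exposed at 25%, so widening the rollout only adds people and any failure can be traced to a known audience.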
Beyond confidence, the goal is containment.
Successful containment also depends on rollback posture. Feature flags and kill switches let teams disable risky functionality without waiting for a full redeploy. Explicit rollback plans for both code and data help prevent a bad release from becoming an unrecoverable business event.
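To make the kill-switch idea concrete, here is a minimal sketch. It assumes an in-memory flag store, where a real system would read from a shared configuration service, and it uses an invented `dynamic_pricing` feature with a made-up 8% adjustment purely as an example of risky new behavior.

```python
class KillSwitch:
    """In-memory flag store; a real system would back this with shared config."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


def quote_price(base_price: float, switch: KillSwitch) -> float:
    """Use the new pricing path only while its flag is on; otherwise fall back."""
    if switch.is_enabled("dynamic_pricing"):
        return round(base_price * 1.08, 2)  # hypothetical new behavior
    return base_price  # known-good fallback
```

The important design choice is that the fallback path stays in the codebase and is checked on every call, so turning the feature off is a configuration change rather than a redeploy.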
Knight Capital remains the cautionary example here. The problem was not just that unintended trading behavior occurred. It was that the firm could not stop it fast enough, even after the problem had been discovered. In other words, if you can’t stop it, you don’t control it.
When Everything Looks Fine but Nothing Is
Traditional observability focuses on system health indicators like latency, uptime, or error rates. While these signals still matter, they no longer tell the full story. A system can be fast, available, and technically correct while producing outcomes that are materially wrong.
AI-assisted changes tend to fail at the level of interaction and interpretation, rather than execution. That means the first signal of a problem often appears in business metrics rather than system alerts. Pricing behaves unexpectedly. Transaction volumes spike or drop without explanation. Data patterns shift in ways no one explicitly designed.
Remember, Zillow’s pricing model did not crash. It produced decisions that looked reasonable locally and failed at scale. Organizations must broaden their testing and review processes to cover system effects, including those far downstream, alongside system functionality.
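Watching business metrics does not require elaborate tooling to get started. As a simple sketch, the check below flags a metric that sits far from its recent baseline using a z-score; the function name, the default threshold of 3.0, and the choice of a z-score over other anomaly tests are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, threshold=3.0):
    """Flag a business metric that sits far outside its recent baseline."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points")
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any movement at all is unexpected.
        return current != baseline[0]
    return abs(current - mean(baseline)) / spread > threshold
```

Fed with something like daily average order value after a release, a check of this kind can raise a flag while latency and error rates still look healthy.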
Persistent Problems Demand Evolution
This shift in perspective brings a new reality into focus. Looping AI into the development cycle, with ever-increasing autonomy, introduces a persistent, intractable problem, one that organizations must keep evolving to manage rather than expect to solve.
Gabe says, “I would never be comfortable advising a client to set up a good set of gates… and then just walk away.”
Models change. Behavior shifts. Outputs evolve even when inputs appear consistent. Controls that work today will be less reliable in a month and likely untenable altogether in a year. This brings reviews and release controls closer to a security discipline than to a traditional QA process. Release controls, observability, and review practices must keep adapting alongside the tools they are meant to govern. For organizations navigating AI-assisted development, the goal is not a permanent, one-off solution. It’s a disciplined capacity to keep earning trust as the system changes.
Release Controls for AI-Assisted Development with Axian
AI-assisted development isn’t slowing down.
The question is whether your release process is keeping up.
If you don’t have clear controls around how AI-generated changes enter, move through, and impact your systems, you’re not managing risk. You’re inheriting it.
Axian helps engineering teams design release controls, observability, and rollback strategies that keep systems predictable without slowing delivery.
If you’re rethinking how to maintain control as AI accelerates development, let’s talk.
Contact Axian to start the conversation.