Agentic Engineering: Why Better AI Won’t Fix Delivery Problems

Most people remember Apollo 11 as a triumph of astronauts.

That is understandable: the astronauts were the visible heroes of the mission. Their names entered history books, their faces appeared on magazine covers, and their achievements became symbols of technical excellence.

But Apollo 11 did not succeed simply because NASA found the perfect astronauts.

Humanity reached the Moon because NASA built Mission Control first.

Behind every launch stood a system designed to make success possible: procedures, simulations, telemetry, training, and operating disciplines capable of supporting humans under conditions no one had experienced before. The astronauts mattered, but the operational environment built around them was the truly novel factor behind the breakthrough.

Today, many organizations are approaching AI from the opposite direction.

They are searching for the perfect model (GPT, Claude, Gemini, and others) to act as a silver bullet for their AI problems. Leadership teams compare benchmarks, evaluate subscriptions, and debate which platform will deliver the greatest “turnkey” productivity gains.

Yet emerging evidence suggests they may be focused on the wrong lever. In a recent study, experienced developers using AI coding tools took 19% longer to complete certain real-world engineering tasks than when working without those tools. This finding challenges one of the most common assumptions surrounding AI adoption: that better AI automatically produces better outcomes.

Organizations are increasingly discovering that the differentiator is not the model itself. It is the environment surrounding the model.

Axian CTO and Practice Lead Tyler Holmes delved into this exact distinction in a recent conversation:

“The goal is not to find one perfect model. The goal is to create a curated environment where many models can be successful.”

That changes the question entirely.

More AI ≠ More Velocity

The first wave of AI adoption in software development followed a simple equation: more AI equals more code. In many cases, that prediction proved correct. Developers using AI assistants can generate code faster, explore solutions more quickly, and complete routine tasks with less conscious effort than before.

But “lines of code output” and “delivery velocity” are not the same thing.

Every line of generated code still must be reviewed, tested, secured, deployed, monitored, and maintained. As output accelerates, those downstream processes often become the new bottleneck. Review queues grow. Testing requirements expand. Engineers spend more time validating changes they did not write themselves, and confidence becomes harder to maintain.
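
The bottleneck argument can be made concrete with a minimal sketch. It assumes a simplified pipeline in which every change must pass through each stage and each stage has a fixed weekly capacity. The stage names, the capacities, and the tripling of generation are illustrative assumptions, not measurements.

```python
# Hypothetical weekly capacities (changes per week) for each delivery stage.
# All numbers are illustrative assumptions, not measured data.

def shipped_per_week(capacities: dict[str, int]) -> int:
    """A change ships only after passing every stage, so sustained
    throughput is capped by the slowest stage."""
    return min(capacities.values())


def bottleneck(capacities: dict[str, int]) -> str:
    """Name of the stage with the lowest capacity."""
    return min(capacities, key=capacities.get)


before = {"generate": 20, "review": 25, "test": 30, "deploy": 40}

# Assume AI assistance triples generation and leaves every other stage unchanged.
after = {**before, "generate": before["generate"] * 3}

print(shipped_per_week(before), bottleneck(before))  # 20 generate
print(shipped_per_week(after), bottleneck(after))    # 25 review
```

In this toy model, generation triples but shipped output rises only from 20 to 25 changes per week, and the constraint moves from writing code to reviewing it.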

This is why some organizations experience a surprising outcome after adopting AI tools: they generate more software activity without seeing a proportional increase in shipped software.

As Tyler Holmes observes, “The emerging craft of agentic engineering sits on top of software engineering.” AI primarily accelerates generation. It does not automatically improve the systems responsible for absorbing, validating, and shipping what gets generated.

The result is a lesson many organizations are learning in real time: more code is easy. More trusted software is not.

The Wrong Question: Which Model Should We Buy?

Most organizations begin their AI journey the same way: by evaluating models.

They compare benchmark scores, pricing structures, context windows, coding performance, reasoning ability, and deployment options. They debate whether GPT outperforms Claude, whether open-weight models offer strategic advantages, and whether buying access to the most capable model available will create durable differentiation.

While still necessary, these questions are rapidly becoming less important.

As Holmes puts it, “The best agentic engineers are not simply better prompt writers. They are usually software engineers who are practiced at assembling/supporting toolchains and processes to get the best output from humans, and now agents.”

This observation touches on a broader shift taking place across engineering organizations. Model capability still matters, but it increasingly behaves like a multiplier rather than a differentiator.

Recent research suggests AI acts primarily as an amplifier of existing organizational strengths and weaknesses rather than a universal productivity solution. Organizations with strong testing practices, clear documentation, established standards, and disciplined validation processes often see meaningful acceleration when implementing new AI-driven processes. However, organizations without those foundations frequently discover that AI simply helps them produce confusion faster.

This is the inversion at the center of agentic engineering.

Competitive advantage is moving away from selecting the perfect model and toward designing environments where many models can succeed. The organizations creating the most value are increasingly the ones that have invested in context, governance, tooling, and feedback systems rather than treating AI as a standalone productivity layer.

So, what does an environment built to include agents actually look like?

What Agentic Engineers Actually Engineer

Traditional software engineering has generally focused on producing software directly. Teams define requirements, write code, test behavior, and move applications into production.

Agentic engineering shifts the center of gravity. Rather than treating software as the primary output, agentic engineering treats the environment that produces software as the thing being engineered.

A software engineer improves a system by modifying it and testing the change. An agentic engineer improves a system by enhancing the environment around the agent (the harness, tools, skills, context, and supporting DevOps pipelines) so that it produces acceptable outcomes more reliably, all while maintaining an understanding of the product being produced.

That does not mean software engineers disappear. As Holmes repeatedly emphasized throughout our conversation, agentic engineering sits on top of software engineering rather than replacing it. The difference is where (human) engineering effort gets concentrated.

Curated Context

The first responsibility is context.

A common assumption is that better AI outcomes come from giving models more information. In practice, effective environments tend to provide more curated information (in the right amount, at the right time) to agents instead. Documentation, architectural guidance, examples, standards, and relevant system knowledge become production assets because they shape the quality of outputs.
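
One way to picture "curated rather than more" is a selection step that ranks available documents against the task and packs only the relevant ones into a fixed budget. The sketch below is a deliberately crude illustration: the `Doc` type, the word-overlap `relevance` score, and the word budget are all assumptions made for the example, and a real environment would likely use better retrieval and token counting.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    text: str


def relevance(doc: Doc, task: str) -> int:
    """Crude relevance score: number of distinct task words found in the doc."""
    task_words = set(task.lower().split())
    doc_words = set(doc.text.lower().split())
    return len(task_words & doc_words)


def curate(docs: list[Doc], task: str, budget_words: int) -> list[Doc]:
    """Pick the most relevant docs that fit within a word budget."""
    ranked = sorted(docs, key=lambda d: relevance(d, task), reverse=True)
    chosen, used = [], 0
    for doc in ranked:
        size = len(doc.text.split())
        # Skip anything irrelevant or anything that would exceed the budget.
        if relevance(doc, task) > 0 and used + size <= budget_words:
            chosen.append(doc)
            used += size
    return chosen


# Hypothetical knowledge base for illustration only.
docs = [
    Doc("Payments API guide", "payments api retries use idempotency keys"),
    Doc("Office seating chart", "desk assignments for the third floor"),
    Doc("Testing standards", "every payments change needs integration tests"),
]

selected = curate(docs, "add retries to payments api", budget_words=12)
print([d.title for d in selected])  # ['Payments API guide', 'Testing standards']
```

The seating chart is available but never reaches the agent, which is the point: the environment decides what is worth the model's attention.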

Tooling

Models become significantly more capable when they can act rather than simply respond.

This layer includes APIs/MCPs, custom harness tools, command-line access, execution environments/sandboxes, testing tools, and access patterns that allow systems to retrieve information, perform actions, and validate results. The goal is to increase autonomy, so that the agent can test its own work, explore on its own, and stay in the loop.
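
As a rough illustration of the tooling layer, the sketch below shows a harness that exposes a small registry of tools and dispatches an agent's requests to them. Everything here is hypothetical: the `tool` decorator, the `dispatch` function, the request shape, and the two example tools are assumptions for the example and do not represent any specific framework or protocol.

```python
from typing import Callable

# Registry of tools the harness exposes to an agent. Names are hypothetical.
TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function as an agent-callable tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("read_doc")
def read_doc(topic: str) -> str:
    """Retrieve information: look up a (made-up) internal document."""
    docs = {"retries": "Use exponential backoff with a cap of 5 attempts."}
    return docs.get(topic, "no document found")


@tool("run_check")
def run_check(value: int) -> str:
    """Validate a result: a stand-in for running a real test."""
    return "pass" if value >= 0 else "fail"


def dispatch(request: dict) -> str:
    """Execute a tool request of the form {'tool': name, 'args': {...}}."""
    name = request.get("tool")
    if name not in TOOLS:
        # Anything not explicitly registered is refused.
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**request.get("args", {}))


print(dispatch({"tool": "read_doc", "args": {"topic": "retries"}}))
print(dispatch({"tool": "run_check", "args": {"value": 3}}))
print(dispatch({"tool": "delete_prod", "args": {}}))
```

The agent can retrieve information and check its own work, while a request for a tool that was never registered (the last call) returns an error instead of executing.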

Sandboxes

More capable systems require right-sized boundaries. Sandboxes create controlled environments where agents can experiment, execute workflows, and iterate without creating production risks. They allow organizations to increase throughput while preserving confidence.
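
A minimal sketch of the idea, assuming the agent's output is a Python snippet: run it in a throwaway directory with a time limit and capture the result. The function name `run_in_sandbox` and its parameters are hypothetical. This bounds only the working directory and the runtime, so it is not a security boundary; production sandboxes typically add containers, virtual machines, or restricted credentials.

```python
import subprocess
import sys
import tempfile


def run_in_sandbox(code: str, timeout_s: float = 5.0) -> tuple[int, str]:
    """Run a Python snippet in a throwaway directory with a time limit.

    Returns (exit_code, stdout). A timeout is reported as (-1, "timed out").
    This limits the working directory and runtime only; it does not
    restrict network access or the rest of the filesystem.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                cwd=workdir,          # files the snippet creates land here
                capture_output=True,
                text=True,
                timeout=timeout_s,    # kill runaway snippets
            )
        except subprocess.TimeoutExpired:
            return (-1, "timed out")
        return (result.returncode, result.stdout.strip())


print(run_in_sandbox("print(2 + 2)"))  # (0, '4')
```

The temporary directory is deleted when the call returns, so each run starts from a clean state.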

Validation

This is where the entire approach either succeeds or fails. As Holmes puts it:

“You can’t really get into agentic engineering unless you’ve got fantastic testing.”

Testing, evaluation, review systems, and feedback loops become the mechanism that transforms generated output into trusted output. Viewed together, these layers suggest a different way of thinking about AI adoption. The model is increasingly becoming a component, while the environment around the model is becoming the product.
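
The validation layer described above can be pictured as a gate: generated output is accepted only when every check passes, and the names of failed checks are fed back to the agent. The sketch below is an assumption-laden illustration; the two string-based checks (`has_tests`, `no_todo_markers`) are hypothetical stand-ins for real test suites, linters, and review steps.

```python
from typing import Callable

Check = Callable[[str], bool]


def has_tests(change: str) -> bool:
    """Stand-in check: does the change include at least one test function?"""
    return "def test_" in change


def no_todo_markers(change: str) -> bool:
    """Stand-in check: does the change avoid leaving unfinished work?"""
    return "TODO" not in change


def validate(change: str, checks: dict[str, Check]) -> tuple[bool, list[str]]:
    """Run every check; accept the change only if none fail."""
    failures = [name for name, check in checks.items() if not check(change)]
    return (not failures, failures)


CHECKS = {"has_tests": has_tests, "no_todo_markers": no_todo_markers}

generated = "def add(a, b):\n    return a + b  # TODO handle overflow\n"
accepted, failures = validate(generated, CHECKS)
print(accepted, failures)  # False ['has_tests', 'no_todo_markers']
```

The list of failures is what makes this a feedback loop instead of a simple reject button: the agent can be told exactly which expectations it missed and try again.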

One useful way to think about this shift is that traditional software teams historically optimized for predictable execution, while agentic environments try to create predictable generation from otherwise heuristic/statistical models. That distinction sounds subtle, but it changes where investment goes. Successful teams are treating documentation as infrastructure, test suites as interfaces, and operational controls as part of the development process rather than governance overhead layered on afterward.

The result is that engineering maturity becomes visible earlier. Organizations no longer discover weaknesses after deployment, as the environment increasingly reveals them during generation.

Why Strong Engineering Teams Get Stronger

One of the more counterintuitive effects of AI-assisted development is that it does not appear to flatten differences between engineering organizations in the way many initially expected. If anything, it may widen them.

Recent reporting shows 92% of developers already use AI to generate code. If that trend continues, access to capable models will become increasingly common and the competitive advantage they offer will erode.

Organizations will no longer differentiate primarily through access to better generation. They will differentiate through their ability to focus GenAI into areas of their business that benefit from the lift and can absorb what gets generated. This is where engineering maturity becomes difficult to ignore.

Organizations built around tribal knowledge, undocumented processes, inconsistent standards, and weak testing often discover that AI accelerates existing problems. More changes move through the system, but uncertainty increases with output. Teams spend more time reviewing, validating, and reconciling work than expected.

By contrast, strong engineering organizations frequently experience a different outcome. Clear documentation creates better context. Standards create more consistent outputs. Testing creates confidence. Governance creates trust.

As Holmes observes, “AI adoption often pushes teams back toward fundamentals.”

This idea may feel counterintuitive at first. AI is often framed as a technological leap forward. Yet many of the organizations benefiting most are succeeding through disciplines that would have looked familiar ten years ago: documentation, validation, testing, review practices, and operational ownership.

AI does not replace those fundamentals. It exposes them and redefines leadership priorities.

The Strategic Question for Engineering Leaders

As better models cease being the primary source of differentiation, leadership priorities begin to change.

The first generation of executive AI conversations tended to revolve around procurement and adoption. Which model should we buy? How many licenses should we provision? How quickly can teams begin using these tools?

Now these problems sit lower in the stack of priorities.

Organizations beginning to realize meaningful value from AI appear to be asking different questions. What context should the model have access to, and does it already exist? How is output validated? Who owns the generated artifacts once they enter production? What actions can execute autonomously? What decisions must remain human?

These questions point toward a broader shift in thinking.

Agentic environments are not primarily evaluated by model performance. They are evaluated by their ability to repeatedly produce trustworthy outcomes.

In practical terms, organizations should begin assessing readiness across four dimensions:

  • Context: Can systems access current, right-sized, trustworthy information?
  • Controls: Can output be tested, shaped, and governed?
  • Validation: Can the generated work be tested and trusted?
  • Ownership: Can someone who already works here maintain and operate these systems over time?
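
A lightweight way to apply the four dimensions above is a simple self-assessment. The sketch below assumes each dimension is scored from 0 to 5 and that a score of 3 is the minimum for readiness; the scale, the threshold, and the example scores are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Readiness:
    # Each dimension is scored 0 (absent) to 5 (strong). The scale is an assumption.
    context: int
    controls: int
    validation: int
    ownership: int

    def weakest(self) -> str:
        """Name of the lowest-scoring dimension."""
        scores = vars(self)
        return min(scores, key=scores.get)

    def ready(self, threshold: int = 3) -> bool:
        """Ready only if every dimension meets the threshold."""
        return all(score >= threshold for score in vars(self).values())


# Hypothetical team: good documentation and ownership, weak testing.
team = Readiness(context=4, controls=3, validation=1, ownership=4)
print(team.ready(), team.weakest())  # False validation
```

Requiring every dimension to clear the bar, instead of averaging the scores, reflects the argument that a single weak foundation limits what the whole environment can absorb.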

None of these questions are entirely new. Engineering organizations have always cared about ownership, controls, and validation. What changes in an agentic environment is the order of operations. Historically, teams built software and then introduced controls around it. Now those controls are becoming a critical part of the generation process itself.

That shift creates a subtle but important change in leadership behavior. Instead of asking how quickly teams can adopt AI, leaders begin asking what conditions allow adoption to compound safely.

Viewed this way, AI strategy starts to look less like software procurement and more like operational design. The better you manage software delivery, the better you can orchestrate systems that generate software. This shift changes where competitive advantage comes from.

Part of the challenge in adapting to this shift is that models are easier to buy than environments are to build or improve. A subscription can be approved in a budget cycle. Better documentation, stronger testing practices, clearer ownership models, and more trustworthy operational controls require sustained organizational effort. Yet these less visible investments are now largely determining whether AI becomes a durable capability or another short-lived productivity initiative.

Stop Looking for Better Astronauts

Apollo 11 did not succeed because NASA discovered a new category of astronaut.

It succeeded because NASA built an environment capable of supporting extraordinary outcomes repeatedly and under pressure. Mission Control, procedures, simulations, telemetry, testing, and operational discipline transformed human capability into reliable execution.

In the development of modern AI strategies, history may be repeating itself.

Today, many organizations are still competing at the visible layer of the stack. They compare models, benchmark performance, negotiate subscriptions, and search for the combination of capabilities that will unlock meaningful gains in productivity and delivery.

These activities are understandable and not unimportant. But they are also becoming easier to replicate.

As before, the harder work is less visible. It lives in documentation that improves context. In testing systems that create trust. In tooling that enables execution. In the governance that defines boundaries. In operational practices that determine whether generated outputs can become production outcomes.

Holmes returns to the same underlying idea here: “A good agentic engineer doesn’t make the organization money by generating more code faster. The real value comes from creating an environment where basically any AI model… can participate in valuable work that aligns with the goals, needs, and style of the organization.” Rather than the model itself, the scarce resource is the environment around the model.

Organizations that treat AI as a procurement problem may see short-term gains. Organizations that treat AI as an engineering and operating model will build advantages that compound. Agentic engineering then becomes less of a replacement for software engineering and more of a question of how to best leverage existing software engineering skills.

The organizations that reap the most value from AI will not generate the most code. They will be the organizations that build environments where increasingly capable systems remain understandable, governable, and trusted.

Need Help Building AI Environments That Scale?

AI value does not come from selecting the perfect model. It comes from creating the systems that allow increasingly capable models to produce reliable outcomes. Axian advises on, architects, and delivers software and data solutions that help organizations operationalize AI through stronger engineering practices, testing, governance, and delivery systems.

Axian helps organizations design and deliver the engineering practices, testing, governance, and delivery systems needed to operationalize AI with confidence. Contact Axian to start the conversation.