Three Ways an AI Pilot Fails. Only One Is Fatal.

Most AI pilots in financial services fail in one of three ways. Two are recoverable and visible. The third looks like success, costs more every quarter, and is the one that actually kills the project.

A bank runs an AI pilot to triage fraud alerts. It works. The model clears the low-risk queue faster than the analysts did, the false-positive rate drops, and the steering committee signs off on a production build. Eight months later the fraud team is about as busy as it was before, the model is running fine, and nobody can say where the promised capacity went.

That pilot did not fail. That is what makes it the dangerous one.

When a financial services firm says an AI pilot failed, it usually means one of two things. The pilot did not work, or the pilot worked and nobody used it. Both are real problems. Both are also recoverable, and both are visible. Someone can point at them.

There is a third failure mode that does not look like failure at all. The pilot succeeds, goes to production, and never changes the economics of the team it was built to help. From the outside that reads as a win. It is the one that actually kills the project, and it is the one almost nobody writes onto the risk register.

Three failure modes. They are worth naming properly, because the response to each is different, and treating all three as the same problem is how a firm loses its second year.

The first is adoption failure. The pilot works. The model does what it was meant to do. But the people it was built for do not use it, or they use it once and drift back. The underwriters keep their own spreadsheet. The brokers keep ringing the BDM. The analysts keep working the old way, because the old way is what their muscle memory and their bonus both reward.

Adoption failure is the most common and the most discussed, which is also why it is the most survivable. It shows up within weeks. Usage data tells you straight away. And the fix, while not easy, is well understood: bring the users in earlier, take the friction out of the workflow, align the incentives, and sometimes simply require it. Firms climb out of adoption failure all the time.

The second is governance failure. The pilot works, people want it, and then it cannot get out of the building. Risk cannot get comfortable with the model's explainability. Compliance cannot map it to an existing control. The second line asks a question the project team cannot answer, and the production build stalls in a review cycle that never quite closes.

Governance failure costs more than adoption failure, because it tends to surface later, after real money has gone in. But it is still visible. There is a document somewhere with the blocking question written on it. And the regulatory direction is making the requirements clearer rather than murkier. ASIC's 2024 review of how financial services licensees use AI found adoption running ahead of governance across the firms it examined, which means the second line already knows it is behind and is under pressure to close the gap. APRA's CPS 230, in force from July 2025, puts the management of operational risk, critical operations and material service providers on an explicit footing. The bar is rising, but it is a bar you can see. Firms recover from governance failure too. It just tends to cost them a year.

The third is operating-model failure. This is the bank fraud pilot. The model works, the analysts use it, it cleared governance. And nothing downstream changed.

The pilot took work out of one step of a process that was never redesigned around the removal. The analyst who used to spend forty minutes on an alert now spends fifteen. But the queue is still assigned the same way, the shift patterns are unchanged, the headcount plan was set before the pilot, the handoff to the investigations team still works as it did, and the capacity the model freed had nowhere defined to go. It dissipated. It went into slightly longer breaks, slightly more careful work, slightly less urgency on a Friday. Real hours, quietly reabsorbed.
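A back-of-the-envelope calculation shows the scale of what dissipates. Every figure in the sketch below is a hypothetical chosen for illustration, not a number from the pilot:

```python
# Illustrative only: all figures are assumptions, not data from any real pilot.
MINUTES_BEFORE = 40        # analyst time per alert before the model
MINUTES_AFTER = 15         # analyst time per alert with the model
ALERTS_PER_ANALYST_DAY = 10
ANALYSTS = 12
WORKING_DAYS_PER_YEAR = 230

freed_minutes_per_day = (MINUTES_BEFORE - MINUTES_AFTER) * ALERTS_PER_ANALYST_DAY * ANALYSTS
freed_hours_per_year = freed_minutes_per_day / 60 * WORKING_DAYS_PER_YEAR

print(f"Freed capacity: {freed_hours_per_year:,.0f} analyst-hours per year")
# Freed capacity: 11,500 analyst-hours per year -- roughly six to seven
# analysts' worth of time at ~1,700 working hours each. Real hours, none
# of which appear on any dashboard the pilot reports to unless someone
# gives them a defined destination.
```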

The same shape shows up in insurance. A claims-triage pilot sorts incoming claims by complexity well enough that the straightforward ones can move to fast settlement. It works. But the claims team is still organised into the same general-purpose pods, still measured the same way, still staffed against the old assumption that every claim needs a full assessor's eye. The model sorted the claims. The firm never re-sorted the team. A year on, the straightforward claims are moving through the same pods at the same pace as before, because the pods themselves were the constraint, and the pilot never touched them.

There is a reason this third mode is becoming the common one. A decade ago, taking work out of a process step meant a system change, and a system change forced the operating-model conversation whether the firm wanted it or not. You could not put in a new core platform without redrawing roles and handoffs, because the platform would not run until you did. AI pilots do not force that conversation. A model can sit on top of the existing process, lift work out of one step, and leave everything around it untouched. The technology no longer makes the firm do the hard part. The hard part gets skipped.

Operating-model failure is the one that kills the project, for three reasons.

It is invisible. The pilot succeeded on every metric the pilot was measured against. Model performance, accuracy, user adoption, all green. The failure sits in a number nobody attached to the pilot: the real unit economics of the team a year later.

It compounds. Operating-model failure does not cost a firm the pilot budget. It costs the firm the business case. The board approved the spend against a projected efficiency, and the efficiency is sitting in the building, unbanked. Every quarter that passes is a quarter the firm pays for a capability it is not capturing.
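Put illustrative numbers on it and the compounding is plain. These figures are hypothetical, chosen only to show the shape:

```python
# Illustrative figures only -- not drawn from any real business case.
projected_annual_efficiency = 1_200_000   # the benefit the board approved the spend against
unbanked_per_quarter = projected_annual_efficiency / 4

for quarter in range(1, 5):
    print(f"Q{quarter}: unbanked benefit to date = ${unbanked_per_quarter * quarter:,.0f}")
# By Q4 the firm has foregone the full year's business case,
# while every metric the pilot was measured against still reads green.
```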

And it is not a technology problem, which means the technology team cannot fix it. Redesigning how fraud alerts are assigned, how the team is structured, where the freed capacity is redeployed: that is operating-model work. It needs a different owner, a different budget line, and a different kind of decision than "approve the production build."

That owner is the hard part to find. The technology team owns the model and hands it over once it works. The business line owns the P&L and assumes the efficiency will arrive on its own. The redesign of roles, queues, handoffs and structure sits in the gap between them, and a gap is not a person. It goes unowned. Nobody decided it did not matter; the org chart has no obvious box for it. The firms that get this right give the operating-model change to a named executive on the business side, and treat the production build as that executive's deliverable rather than the technology team's.

The pattern holds across the sector. Adoption failure and governance failure get caught because they come with natural alarms. Someone is not using the thing, or someone will not approve the thing. Operating-model failure has no alarm, because everyone who could raise it is looking at a green dashboard.

Before the pilot goes to production, write down the specific operating-model change that has to happen for the benefit to become real. State it as a concrete mechanism: "The fraud queue moves from round-robin assignment to risk-weighted assignment." "Two analyst roles convert to investigation roles." "The claims team goes from five pods to three." Name it, give it an owner who does not sit on the technology side, and put a date on it.
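For the first of those mechanisms, the distance between the two policies is small enough to sketch. Everything below, from the alert fields to the risk threshold to the analyst tiers, is a hypothetical illustration, not any firm's actual queue logic:

```python
import itertools
from dataclasses import dataclass

@dataclass
class Alert:
    alert_id: str
    risk_score: float  # assumed to come from the fraud model, 0.0-1.0

# Hypothetical analyst pool: seniors take the high-risk work, juniors the rest.
SENIOR_ANALYSTS = ["s1", "s2"]
JUNIOR_ANALYSTS = ["j1", "j2", "j3", "j4"]

# Old policy: round-robin ignores the model's output entirely.
_everyone = itertools.cycle(SENIOR_ANALYSTS + JUNIOR_ANALYSTS)

def assign_round_robin(alert: Alert) -> str:
    return next(_everyone)

# New policy: route on the score the model already produces,
# so senior time is spent where the risk actually sits.
_seniors = itertools.cycle(SENIOR_ANALYSTS)
_juniors = itertools.cycle(JUNIOR_ANALYSTS)

def assign_risk_weighted(alert: Alert, threshold: float = 0.7) -> str:
    pool = _seniors if alert.risk_score >= threshold else _juniors
    return next(pool)

for a in [Alert("A-101", 0.91), Alert("A-102", 0.12), Alert("A-103", 0.78)]:
    print(a.alert_id, "->", assign_risk_weighted(a))
```

The code is the trivial part, which is the point: the model's score was already there. What the sketch represents is a decision about who does what work, and that decision belongs to the named owner, not the technology team.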

Then add one line to the pilot's success criteria. The model working is the milestone before the real one. The pilot is finished when the operating-model change has happened and the freed capacity has a defined destination.

If you cannot name the operating-model change, the pilot is not ready for production, however well the model performs. It is a demo, and it is on its way to becoming a permanent cost.

There is a faster version of the same test for anyone reviewing a portfolio of pilots. Ask each pilot sponsor one question: when this goes to production, what stops happening, what starts happening, and who owns making that true? A sponsor who answers cleanly has done the operating-model work. A sponsor who answers in terms of the model's accuracy has not started it.

The test works in any vertical, because the failure mode is not sector-specific even though the technology is. A lender automating document collection has to decide what the loan processors do with the recovered hours, or the cycle time does not move. An insurer triaging claims with a model has to restructure the claims pods, or the claims sit in the same queues they always did. The model is the part that looks like the project. The operating-model change is the project.

Back to the bank. Eight months in, the fraud pilot is running, the metrics are green, and the capacity is gone. The honest account is this: the firm ran a successful pilot and never ran the project the pilot was meant to start. The model was the easy part. The operating-model change was the hard part, and it never got an owner.

Two of the three failure modes announce themselves. The third waits for the year-two budget review, when someone finally asks where the efficiency went. By then the pilot is long over. What is on the table is the business case, and whoever signed it.