How to Measure AI Automation ROI Before You've Shipped Anything

There is a short conversation that decides whether an AI pilot ever ships. It happens between a project sponsor and the CFO. The sponsor says they want to fund a pilot. The CFO asks what the return looks like. The sponsor produces a number that is either too vague to act on or too precise to be credible. The pilot does not get funded, or it gets funded for the wrong reasons.

This is avoidable. A pre-flight ROI model is not a forecast in the financial-planning sense. It is a structured argument about whether the savings exceed the cost under a set of stated assumptions. The model survives contact with the CFO when those assumptions are made explicit and the math is reproducible. It dies when the inputs are hand-waved.

The reason most AI pilots fail to produce ROI numbers is not that the math is hard. It is that vendors and internal sponsors routinely ship two-input models (cost of labour, percentage automated) when the math actually requires six. This article walks through the six inputs, shows a worked example for a contract-review pilot, and explains why skipping the third input makes every other number you produce wrong.

Why most AI pilots cannot produce ROI numbers

Three independent studies published between March 2025 and April 2026 painted the same picture. McKinsey's 2025 State of AI (March 2025, more than 1,400 respondents) found 39% of firms report any EBIT impact from GenAI, but only around 6% report 5% or more EBIT lift. The MIT NANDA report The GenAI Divide: State of AI in Business 2025 (August 2025, based on 150 leader interviews, a 350-employee survey, and 300 reviewed deployments) found 95% of enterprise GenAI pilots produce no measurable P&L impact. Gartner's April 2026 release on I&O AI use cases (n=782) found only 28% of AI use cases fully meet ROI expectations and 20% fail outright.

These reports do not disagree. They are measuring different parts of the same elephant. The common thread is that ROI was almost never modelled rigorously before the pilot. When ROI is modelled at all, it usually rests on two assumptions: the percentage of work the AI will do, and the cost of the human doing the work today. If those are the only two inputs, the model is a vibe.

The six inputs every pre-flight ROI model needs

A defensible pre-flight ROI model has six inputs. Skip one, and the number you produce is unreliable in a specific, predictable way. We will walk through each, then show how they compose in a worked example.

Input 1: Baseline cost

Baseline cost is the fully-loaded cost of doing the workflow today. The formula is: FTE count × fully-loaded hourly rate × hours per unit × units per FTE per year. Counting units per FTE matters: if you plug in the team's total annual volume instead, drop the FTE count, or headcount is applied twice. Most pilots skip this input because nobody has measured it. Without a baseline, 'savings' is a relative number with nothing on the other side of the comparison. If your project plan starts with 'we will save 30% on document review,' but you cannot say 30% of what, you do not have an ROI model. You have a marketing claim.
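As a minimal sketch, the baseline formula can be written as a small Python function. The function names, parameter names, and figures below are hypothetical and chosen only for illustration; units are counted per FTE so that headcount is applied only once.

```python
def baseline_cost(fte_count: float, hourly_rate: float,
                  hours_per_unit: float, units_per_fte_per_year: float) -> float:
    """Annual fully-loaded cost of running the workflow by hand today."""
    return fte_count * hourly_rate * hours_per_unit * units_per_fte_per_year


def implied_hours_per_fte(hours_per_unit: float,
                          units_per_fte_per_year: float) -> float:
    """Sanity check: annual hours each FTE would spend on this workflow."""
    return hours_per_unit * units_per_fte_per_year


# Hypothetical inputs: 3 analysts, $60/hour fully loaded,
# 15 minutes per unit, 6,000 units per analyst per year.
cost = baseline_cost(3, 60.0, 0.25, 6_000)
hours = implied_hours_per_fte(0.25, 6_000)
print(f"baseline = ${cost:,.0f}/year, {hours:,.0f} hours per FTE")
```

If the implied hours per FTE come out higher than one person can plausibly work in a year, one of the measurements is off and the baseline should not be trusted yet.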

Input 2: Volume in scope

Not all of the workflow is AI-addressable. Filter out the cases the model cannot safely handle: regulated edge cases, novel matter types, low-frequency exceptions, anything where the policy is still being written. The formula is: total units × percentage AI-addressable. The percentage is not 100, even when the vendor implies it is. Identifying it honestly is the cheapest hedge against an over-promised pilot.

Input 3: Acceptance bar (the input most often skipped)

This is the input that most vendor ROI calculators silently omit, and it is the one that, when you add it, changes the number most. The acceptance bar is the minimum quality the business will sign off on. For some workflows that is 90% extraction accuracy. For others it is 99.5% (legal sign-off, regulated medical records, audit work). The acceptance bar drives how much human review is required.

Input 4: Automation rate (straight-through processing)

The automation rate is the percentage of in-scope units the AI completes without a human touching them. It is distinct from accuracy. A model can be 95% accurate and still have a 30% automation rate if 70% of its outputs are routed to a human review pass. Automation rate is what your finance team actually feels: it is the share of work that flows through the pipeline without touching a paid human.

Input 5: Human-review and rework cost

What does the AI's output cost in human time after it is generated? The formula is: review hours on flagged items + rework hours on auto-approved items that fall short of the acceptance bar + sampling hours on auto-approved items as quality control. The third term is usually 5 to 10% of auto-approved volume. The middle term is what bites you when model quality is below the acceptance bar.
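The three terms can be sketched as one function that returns the hours for each. This is an illustration under a stated assumption, not a definitive costing method: the share of auto-approved items needing rework is taken to be the gap between the acceptance bar and the measured accuracy, which is how the worked example in this article counts it. All names and figures are hypothetical.

```python
def post_ai_human_hours(flagged_units: float, review_min: float,
                        auto_units: float, accuracy: float,
                        acceptance_bar: float, rework_min: float,
                        qa_sample_rate: float, qa_min: float) -> dict:
    """Break post-AI human time into its three terms, in hours."""
    # Assumption: rework share = shortfall between the bar and measured
    # accuracy, floored at zero when the model already clears the bar.
    shortfall = max(0.0, acceptance_bar - accuracy)
    return {
        "review": flagged_units * review_min / 60,
        "rework": auto_units * shortfall * rework_min / 60,
        "sampling": auto_units * qa_sample_rate * qa_min / 60,
    }


# Hypothetical pilot: 400 flagged items at 5 min each, 600 auto-approved,
# 90% accuracy against a 95% bar, 10 min per rework, 5% QA sample at 3 min.
hours = post_ai_human_hours(400, 5, 600, 0.90, 0.95, 10, 0.05, 3)
total_hours = sum(hours.values())
print({k: round(v, 1) for k, v in hours.items()}, round(total_hours, 1))
```

Keeping the three terms separate makes it easy to see which one dominates; the rework term drops to zero as soon as measured accuracy meets the bar.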

Input 6: Total cost of ownership (build plus run)

Pilot budgets routinely exclude prompt and eval engineering (typically 15 to 25% of build), evaluation-set maintenance, model-version churn, monitoring and drift remediation, and change management. Add it up: build, LLM API, integration, maintenance FTE, evaluation cost, change-management cost.

With all six inputs in hand, two outputs follow. Net annual savings = baseline cost minus (review and rework cost + year-one TCO). Payback period in months = year-one TCO divided by monthly net savings.
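The net-savings and payback formulas can be sketched in a few lines. The function names and the dollar figures are hypothetical. The payback calculation follows this article's definition, which divides TCO by savings that are already net of TCO; that is more conservative than dividing by gross savings.

```python
def net_annual_savings(baseline_cost: float, review_rework_cost: float,
                       tco_year_one: float) -> float:
    """Baseline cost minus everything the AI-assisted workflow costs."""
    return baseline_cost - (review_rework_cost + tco_year_one)


def payback_months(tco_year_one: float, net_savings: float) -> float:
    """Months for net savings to cover year-one TCO."""
    if net_savings <= 0:
        return float("inf")  # the pilot never pays back under these inputs
    return tco_year_one / (net_savings / 12)


# Hypothetical figures: $270k baseline, $90k review and rework, $60k TCO.
net = net_annual_savings(270_000, 90_000, 60_000)
months = payback_months(60_000, net)
print(f"net savings = ${net:,.0f}, payback = {months:.1f} months")
```

Returning infinity when net savings are zero or negative keeps a losing pilot from showing up as a negative, and therefore misleading, payback period.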

A worked example: contract review, one paralegal of capacity

Here is what the model looks like when you actually fill in the six inputs for a realistic professional services workflow.

Baseline cost: 1 FTE paralegal at fully-loaded $75 per hour, 2,000 hours per year reviewing contracts at ~12 contracts per hour, so 24,000 contracts per year. Baseline = $150,000 per year.
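Volume in scope: 100% of NDAs and standard MSAs, 0% of bespoke commercial contracts. The 24,000 baseline contracts are all NDAs and standard MSAs (bespoke work sits outside this paralegal's queue), so the net is 24,000 in scope.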
Acceptance bar: 99% extraction accuracy on the auto-approved bucket (legal sign-off requirement).
Automation rate: pilot model runs at 70% straight-through, 92% accuracy on the auto-approved bucket.
Human-review and rework: reviewer time on flagged items ≈ 4 minutes; QA sample on auto-approved is 10% at 2 minutes; rework on auto-approved errors below acceptance bar is roughly 8 minutes per remediated case.
TCO year one: ≈ $40,000 (build amortised, LLM API, eval-set maintenance, monitoring).

Now the math.

Auto-approved volume: 24,000 × 0.70 = 16,800 contracts.
QA sampling: 10% of 16,800 = 1,680 contracts × 2 min = 56 hours.
Human review of flagged items: 24,000 × 0.30 = 7,200 contracts × 4 min = 480 hours.
Rework on auto-approved errors below the 99% acceptance bar: 16,800 × (0.99 - 0.92) = 1,176 contracts × 8 min ≈ 157 hours.
Total post-AI human hours: 56 + 480 + 157 = 693 hours.
Labour cost post-AI: 693 × $75 = $51,975.
Year-one TCO: $40,000.
Net year-one savings: $150,000 - $51,975 - $40,000 = $58,025, roughly 39% of the baseline cost.
Payback: $40,000 ÷ ($58,025 ÷ 12) ≈ 8 months.
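To make the arithmetic reproducible, here is a sketch that recomputes the worked example from its raw inputs. The class and field names are illustrative, not part of any library. The sketch does not round intermediate values, so it reports 692.8 human hours and about $58,040 rather than the rounded 693 hours and $58,025 quoted above; the difference comes only from rounding rework to 157 hours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotInputs:
    units_in_scope: int = 24_000      # contracts per year
    hourly_rate: float = 75.0         # fully-loaded $/hour
    baseline_hours: float = 2_000.0   # paralegal hours per year today
    automation_rate: float = 0.70     # straight-through share
    accuracy_auto: float = 0.92       # measured on the auto-approved bucket
    acceptance_bar: float = 0.99
    review_min: float = 4.0           # minutes per flagged item
    qa_sample_rate: float = 0.10
    qa_min: float = 2.0               # minutes per sampled item
    rework_min: float = 8.0           # minutes per remediated item
    tco_year_one: float = 40_000.0


def year_one(p: PilotInputs) -> dict:
    auto = p.units_in_scope * p.automation_rate
    flagged = p.units_in_scope - auto
    # Rework volume: auto-approved items falling short of the bar.
    rework_units = auto * max(0.0, p.acceptance_bar - p.accuracy_auto)
    hours = (auto * p.qa_sample_rate * p.qa_min
             + flagged * p.review_min
             + rework_units * p.rework_min) / 60
    baseline = p.baseline_hours * p.hourly_rate
    net = baseline - hours * p.hourly_rate - p.tco_year_one
    return {
        "human_hours": hours,
        "net_savings": net,
        "savings_share": net / baseline,
        # Assumes net > 0, which holds for the default inputs.
        "payback_months": p.tco_year_one / (net / 12),
    }


result = year_one(PilotInputs())
print({k: round(v, 2) for k, v in result.items()})
```

Because every input is a named field, changing one assumption (for example `PilotInputs(acceptance_bar=0.95)`) reruns the whole model without touching the arithmetic.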

The acceptance-bar sensitivity (the headline insight for the CFO)

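Hold every other input constant and move only the acceptance bar. Drop the bar from 99% to 95%, and the rework volume collapses from 1,176 to 504 contracts. Year-one savings rise to roughly $65,000. Raise the bar to 99.5%, and rework climbs to 1,260 contracts; under the same linear rework formula, year-one savings slip to roughly $57,000. That swing of about $7,500, around 13% of the base case, is a floor rather than a ceiling, because it assumes the automation rate and the QA sampling rate stay put. The acceptance bar also drives how much human review is required, and once a stricter bar pushes more items out of the auto-approved bucket, review and rework hours can swamp the model. Same model, same vendor. The decision the business makes about the acceptance bar is one of the few inputs that moves the bottom line without anyone touching the technology.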
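A minimal sketch of this sweep, using the worked-example inputs and the same linear rework formula. It deliberately holds the automation rate and QA sampling rate fixed, so any knock-on effect of a stricter bar on how many items get routed to human review is not captured. The function name is illustrative.

```python
def net_year_one_savings(acceptance_bar: float,
                         automation_rate: float = 0.70) -> float:
    """Net year-one savings for the contract-review example, in dollars."""
    units, rate = 24_000, 75.0
    accuracy, tco, baseline = 0.92, 40_000.0, 150_000.0
    auto = units * automation_rate
    flagged = units - auto
    rework_units = auto * max(0.0, acceptance_bar - accuracy)
    # Minutes: 10% QA sample at 2 min, flagged review at 4 min, rework at 8 min.
    minutes = auto * 0.10 * 2 + flagged * 4 + rework_units * 8
    return baseline - minutes / 60 * rate - tco


for bar in (0.95, 0.99, 0.995):
    print(f"bar {bar:.1%}: net year-one savings = "
          f"${net_year_one_savings(bar):,.0f}")
```

To model a bar that also changes review volume, pass a lower `automation_rate` alongside the higher bar and compare the two results.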

This is why the question 'what is your acceptance bar' has to be answered before the pilot starts, not after. The vendor cannot answer it for you. They do not know your malpractice exposure, your client expectations, or your appetite for risk on a particular workflow. If the conversation about the acceptance bar does not happen, the ROI model is silently wrong, and the pilot is being scoped against an unstated assumption that nobody will verify.

Three traps that will blow up your model

Ignoring rework cost when model quality is below the acceptance bar. The pilot reports '85% accuracy.' If the business needs 98%, every output has to be reviewed, and a non-trivial slice corrected. The savings line item flips negative once you price reviewer time. MIT NANDA traces a substantial share of the 95% failure rate to this gap between technical metrics (accuracy) and business metrics (P&L).
Double-counting labour savings. A 0.3 FTE reclaim on each of 5 analysts is 1.5 FTE of capacity, not 1.5 FTE of cost out. Unless someone is reassigned, redeployed, or not backfilled, no dollars leave the P&L. CFOs see through this in one meeting.
Omitting build, evaluation, and maintenance TCO. Pilot budgets routinely exclude prompt and eval engineering, ongoing eval-set maintenance, model-version churn, monitoring, drift remediation, and change management. Gartner's findings on production-deployment complexity are consistent: organisations underestimate it by a wide margin, and the underestimate compounds over the first year.

Sponsor-funded ROI numbers: read with caution

You will see vendor-funded studies citing 3.7x ROI per dollar invested (the IDC / Microsoft Business Opportunity of AI report, November 2024) or higher. These numbers are real, but they are sponsor-funded, self-reported by survey, and skewed by survivorship: the firms that abandoned their pilots did not respond. Use them as an industry anchor, not a planning baseline. The defensible numbers come from your own six-input model, run on your own workflow, with your own acceptance bar.

What to take to the CFO

Walk into the conversation with three things. One: a one-page model with the six inputs filled in, with each input cited (where the baseline came from, who set the acceptance bar, what the vendor's measured quality is on the holdout set). Two: a sensitivity table showing how the bottom-line savings move when the acceptance bar moves and when the automation rate moves. Three: an explicit statement of what would have to be true for the model to be wrong, and how you would detect that during the pilot.
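The sensitivity table in the second item could be generated along these lines. This is a sketch under assumptions: it reuses the contract-review inputs as keyword defaults, holds measured accuracy constant while the automation rate varies (in practice the two usually trade off), and uses hypothetical function names and grid values.

```python
def net_savings(bar: float, automation_rate: float, *, units: int = 24_000,
                rate: float = 75.0, accuracy: float = 0.92,
                tco: float = 40_000.0, baseline: float = 150_000.0) -> float:
    """Net year-one savings under the linear rework model."""
    auto = units * automation_rate
    flagged = units - auto
    rework_units = auto * max(0.0, bar - accuracy)
    minutes = auto * 0.10 * 2 + flagged * 4 + rework_units * 8
    return baseline - minutes / 60 * rate - tco


def sensitivity_table(bars: list, automation_rates: list) -> str:
    """Rows are acceptance bars, columns are automation rates."""
    header = "bar / auto".ljust(12) + "".join(
        f"{a:>10.0%}" for a in automation_rates)
    rows = [header]
    for bar in bars:
        cells = "".join(f"{net_savings(bar, a):>10,.0f}"
                        for a in automation_rates)
        rows.append(f"{bar:<12.1%}" + cells)
    return "\n".join(rows)


table = sensitivity_table([0.95, 0.99, 0.995], [0.50, 0.70, 0.90])
print(table)
```

Each cell is net year-one savings in dollars, so the CFO can read across a row to see the effect of the automation rate and down a column to see the effect of the acceptance bar.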

If you can produce those three things, the pilot is worth funding regardless of what number falls out at the bottom. If you cannot, no number you produce is worth defending. The ROI model is the cheapest, fastest hedge against the failure modes that make 95% of GenAI pilots fail to ship measurable value.