Prototyping · 11 June 2026 · 6 min read · Includes interactive tool

The MVP for AI: test the wizard before you build the machine

Marc Gasser, Serial Founder · GTM & Marketing

Pillar: Product Execution

Relevant phases

01 Discover · 02 Define · 03 Build · 04 Operate · Fast Lane

The MVP question for AI is not “can we build the model?” but “does user behaviour change once the answer is there?” A human behind the curtain or a foundation model can answer that question long before custom ML burns money.
The Wizard-of-Oz method goes back to John F. Kelley's experiments at IBM, published in 1984: users interact with a seemingly automatic system that is in fact operated by a human. For GenAI products it is more relevant than ever, because it tests value and UX at perfect “model quality”.
Three rungs before the custom model: Wizard of Oz (days), foundation model with prompt and your data (days to weeks), fine-tuning (weeks). Each rung answers a sharper question at a fraction of the next one's cost.

Key findings

Two of the abandonment reasons Gartner names for GenAI projects, escalating costs and unclear business value, are exactly the risks a Wizard-of-Oz test exposes early, for a few hundred francs.
The biggest insight from a simulated MVP is often negative and still worth its weight in gold: users don't actually want the answer, don't trust it, or need it elsewhere in the workflow.
Foundation models have shifted the custom-ML threshold drastically: training your own pays off only once prompting plus retrieval on real data demonstrably hits the quality ceiling.

The most expensive way to be wrong

The classic enterprise AI project arc: six months of collecting data, training the model and integrating it, and then it turns out users try the feature twice and never touch it again. The model was good. The hypothesis was wrong. It is exactly this ordering that makes AI projects expensive: the biggest uncertainty (does anyone want this?) gets tested last, the smallest (can we get a model working?) first.

MVP thinking flips the order. Eric Ries' definition, the smallest experiment that produces validated learning about customers, fits AI better than anything else, because simulation is so cheap here: a human can type the “AI answer”. John F. Kelley described it in 1984 from his work at IBM, where he tested a natural-language interface whose language understanding didn't exist yet, with a hidden operator supplying the answers. The term for it: Wizard of Oz.

The three rungs before your own model

Rung 1: Wizard of Oz. A human supplies the answers, the interface pretends to be automatic. Tests in days whether the result changes behaviour – at perfect quality. If users ignore even perfect output, no model will save the feature.

Rung 2: foundation model. An LLM with a good prompt and retrieval over your real data. Tests whether machine quality gets close enough to the wizard – and produces your first eval set from real interactions as a by-product.

Rung 3: fine-tuning and custom ML. Only once rung 2 demonstrably hits the quality ceiling and the business case stands. Now the cost is justified – by data from real usage instead of slides.
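The ladder above can be read as a small decision procedure. The following is a minimal sketch under stated assumptions, not a definitive implementation: the `Evidence` fields and the wording of the returned recommendations are hypothetical names chosen for illustration, and each field stands for a judgement your team makes after running the corresponding test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    # None means the test for that rung has not been run yet (hypothetical fields).
    wizard_changed_behaviour: Optional[bool] = None
    foundation_model_good_enough: Optional[bool] = None
    business_case_stands: bool = False


def next_step(e: Evidence) -> str:
    """Return the next move on the ladder, climbing the rungs in order."""
    if e.wizard_changed_behaviour is None:
        return "run the Wizard-of-Oz test"
    if not e.wizard_changed_behaviour:
        # Users ignored output at perfect quality: no model will save the feature.
        return "stop: the hypothesis is refuted"
    if e.foundation_model_good_enough is None:
        return "run the foundation-model test"
    if e.foundation_model_good_enough:
        return "stay on rung 2"
    if e.business_case_stands:
        # Quality ceiling reached and the numbers hold: custom work is justified.
        return "invest in fine-tuning or custom ML"
    return "stop or rescope: quality ceiling without a business case"


print(next_step(Evidence(wizard_changed_behaviour=True)))
```

The point of encoding it this way is that rung 3 is only reachable through two earlier pieces of evidence, which mirrors the order the article argues for.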

Interactive tool

Wizard-of-Oz readiness: can you fake it before you build it?

The core question is phrased: which user behaviour do we want to prove?
A human could supply the AI answer behind the scenes (Wizard-of-Oz-able).
A foundation model with a good prompt roughly covers the use case – without custom training.
5 to 10 real users are reachable for the test.
The test's success metric is defined (usage, willingness to pay, time saved).
The test setup is honest about compliance: no real personal data in the experiment.

Without a core question and success metric your MVP measures nothing. Phrase the hypothesis first – building is the easy part.

Tick what applies to your AI initiative – the result shows whether you can start with a simulated MVP or groundwork is missing.
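For readers without the interactive version, the six-point checklist can be scored by hand or with a few lines of code. This is a rough sketch, assuming a scoring rule the article does not spell out: the only things taken from the text are the six items, the "Too early" label, and the statement that a missing core question or success metric means the MVP measures nothing. The item keys and the labels "Ready" and "Close the gaps first" are hypothetical.

```python
# Hypothetical keys for the six checklist items, in the order listed above.
CHECKLIST = (
    "core_question",
    "wizard_possible",
    "foundation_model_covers",
    "users_reachable",
    "success_metric",
    "compliance_clean",
)

# Without these two the test measures nothing, whatever else is ticked.
ESSENTIAL = {"core_question", "success_metric"}


def readiness(ticked: set) -> tuple:
    """Return (score, verdict) for a set of ticked checklist keys."""
    unknown = ticked - set(CHECKLIST)
    if unknown:
        raise ValueError(f"unknown checklist items: {sorted(unknown)}")
    score = len(ticked)
    if not ESSENTIAL <= ticked:
        return score, "Too early"
    if score == len(CHECKLIST):
        return score, "Ready"  # assumed label
    return score, "Close the gaps first"  # assumed label


score, verdict = readiness({"core_question", "success_metric", "users_reachable"})
print(f"{score} of {len(CHECKLIST)}: {verdict}")
```

Treating the core question and the success metric as gates rather than as two points among six is a design choice: a high score without them would still describe a test that cannot produce a decision.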

How to fake without deceiving

Wizard of Oz means simulated automation, not simulated ethics. Three rules keep the test clean: first, no real personal data in the experiment – the operator behind the curtain sees what users type. Second, keep response times realistic, or you validate a UX that will never exist. Third, debrief after the test, especially with B2B pilot customers: “that was a concept test” is a trust signal in DACH, not a loss of face.

And define up front what passing means: repeat usage in week two? Willingness to pay in the conversation? A concrete time saving? An MVP without a success metric produces anecdotes. With one, it produces a decision – build, rebuild or bury.
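One way to make the pass criterion concrete is to compute it from a plain usage log and compare it against thresholds fixed before the test starts. The sketch below assumes repeat usage in week two as the metric; the function names and the threshold values (0.5 and 0.2) are placeholders for illustration, not recommendations from the article.

```python
from datetime import date, timedelta


def week_two_repeat_rate(events, start: date) -> float:
    """Share of week-one users who came back in week two.

    events: iterable of (user_id, day) pairs; start: first day of the test.
    """
    events = list(events)  # allow generators, since we scan twice
    week1_end = start + timedelta(days=7)
    week2_end = start + timedelta(days=14)
    week1 = {user for user, day in events if start <= day < week1_end}
    week2 = {user for user, day in events if week1_end <= day < week2_end}
    if not week1:
        return 0.0
    return len(week1 & week2) / len(week1)


def decide(rate: float, build_at: float = 0.5, bury_below: float = 0.2) -> str:
    """Map the metric to build, rebuild or bury. Thresholds are assumed values."""
    if rate >= build_at:
        return "build"
    if rate < bury_below:
        return "bury"
    return "rebuild"


start = date(2026, 6, 1)
log = [
    ("a", date(2026, 6, 1)), ("b", date(2026, 6, 2)),
    ("c", date(2026, 6, 3)), ("d", date(2026, 6, 4)),
    ("a", date(2026, 6, 9)), ("b", date(2026, 6, 12)),
    ("e", date(2026, 6, 10)),  # first seen in week two, so not a repeat user
]
rate = week_two_repeat_rate(log, start)
print(rate, decide(rate))
```

With 5 to 10 test users the rate is a coarse signal, so the value of writing the thresholds down lies less in precision than in committing to the decision rule before the results come in.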

Recommendations

Ask the behaviour question first. Before any AI investment, phrase: which user behaviour must change for this to pay off? Then test exactly that – with the cheapest means.
Climb the rungs in order. Wizard of Oz before foundation model before fine-tuning. Every skipped rung is untested risk you pay for in engineering months.
Collect the eval set in the MVP. Every real interaction from rungs 1 and 2 is gold later: it becomes the test set you measure model quality against before anything goes live.
Bury without mourning. An MVP that refutes the hypothesis has done its job. Document the learning in the backlog – it is the cheapest protection against a rerun in two years.
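Collecting the eval set during the MVP needs very little tooling. As a minimal sketch, assuming a JSON Lines file as storage: the `Interaction` fields, the function names and the idea of marking whether the user acted on an answer are hypothetical choices for illustration, and in line with the compliance rule the records should contain no real personal data.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Interaction:
    rung: int         # 1 = Wizard of Oz, 2 = foundation model
    user_input: str   # what the user asked (no real personal data)
    answer: str       # what the wizard or the model returned
    accepted: bool    # hypothetical flag: did the user act on the answer?


def append_interaction(path: Path, item: Interaction) -> None:
    """Append one interaction as a JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(item), ensure_ascii=False) + "\n")


def load_eval_set(path: Path) -> list:
    """Read all logged interactions back as Interaction objects."""
    with path.open(encoding="utf-8") as f:
        return [Interaction(**json.loads(line)) for line in f if line.strip()]


def accepted_only(items: list) -> list:
    """Keep the answers users acted on, as reference cases for later models."""
    return [item for item in items if item.accepted]
```

Wizard answers that users accepted are natural reference cases, because they show what a good answer looked like for a real request. How a later model is scored against them (exact match, rubric, human review) is a separate decision that this sketch leaves open.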

Scope & caveats

Wizard of Oz doesn't scale, and it distorts results for long, domain-heavy answers: where the operator needs expert knowledge, you are testing the expert, not the product. In that case go straight to rung 2 with a narrow scope.
Foundation-model-first applies to typical language and knowledge use cases. For highly specialised domains (sensor data, medical imaging) custom ML may be needed earlier – then the experiment belongs in the roadmap's research lane.

The best AI MVP often contains no AI at all: it proves the value before the machine exists. Test the wizard before you build the machine and you only ever invest in features whose demand is already proven.
