Agile & delivery · 11 June 2026 · 6 min read · Includes interactive tool

Agile for AI: sprints that survive model training

Marc Gasser · Serial Founder · GTM & Marketing

Pillar: Product Execution

Relevant phases

01 Discover · 02 Define · 03 Build · 04 Operate · Fast Lane

The core conflict: Scrum plans deliveries, ML produces findings. A model can be worse after two weeks of work than before – that is not a broken sprint, it is an experiment result. Processes must carry both.
The solution is not a new method but a clean separation: deterministic product work in stories with commitment, model work in time-boxed spikes with a learning goal, quality threshold and kill criterion.
Process models such as CRISP-ML(Q) and Continuous Delivery for Machine Learning (CD4ML, Sato/Wider/Windheuser) supply the missing phases – data understanding, evaluation, monitoring – that classic Scrum simply doesn't know.

Key findings

Teams that estimate model work as stories with story points are really estimating research duration – and systematically produce broken commitments and stakeholder distrust.
The definition of done is the most effective lever: for AI features, done means “reaches the quality threshold on the eval set”, not “code merged, demo worked”.
The DORA finding on AI adoption applies here too: speed without discipline lowers delivery stability. Small change sets and evaluation before merge are the answer, not longer sprints.

Where Scrum breaks on model work

Scrum works because software effort is roughly plannable: understand the story and you can estimate it. Model work breaks that coupling. Whether a fine-tune reaches the quality threshold, whether a RAG setup answers precisely enough on your data – you only know after the attempt. Two weeks of work can deliver zero visible progress and still be valuable because they rule out a dead end.

Process models from the ML world fill the gap Scrum leaves open: CRISP-ML(Q) (Studer et al.) describes the lifecycle from business understanding through data preparation to monitoring – with quality assurance in every phase. Continuous Delivery for Machine Learning (CD4ML, described by Sato, Wider and Windheuser on martinfowler.com) transfers CD principles to models: small steps, automated pipelines, reproducible releases. Neither replaces your agile framework – they name the work your framework has so far made invisible.

Spikes instead of stories: how to separate learning from delivering

Model work as a spike. Time-box instead of estimate, learning goal instead of acceptance criteria: “in five days we know whether approach A reaches the 90 per cent threshold on the eval set.” The spike succeeds when it answers the question – in either direction.

Product work as a story. UI, data plumbing, logging, guardrails, fallbacks – everything around the model is deterministic and belongs in normal stories with a normal commitment. That way the sprint shows visible progress even when the experiment fails.

Data as separate items. Sourcing, cleaning and labelling data and building an eval set is often half the effort – CRISP-ML(Q) dedicates whole phases to it. Hide that work inside “build model” and the real effort can easily be double the plan.
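The separation described above can be sketched as a small backlog data model. This is a minimal illustration, not a prescribed schema: the names `Story`, `Spike`, `SpikeOutcome`, `close_spike` and `sprint_commitment` are hypothetical, and the kill criterion is modelled simply as the time-box running out without a measurement, which is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Union


class SpikeOutcome(Enum):
    THRESHOLD_REACHED = "threshold reached"
    RULED_OUT = "approach ruled out"          # a valid answer, not a failure
    KILLED = "killed without an answer"
    OPEN = "still inside the time-box"


@dataclass(frozen=True)
class Story:
    """Deterministic product work: estimated and committed."""
    title: str
    story_points: int


@dataclass(frozen=True)
class Spike:
    """Model work: time-boxed, with a learning goal instead of an estimate.

    Frozen on purpose: the threshold is fixed before the spike starts.
    """
    learning_goal: str
    time_box_days: int
    quality_threshold: float


def close_spike(spike: Spike, days_spent: int, eval_score: Optional[float]) -> SpikeOutcome:
    """Decide the spike's state from time spent and the measured eval score."""
    if eval_score is not None and eval_score >= spike.quality_threshold:
        return SpikeOutcome.THRESHOLD_REACHED
    if days_spent < spike.time_box_days:
        return SpikeOutcome.OPEN
    if eval_score is None:
        # Assumed kill criterion: time-box used up and nothing measured.
        return SpikeOutcome.KILLED
    return SpikeOutcome.RULED_OUT


def sprint_commitment(items: Iterable[Union[Story, Spike]]) -> int:
    """Only stories count toward the commitment; spikes carry no points."""
    return sum(item.story_points for item in items if isinstance(item, Story))


spike = Spike("Does approach A reach the 90 per cent threshold on the eval set?", 5, 0.90)
backlog = [Story("Fallback UI", 3), spike, Story("Request logging", 2)]
print(sprint_commitment(backlog))           # 5
print(close_spike(spike, 5, 0.84).value)    # approach ruled out
```

The spike that ends below the threshold still returns a definite outcome, which matches the idea that a spike succeeds when it answers its question in either direction.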

Interactive tool

The ML readiness check for your sprint setup

Model work runs as a time-boxed spike with a learning goal, not a story with an estimate.
Every ML initiative has a measurable quality threshold (a definition of good enough).
Experiments have a kill criterion that actually gets enforced.
Data sourcing and preparation are separate backlog items with owners.
The definition of done includes evaluation, not just merged code.
Deterministic product work and model experiments don't block each other in the same sprint commitment.
Review shows metrics on eval sets, not just demos on feel-good examples.

Your result: 0 of 7 · Friction loss

Your process treats experiments like deliveries – that produces broken sprints and frustrated teams. Start with spikes and quality thresholds.

Tick what already applies in your delivery process – the result shows whether your setup can carry model work or grinds it down.
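For teams that want to run the check outside the page, the seven statements can be scored with a few lines of code. This is a minimal sketch: `CHECKLIST` shortens the statements above, `readiness` is a hypothetical helper, and because the text only describes the 0-of-7 result, the sketch counts ticks and lists the gaps without inventing further score bands.

```python
from typing import Iterable, List, Tuple

# Shortened versions of the seven statements from the readiness check.
CHECKLIST = (
    "Model work runs as a time-boxed spike with a learning goal",
    "Every ML initiative has a measurable quality threshold",
    "Experiments have an enforced kill criterion",
    "Data sourcing and preparation are separate backlog items with owners",
    "The definition of done includes evaluation",
    "Product work and model experiments do not share one commitment",
    "Review shows metrics on eval sets, not just demos",
)


def readiness(ticked: Iterable[int]) -> Tuple[int, List[str]]:
    """Return the score and the statements that are not yet fulfilled.

    `ticked` holds zero-based indices into CHECKLIST.
    """
    ticked_set = set(ticked)
    unknown = ticked_set - set(range(len(CHECKLIST)))
    if unknown:
        raise ValueError(f"unknown checklist indices: {sorted(unknown)}")
    missing = [item for i, item in enumerate(CHECKLIST) if i not in ticked_set]
    return len(ticked_set), missing


score, gaps = readiness([0, 1, 4])
print(f"{score} of {len(CHECKLIST)}")
for gap in gaps:
    print("open:", gap)
```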

The definition of done AI features need

With deterministic code, tests prove correctness. With probabilistic features they only prove the frame – the real question is: how good is it on representative cases? That is why the definition of done needs an eval step: a defined test set, a defined metric, a defined threshold. A review that only shows a demo on three hand-picked examples tests charisma, not quality.
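The three ingredients named above (a defined test set, a defined metric, a defined threshold) fit into one small function. The following is a sketch under assumptions: exact-match accuracy stands in for whatever metric suits the feature, and `evaluate`, `EvalResult` and the toy lookup table are hypothetical names and data, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class EvalResult:
    metric: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        """Done means: the score reaches the threshold on the eval set."""
        return self.score >= self.threshold


def evaluate(
    predict: Callable[[str], str],
    eval_set: Sequence[Tuple[str, str]],
    threshold: float,
) -> EvalResult:
    """Score `predict` on (input, expected) pairs with exact-match accuracy."""
    if not eval_set:
        raise ValueError("eval set must not be empty")
    correct = sum(1 for question, expected in eval_set if predict(question) == expected)
    return EvalResult("exact-match accuracy", correct / len(eval_set), threshold)


# Toy stand-in for a model: a lookup table with one wrong and one missing answer.
answers = {"2+2": "4", "capital of France": "Paris", "3*3": "6"}
eval_set = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9"), ("sky colour", "blue")]

result = evaluate(lambda q: answers.get(q, ""), eval_set, threshold=0.9)
print(result.score, result.passed)  # 0.5 False
```

A demo on the first two examples alone would have looked flawless. The full eval set shows that the feature misses its threshold.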

And because models change with data, prompts and model versions, the eval pipeline is not a one-off artefact but part of CI – that is the core of CD4ML. Every prompt change runs against the same eval set the way every code change runs against the tests. That keeps “it got better” a measurement instead of an opinion, sprint after sprint.
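A CI gate for this can be quite small. The sketch below assumes that each eval run is stored as a dictionary holding a score and a fingerprint of the eval set. The names `eval_set_fingerprint` and `ci_gate`, the exit-code convention and the `max_regression` parameter are all hypothetical choices for illustration.

```python
import hashlib
import json
from typing import Dict, Sequence, Tuple


def eval_set_fingerprint(eval_set: Sequence[Tuple[str, str]]) -> str:
    """Hash the eval set so two runs can prove they used the same cases."""
    payload = json.dumps([list(pair) for pair in eval_set], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ci_gate(
    candidate: Dict[str, object],
    baseline: Dict[str, object],
    threshold: float,
    max_regression: float = 0.0,
) -> Tuple[int, str]:
    """Return (exit_code, message); a non-zero code should fail the build."""
    if candidate["fingerprint"] != baseline["fingerprint"]:
        return 2, "eval set changed: scores are not comparable"
    if candidate["score"] < threshold:
        return 1, "below quality threshold"
    if candidate["score"] < baseline["score"] - max_regression:
        return 1, "regression against baseline"
    return 0, "ok"


eval_set = [("q1", "a1"), ("q2", "a2")]
fingerprint = eval_set_fingerprint(eval_set)
baseline = {"fingerprint": fingerprint, "score": 0.91}

print(ci_gate({"fingerprint": fingerprint, "score": 0.93}, baseline, threshold=0.90))
print(ci_gate({"fingerprint": fingerprint, "score": 0.905}, baseline, threshold=0.90))
```

In a CI job the returned code would be handed to `sys.exit`, so a prompt change that lowers the score blocks the merge the same way a failing unit test does. The fingerprint check guards against a quietly edited eval set making "it got better" unmeasurable.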

Recommendations

Split the backlog into delivering and learning. Stories with commitment for everything deterministic, spikes with time-box, learning goal and kill criterion for model work. Never mix them in the same commitment.
Define good enough before the spike. Quality threshold and eval set exist before anyone trains or prompts. Otherwise the goal moves with every result.
Extend the definition of done. No AI feature is done without a passed evaluation and an active monitoring signal. “The demo worked” is not a done criterion.
Make learning visible. Report both currencies in review: shipped product progress and answered research questions. Stakeholders accept negative experiment results – but not invisible ones.

Scope & caveats

CRISP-ML(Q) and CD4ML are process models, not guarantees – they structure work whose outcome stays open. Adopt the phases, not the bureaucracy: a two-pizza team needs the discipline, not the paperwork.
With foundation models and prompting instead of custom training, the experimental share often shrinks considerably – sometimes a lean eval pipeline suffices. First check how much research your feature actually contains.

Agile doesn't die of AI – it dies of mixing delivering with learning. Build spikes, quality thresholds and eval pipelines into the cycle and you get both: reliable delivery and honest experiments.

