Agile & delivery · 11 June 2026 · 6 min read

Agile for AI: sprints that survive model training

Relevant phases
01 Discover · 02 Define · 03 Build · 04 Operate · Fast Lane

TL;DR

  • The core conflict: Scrum plans deliveries, ML produces findings. A model can be worse after two weeks of work than before – that is not a broken sprint, it is an experiment result. Processes must carry both.
  • The solution is not a new method but a clean separation: deterministic product work in stories with commitment, model work in time-boxed spikes with a learning goal, quality threshold and kill criterion.
  • Process models such as CRISP-ML(Q) and Continuous Delivery for Machine Learning (CD4ML, Sato/Wider/Windheuser) supply the missing phases – data understanding, evaluation, monitoring – that classic Scrum has no place for.

Key findings

  • Teams that estimate model work as stories with story points are really estimating research duration – and systematically produce broken commitments and stakeholder distrust.
  • The definition of done is the most effective lever: for AI features, done means “reaches the quality threshold on the eval set”, not “code merged, demo worked”.
  • DORA's finding on AI-assisted delivery applies here too: speed without discipline lowers stability. The answer is small change sets and evaluation before merge, not longer sprints.

Where Scrum breaks on model work

Scrum works because software effort is roughly plannable: understand the story and you can estimate it. Model work breaks that coupling. Whether a fine-tune reaches the quality threshold, whether a RAG setup answers precisely enough on your data – you only know after the attempt. Two weeks of work can deliver zero visible progress and still be valuable because they rule out a dead end.

Process models from the ML world fill the gap Scrum leaves open: CRISP-ML(Q) (Studer et al.) describes the lifecycle from business understanding through data preparation to monitoring – with quality assurance in every phase. Continuous Delivery for Machine Learning (CD4ML, described by Sato, Wider and Windheuser on martinfowler.com) transfers CD principles to models: small steps, automated pipelines, reproducible releases. Neither replaces your agile framework – they name the work your framework has so far made invisible.

Spikes instead of stories: how to separate learning from delivering

Model work as a spike. Time-box instead of estimate, learning goal instead of acceptance criteria: “in five days we know whether approach A reaches the 90 per cent threshold on the eval set.” The spike succeeds when it answers the question – in either direction.

Product work as a story. UI, data plumbing, logging, guardrails, fallbacks – everything around the model is deterministic and belongs in normal stories with a normal commitment. That way the sprint shows visible progress even when the experiment fails.

Data as separate items. Sourcing, cleaning and labelling data and building an eval set are often half the effort – CRISP-ML(Q) dedicates whole phases to this work. Hide it inside a “build model” item and that item can easily take twice as long as planned.
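A spike as described above can be written down as a small record before the sprint starts. The following is a minimal sketch, not a prescribed format: the `Spike` fields, the `kill_floor` name and the outcome labels are illustrative assumptions, and the kill criterion is assumed to be "score below a floor, or time-box used up without reaching the threshold".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Spike:
    """A time-boxed experiment. All field names are illustrative assumptions."""
    question: str             # learning goal, phrased so it can be answered yes or no
    timebox_days: int         # fixed budget that replaces the estimate
    quality_threshold: float  # target score on the eval set, e.g. 0.90
    kill_floor: float         # assumed kill criterion: below this, drop the approach


def spike_outcome(spike: Spike, eval_score: Optional[float], days_used: int) -> str:
    """Classify the state of a spike for the sprint review."""
    if eval_score is None:
        # No measurement yet: only the time-box can end the spike.
        return "unanswered" if days_used >= spike.timebox_days else "running"
    if eval_score >= spike.quality_threshold:
        return "answered: approach works"
    if eval_score < spike.kill_floor or days_used >= spike.timebox_days:
        # A negative answer still counts as an answer.
        return "answered: approach dropped"
    return "running"


spike = Spike(
    question="Does approach A reach the 90 per cent threshold on the eval set?",
    timebox_days=5,
    quality_threshold=0.90,
    kill_floor=0.70,
)
print(spike_outcome(spike, eval_score=0.80, days_used=5))
```

Both "answered" outcomes close the spike. Only "unanswered" means the time-box ran out without a measurement, which is the case worth discussing in the retrospective.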

The definition of done AI features need

With deterministic code, tests prove correctness. With probabilistic features they only prove that the surrounding code works – the real question is: how good is the feature on representative cases? That is why the definition of done needs an eval step: a defined test set, a defined metric, a defined threshold. A review that only shows a demo on three hand-picked examples tests charisma, not quality.
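The three ingredients named above (test set, metric, threshold) fit into a few lines. This is a minimal sketch under stated assumptions: exact-match accuracy stands in for whatever metric suits the feature, and `predict`, `exact_match_accuracy` and `meets_definition_of_done` are hypothetical names, not part of any library.

```python
from typing import Callable, Sequence, Tuple

EvalCase = Tuple[str, str]  # (input, expected output)


def exact_match_accuracy(predict: Callable[[str], str],
                         cases: Sequence[EvalCase]) -> float:
    """Share of eval cases where the prediction equals the expected output."""
    if not cases:
        raise ValueError("eval set must not be empty")
    hits = sum(
        1 for prompt, expected in cases
        if predict(prompt).strip() == expected.strip()
    )
    return hits / len(cases)


def meets_definition_of_done(predict: Callable[[str], str],
                             cases: Sequence[EvalCase],
                             threshold: float) -> bool:
    """The eval step of the definition of done: metric on the set >= threshold."""
    return exact_match_accuracy(predict, cases) >= threshold


# Stand-in for a real model call, for illustration only.
canned_answers = {"2+2": "4", "3*3": "9", "1+1": "2", "capital of France": "Lyon"}
eval_set = [("2+2", "4"), ("3*3", "9"), ("1+1", "2"), ("capital of France", "Paris")]


def stub_model(prompt: str) -> str:
    return canned_answers[prompt]


print(exact_match_accuracy(stub_model, eval_set))            # 3 of 4 cases match
print(meets_definition_of_done(stub_model, eval_set, 0.90))  # below the threshold
```

The stub gets three of four cases right, so a demo on the first three examples would look flawless while the feature misses a 90 per cent threshold.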

And because models change with data, prompts and model versions, the eval pipeline is not a one-off artefact but part of CI – that is the core of CD4ML. Every prompt change runs against the same eval set the way every code change runs against the tests. That keeps “it got better” a measurement instead of an opinion, sprint after sprint.
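In a CI pipeline, that check can reduce to one function that turns an eval score into an exit code. The sketch below assumes that the score of the last accepted version is stored as JSON in a baseline file. The file layout, the `tolerance` parameter and the name `ci_eval_gate` are illustrative assumptions rather than part of CD4ML.

```python
import json
from pathlib import Path


def ci_eval_gate(candidate_score: float,
                 baseline_path: Path,
                 threshold: float,
                 tolerance: float = 0.01) -> int:
    """Return an exit code: 0 lets the merge proceed, 1 blocks it.

    Blocks when the candidate misses the absolute quality threshold or
    scores worse than the stored baseline by more than `tolerance`.
    """
    baseline = None
    if baseline_path.exists():
        # Assumed layout: {"score": 0.92}
        baseline = json.loads(baseline_path.read_text())["score"]

    if candidate_score < threshold:
        print(f"FAIL: {candidate_score:.3f} is below threshold {threshold:.3f}")
        return 1
    if baseline is not None and candidate_score < baseline - tolerance:
        print(f"FAIL: {candidate_score:.3f} regresses from baseline {baseline:.3f}")
        return 1
    print(f"PASS: {candidate_score:.3f}")
    return 0
```

A CI job would compute `candidate_score` on the fixed eval set and pass the result to `sys.exit`. The tolerance absorbs run-to-run noise, which matters because a probabilistic feature rarely scores identically twice.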

Recommendations

  • Split the backlog into delivering and learning. Stories with commitment for everything deterministic, spikes with time-box, learning goal and kill criterion for model work. Never mix them in the same commitment.
  • Define good enough before the spike. Quality threshold and eval set exist before anyone trains or prompts. Otherwise the goal moves with every result.
  • Extend the definition of done. No AI feature is done without a passed evaluation and an active monitoring signal. “The demo worked” is not a done criterion.
  • Make learning visible. Report both currencies in review: shipped product progress and answered research questions. Stakeholders accept negative experiment results – but not invisible ones.

Scope & caveats

  • CRISP-ML(Q) and CD4ML are process models, not guarantees – they structure work whose outcome stays open. Adopt the phases, not the bureaucracy: a two-pizza team needs the discipline, not the paperwork.
  • With foundation models and prompting instead of custom training, the experimental share often shrinks considerably – sometimes a lean eval pipeline suffices. First check how much research your feature actually contains.

The takeaway

Agile doesn't die of AI – it dies of mixing delivering with learning. Build spikes, quality thresholds and eval pipelines into the cycle and you get both: reliable delivery and honest experiments.

Simon Scheurer, Mathias Wegmüller, Marc Gasser