Model drift: when your product quietly degrades
Marc Gasser
Serial Founder · GTM & Marketing
Connects AI with revenue operations and builds autonomous GTM systems for predictable growth.
TL;DR
- The study “Temporal quality degradation in AI models” (Vela et al., Scientific Reports 2022; Harvard, MIT, University of Monterrey, Cambridge) found measurable quality decay over time in 91 per cent of 128 model–dataset pairs. Degradation is the norm, not the exception.
- Drift is a product problem, not purely an ML problem: the world changes (data drift), meaning changes (concept drift), and with bought-in LLMs the provider changes the model under your feet. The PM needs signals, thresholds and a response path.
- The minimum setup has three parts: a fixed eval set running regularly against production; user feedback as a live signal; and a rollback path with clear ownership. Retraining needs triggers, not a calendar.
Key findings
- Vela et al. also show the opposite of intuition: some models capture drifting processes well and barely age – so degradation is measurable and manageable, but not predictable without monitoring.
- LLM features add a new drift source: silent model updates by the provider and creeping prompt changes in your own team. Versioning model and prompt per response is therefore mandatory, not optional.
- Drift eats trust faster than functionality: users forgive a feature that's missing – but not one that was better last month. That is why drift belongs in product review, not just the ops dashboard.
Why models age although nobody changes anything
Classic software is as good tomorrow as today, as long as nobody touches it. AI features are not: they are trained on a snapshot of the world, and the world moves on. New product names, changed customer behaviour, different ticket topics – inputs wander away from the training state (data drift), or the meaning behind them flips (concept drift): what counted as “urgent” in 2024 is routine in 2026.
The study “Temporal quality degradation in AI models” (Vela et al., Scientific Reports 2022) quantified how widespread this is: across 128 combinations of four model types and 32 datasets from healthcare, transport, finance and weather, 91 per cent of cases showed quality decay over time, which the authors call AI ageing. The remainder is remarkable: some models barely age. No spec sheet tells you whether yours is one of them – only measurement does.
Monitoring a PM can steer
The eval set as the fixed star. A frozen set of representative cases that runs weekly against the live system. If the score falls, it is not your taste that changed but the system. It is the most objective drift measurement – and costs one afternoon of setup.
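A frozen eval set can be very small in code. The following is a minimal sketch, not a definitive implementation: the names EvalCase, run_eval and fake_predict are hypothetical, the four cases are invented for illustration, and exact string matching stands in for whatever scoring fits your feature.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One frozen case: an input and the result we expect for it."""
    case_id: str
    prompt: str
    expected: str


def run_eval(cases: list[EvalCase], predict: Callable[[str], str]) -> float:
    """Return the share of cases whose output matches the expected result."""
    if not cases:
        raise ValueError("eval set must not be empty")
    hits = sum(
        1
        for case in cases
        if predict(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return hits / len(cases)


def fake_predict(prompt: str) -> str:
    """Hypothetical stand-in for the live system (assumption for this sketch)."""
    return "urgent" if "outage" in prompt.lower() else "routine"


# Illustrative cases only; a real set would hold 50 to 200 of them.
EVAL_SET = [
    EvalCase("c1", "Full outage at customer site", "urgent"),
    EvalCase("c2", "Question about invoice layout", "routine"),
    EvalCase("c3", "Password reset request", "routine"),
    EvalCase("c4", "Data loss after update", "urgent"),
]

score = run_eval(EVAL_SET, fake_predict)
print(f"eval score: {score:.2f}")
```

Because the cases never change, a falling score between two weekly runs can only come from the system side: model, prompt or data.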
User signals as early warning. Thumbs down, manual corrections, retry rates, mid-flow abandonment. Each signal is noisy on its own but reliable as a trend – rising correction rates often show drift weeks before the eval set, because real usage is broader than any test set.
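Reading a trend rather than single values can be sketched as a comparison of two adjacent windows. This is one simple approach under stated assumptions: the function names, the four-week window and the two-percentage-point minimum increase are illustrative choices, and the weekly counts are made up.

```python
from statistics import mean


def weekly_rates(corrections: list[int], responses: list[int]) -> list[float]:
    """Correction rate per week; weeks without any responses are skipped."""
    if len(corrections) != len(responses):
        raise ValueError("need one correction count per response count")
    return [c / r for c, r in zip(corrections, responses) if r > 0]


def trend_rising(rates: list[float], window: int = 4, min_increase: float = 0.02) -> bool:
    """Compare the mean of the last `window` weeks with the `window` weeks before."""
    if len(rates) < 2 * window:
        return False  # not enough history to judge a trend
    recent = mean(rates[-window:])
    baseline = mean(rates[-2 * window:-window])
    return recent - baseline >= min_increase


# Invented example data: manual corrections per week out of 200 responses.
corrections = [10, 12, 9, 11, 14, 16, 17, 19]
responses = [200] * 8

rates = weekly_rates(corrections, responses)
rising = trend_rising(rates)
print(f"correction rate rising: {rising}")
```

Averaging over several weeks dampens the noise of individual clicks, which is why the sketch never alerts on a single bad week.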
Versions per response. Model version, prompt version, data state – attached to every response. Without that trail you cannot attribute a quality dip: was it the provider's silent update, Tuesday's prompt tweak, or the world?
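What such a trail might look like per response, as a minimal sketch: the ResponseRecord fields and the version strings below are assumptions for illustration, not a prescribed schema or real provider identifiers.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ResponseRecord:
    """Everything needed to attribute a quality dip to a cause later."""
    response_id: str
    model_version: str   # which model produced the output
    prompt_version: str  # which prompt template was active
    data_state: str      # which snapshot of knowledge base or index was used
    created_at: str
    output: str


def make_record(
    response_id: str,
    output: str,
    *,
    model_version: str,
    prompt_version: str,
    data_state: str,
) -> ResponseRecord:
    """Attach the three versions and a UTC timestamp to a response."""
    return ResponseRecord(
        response_id=response_id,
        model_version=model_version,
        prompt_version=prompt_version,
        data_state=data_state,
        created_at=datetime.now(timezone.utc).isoformat(),
        output=output,
    )


# Hypothetical version labels, chosen for this example only.
record = make_record(
    "r-001",
    "routine",
    model_version="provider-model-2026-01",
    prompt_version="triage-v7",
    data_state="kb-2026-02-01",
)
log_line = json.dumps(asdict(record), sort_keys=True)
print(log_line)
```

With one such line per response, a dip can be grouped by model_version, prompt_version or data_state to see which of the three changed when quality fell.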
The drift monitoring check for your AI feature
Without an eval set and thresholds you learn about drift from support tickets. Start with the eval set – it is half the battle.
From alert to action: thresholds, rollback, retraining
Monitoring without a response path is decoration. Define two thresholds per core metric: amber (“observe, clarify cause”) and red (“intervene”). Behind red, the first move is rollback – to the last good model–prompt combination, in minutes not days. Only then comes diagnosis: inspect eval details, cluster cases, determine the drift's cause.
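The two-threshold logic and the rollback-first rule can be sketched as follows. The threshold values 0.85 and 0.80, the Config fields and the function names are assumptions for illustration; the sketch also assumes a metric where higher is better, such as an eval score.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GREEN = "green"
    AMBER = "amber"  # observe, clarify cause
    RED = "red"      # intervene


@dataclass(frozen=True)
class Thresholds:
    """Thresholds for a higher-is-better metric, so red lies below amber."""
    amber: float
    red: float

    def __post_init__(self) -> None:
        if self.red > self.amber:
            raise ValueError("red threshold must not exceed amber threshold")


@dataclass(frozen=True)
class Config:
    """A model-prompt combination that can be activated as a unit."""
    model_version: str
    prompt_version: str


def classify(score: float, thresholds: Thresholds) -> Status:
    """Map a metric value to green, amber or red."""
    if score < thresholds.red:
        return Status.RED
    if score < thresholds.amber:
        return Status.AMBER
    return Status.GREEN


def next_config(status: Status, active: Config, last_good: Config) -> Config:
    """On red, roll back to the last good combination; otherwise keep the active one."""
    return last_good if status is Status.RED else active


# Hypothetical values for this example.
thresholds = Thresholds(amber=0.85, red=0.80)
active = Config("provider-model-2026-02", "triage-v8")
last_good = Config("provider-model-2026-01", "triage-v7")

status = classify(0.78, thresholds)
chosen = next_config(status, active, last_good)
print(status.value, chosen)
```

Keeping model and prompt in one Config object is a deliberate choice in this sketch: rollback then restores the pair that was last proven together, not just one half of it.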
Retraining and prompt revision are then product decisions with cost and risk – treat them like features: the trigger is a breached threshold plus root-cause analysis, not a quarterly calendar. And in regulated environments this closes the loop to the EU AI Act: logging, versioning and documented interventions are exactly the artefacts high-risk systems must produce anyway.
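The trigger rule can be written down as a small gate. This is a sketch under assumptions: DriftIncident and retraining_justified are hypothetical names, and a non-empty root-cause note stands in for a real root-cause analysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DriftIncident:
    """A drift alert together with the outcome of its diagnosis."""
    metric: str
    threshold_breached: bool
    root_cause: Optional[str] = None  # filled in after diagnosis


def retraining_justified(incident: DriftIncident) -> bool:
    """Retraining needs both a breached threshold and a documented root cause."""
    has_cause = bool(incident.root_cause and incident.root_cause.strip())
    return incident.threshold_breached and has_cause


# Invented incidents for illustration.
calendar_only = DriftIncident("eval_score", threshold_breached=False)
undiagnosed = DriftIncident("eval_score", threshold_breached=True)
diagnosed = DriftIncident(
    "eval_score",
    threshold_breached=True,
    root_cause="new product names missing from training data",
)

for incident in (calendar_only, undiagnosed, diagnosed):
    print(incident.metric, retraining_justified(incident))
```

The stored incident doubles as documentation of the intervention, which is the kind of record the logging and versioning requirements call for.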
Recommendations
- Freeze an eval set – today. 50 to 200 representative cases with expected results, automatically run against production weekly. Without a fixed star, every drift discussion is a matter of taste.
- Log versions per response. Model, prompt, data state. It is one line of code when building and a weekend of archaeology when retrofitting.
- Build the rollback before the retraining. The fastest answer to drift is the last good configuration. Retraining is the second answer – with a trigger, a budget and eval proof.
- Bring drift into product review. One chart, three numbers: eval score, correction rate, open drift alerts. What is in review gets prioritised – what rots in the ops dashboard does not.
Scope & caveats
- Vela et al.'s 91 per cent comes from classic ML models (including random forests and neural networks) on tabular datasets, not LLM products. The mechanics – world drifts, quality falls – transfer; the specific rate is context-dependent.
- User feedback is a biased signal: the dissatisfied click more than the satisfied, and power users dominate. Use trends rather than absolutes and always combine with the eval set.
The takeaway
An AI feature is never finished – it is merely good right now. Build the eval set, version logging and rollback into the operate phase and you notice the quiet decay before your customers do, turning drift into maintenance instead of a breach of trust.