AI & product management · 11 June 2026 · 9 min read · Includes interactive tool

Discovery stays human, execution becomes the spec: how product managers work with AI in brownfield

Relevant phases: 01 Discover · 02 Define · 03 Build · 04 Operate · Fast Lane

TL;DR

  • The line isn't “human or AI” but discovering versus executing: discovery and strategy stay human (Cagan / SVPG), execution belongs to AI agents – measured against a spec agreed up front (SDD, BMAD, OpenSpec).
  • Gates as a spec step don't cost speed, they save it. The evidence: METR measures a 19 per cent slowdown for experienced devs on unstructured AI tools, GitClear an eightfold rise in code duplication, DORA 7.2 per cent less delivery stability per 25 per cent more AI adoption.
  • The better the models get, the more the bottleneck shifts from coding to deciding what gets built – and defining it precisely. That is the product manager's core work; in regulated brownfield it's mandatory, not optional.

Key findings

  • “Vibe coding” works for prototypes and falls apart in production systems. That is exactly the gap spec-driven development closes.
  • The four frameworks solve different problems: SVPG teaches human teams how to think and discover. SDD/Spec Kit ensures execution quality. BMAD orchestrates specialised AI agents with an audit trail for enterprise complexity. OpenSpec delivers fast single-feature cycles with a living spec file.
  • Brownfield is more dangerous than greenfield for AI: undocumented constraints, implicit business rules, regression risk. This is precisely where code context and a gate before the code pay off.
  • The PM role shifts from requirements administrator to orchestrator: human discovery up front, AI execution at the back, precise definition as the link between them.

The moment we're in right now

AI coding tools are everywhere. Cursor, Claude Code, GitHub Copilot, Codex. Every developer has at least one of them open. The demos are impressive. One prompt and an app appears. The promise: ten times faster.

And then you land in enterprise reality. You're the CPO at a bank, an insurer, a medtech or industrial company in the DACH region. Your codebase isn't empty. It is fifteen years old, carries business rules nobody fully remembers any more, and it sits under regulation. Here, “let's see what the AI spits out” is not a strategy. Here, it's a risk.

That is the tension this article is about. The hype says: let the AI do it. The reality of regulated brownfield environments says: not without a plan. The good news is that these two worlds don't contradict each other. You just need to know where the human leads and where the machine executes.

The thesis: four phases for a brownfield project, from the product manager's point of view

Teklens thinks of the software lifecycle in four phases: Discover, Define, Build, Operate. Plus a fast lane for small changes. That isn't just a process map – it is the new way a product manager works in the AI era.

“Code intelligence” is the lever here. In plain words: your code, your Jira and your Confluence become a searchable, semantic index an AI agent understands. Not keyword search, but a genuine grasp of why a decision was made the way it was in 2023. That is exactly the context missing in brownfield – and without it, any AI is just guessing. Or as the team puts it: without context an LLM is an intern that guesses; with context it is a team member that thinks along.

Here is how the thesis maps onto the four phases:

Discover (human, SVPG). This is where you decide which problem is worth solving. Continuous discovery, customer interviews, Cagan's four risks: value, usability, feasibility, business viability. You do not delegate this phase to an agent. Cagan is explicit here: discovery is the core, and in the AI era it becomes more important, not less. In his words from “AI Product Management 2 Years In” (December 2024), with generative AI the PM role becomes “more essential and more difficult, not less”.

Define (the gate). This is where the spec is created. The validated problem becomes a precise, reviewed contract: what gets built, why, with which acceptance criteria, under which constraints. This spec is agreed before a single line of code exists. That is the gate. It is the most important new move in the PM toolkit.
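To make the gate concrete, here is a minimal sketch in Python. It assumes a spec can be reduced to the four elements named above (what, why, acceptance criteria, constraints) plus a human sign-off. The `Spec` class, its field names and the `gate_findings` helper are hypothetical illustrations, not part of Spec Kit, BMAD, OpenSpec or Teklens.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """Hypothetical spec record; the field names are illustrative only."""
    problem: str                                               # what gets built
    rationale: str                                             # why it gets built
    acceptance_criteria: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    approved_by: list[str] = field(default_factory=list)      # human reviewers


def gate_findings(spec: Spec) -> list[str]:
    """Return the reasons the spec may not pass the gate (empty list = pass)."""
    findings = []
    if not spec.problem.strip():
        findings.append("missing problem statement")
    if not spec.rationale.strip():
        findings.append("missing rationale")
    if not spec.acceptance_criteria:
        findings.append("no acceptance criteria")
    if not spec.constraints:
        findings.append("no constraints recorded")
    if not spec.approved_by:
        findings.append("not reviewed by a human")
    return findings


def may_build(spec: Spec) -> bool:
    """The agent is allowed to start only when the gate has no findings."""
    return not gate_findings(spec)


# Illustrative data: a half-written draft and a reviewed spec.
draft = Spec(problem="Add IBAN validation to the payment form", rationale="")
ready = Spec(
    problem="Add IBAN validation to the payment form",
    rationale="Invalid IBANs cause failed payouts",
    acceptance_criteria=["Invalid IBAN shows an inline error"],
    constraints=["No change to the payment API contract"],
    approved_by=["pm@example.com"],
)

print(gate_findings(draft))
print(may_build(ready))
```

The check is deliberately dumb: it cannot judge whether the acceptance criteria are good, only whether a human has written and approved them before the agent starts. The judgement stays with the product manager.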

Build (AI execution, SDD/BMAD/OpenSpec). Now the machine executes. The agent builds against the spec, not against a vague prompt. Which framework fits depends on the context: OpenSpec for lean single features, BMAD for heavyweight enterprise systems with an audit trail.

Operate. Operations, monitoring, regression protection. In brownfield this is the part that decides whether people trust the system. This is where it shows whether the change left the system stable.

Fast lane. Not every change needs the full cycle. A button, a copy fix, a small validation. That is what the fast lane is for: a light spec step, fast execution, still traceable.

Interactive tool: the cycle as four phases and one fast lane

Phases on the circle: Discover, Define, Build, Operate. Fast Lane: small fixes cut straight across the circle.

Example panel (Discover):

  • Problem: signals and knowledge scattered across 5+ systems and heads.
  • Teklens: bundles and prioritises by value vs. effort and risk.
  • Result: a qualified roadmap, not a wishlist.

Do gates really cost speed? The data says no

The most common objection to specs: they slow us down. The data says the opposite. Without structure, AI isn't faster – it is often slower.

The 2025 METR study is the hardest evidence here (Becker, Rush, Barnes, Rein, arXiv 2507.09089): a randomised controlled trial with 16 experienced open-source developers across 246 real tasks, using Cursor Pro and Claude 3.5/3.7 Sonnet. Beforehand, the developers themselves expected a 24 per cent speed-up; afterwards they still estimated a 20 per cent gain.

The actual result: “we find that allowing AI actually increases completion time by 19% – AI tooling slowed developers down.” 19 per cent slower. And they didn't even notice. They felt faster.

For its report “AI Copilot Code Quality 2025”, GitClear analysed 211 million changed lines of code from 2020 to 2024. Refactored (“moved”) lines fell from 24.1 to 9.5 per cent, copied code rose from 8.3 to 12.3 per cent, and duplicated blocks (five lines or more) rose eightfold in 2024.

Churn too – code changed again within two weeks – rose from 3.1 to 5.7 per cent. GitClear CEO Bill Harding puts it bluntly: “hastily added code is caustic to the teams expected to maintain it afterward.” More output, worse maintainability.

The 2024 DORA report (Accelerate State of DevOps, more than 39,000 respondents) closes the loop: for every 25 per cent increase in AI adoption, estimated delivery stability dropped by 7.2 per cent and throughput by 1.5 per cent. 39.2 per cent of respondents reported “little to no trust in AI-generated code”. According to DORA, the reason isn't “AI code is rubbish” but that AI makes it easier to produce larger change sets, and without discipline large change sets mean more risk.

The conclusion is not “less AI”. It is “more structure before the AI”. A gate that locks down the spec before the agent builds is exactly the discipline these studies call for. You don't lose speed – you prevent rework.

How structured specs become quality guardrails

Spec-driven development reverses the old order. For decades, code was king and the spec was disposable. SDD makes the specification the source of truth. GitHub sums it up with Spec Kit: “specifications don't serve code, code serves specifications.”

In practice that means: first the spec, then the plan, then small testable tasks, and only then the implementation by the agent. Spec Kit works with more than 30 AI coding agents and uses a “constitution” for non-negotiable project principles. That is the guardrail every agent aligns with, whether Copilot, Claude Code or Cursor. Spec Kit explicitly aims for a lean, living spec, not an 80-page requirements tome.

BMAD goes further. Instead of one agent, it orchestrates an ensemble of specialised personas: analyst, PM, architect, developer, QA, plus an orchestrator. Each persona produces a versioned artefact and hands over to the next. With Git versioning, every artefact becomes traceable – from the PRD through the architecture to the stories.

For regulated environments where code provenance matters, that is the decisive point: an audit trail showing who decided what, when and why. BMAD deliberately treats brownfield differently from greenfield – with a dedicated test-architect agent for regression risks, legacy dependencies and breaking changes. Importantly: BMAD doesn't replace Copilot, Cursor and friends – it orchestrates them.

OpenSpec by Fission-AI is the lightweight variant. One living spec, a cycle of explore, propose, apply, archive. Humans and AI agree on the spec before code is written (“Agree before you build”). No heavyweight phase-gate apparatus, but fluid iteration – and at the end the change lands in the archive for the audit history. Ideal when one developer plus one AI assistant is meant to ship a feature. OpenSpec deliberately positions itself as lighter than Spec Kit, which it describes as “thorough but heavyweight”.

The common denominator of all three: agreement before code. That is the difference between prompt engineering on the fly and a contract that execution is measured against.

Why brownfield in particular needs a gate before the code

Greenfield is easy for AI. An empty field, no legacy. Brownfield is dangerous. Established systems carry invisible contracts: assumptions about data flows, integration quirks, business rules in code nobody remembers. Agents often can't derive these, so they guess. Then things break quietly. One documented example: a team let Claude refactor an eight-year-old Django monolith and got clean, confident patches that silently broke integrations with external services. After several rollbacks they shelved the experiment.

Add to that the models' technical brownfield limits: context overflow on large files, forgetting across sessions, overlooked conventions. These are exactly the issues that a semantic code index and a written spec address.

In regulated industries the compliance burden comes on top. Audit trails are mandatory under HIPAA, GDPR, 21 CFR Part 11 and industry-specific rules. Who changed what in a system, and when, must be reconstructable – even months later. A traceable spec-to-code path is not a nice-to-have here; it is part of your defensibility.

That is exactly why the combination of code intelligence and a gate fits DACH banks, insurers and medtech so well. First understand the code context, then lock down the spec, then let the agent build, then protect against regressions. Characterisation tests before the AI refactor, narrow diffs, one intent per pull request. That keeps the system predictable under load – and at the end there is code that runs.
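A characterisation test pins down what the legacy code does today, quirks included, before anyone refactors it. The following Python sketch shows the idea under stated assumptions: `legacy_fee` is an invented stand-in for an old business rule, and `refactored_fee` stands for a clean-looking rewrite of the kind an agent might propose. Neither comes from a real system.

```python
def legacy_fee(amount_cents: int, customer_type: str) -> int:
    """Stand-in for an old, undocumented business rule (hypothetical)."""
    fee = amount_cents * 15 // 1000          # 1.5 % fee, rounded down
    if customer_type == "staff":
        fee = 0
    if fee < 50 and customer_type != "staff":
        fee = 50                             # implicit minimum fee nobody documented
    return fee


# Inputs chosen to hit the edges. The expected values are not designed:
# they are whatever the legacy code returns today.
CASES = [(0, "retail"), (1000, "retail"), (100_000, "retail"), (100_000, "staff")]


def record_baseline(func):
    """Capture current behaviour so it can be pinned before a refactor."""
    return {case: func(*case) for case in CASES}


def behaviour_diffs(baseline, candidate):
    """List every case where the candidate deviates from the recorded baseline."""
    return [
        (case, expected, candidate(*case))
        for case, expected in baseline.items()
        if candidate(*case) != expected
    ]


def refactored_fee(amount_cents: int, customer_type: str) -> int:
    """A 'clean' rewrite that silently drops the minimum fee."""
    return 0 if customer_type == "staff" else amount_cents * 15 // 1000


baseline = record_baseline(legacy_fee)
print(behaviour_diffs(baseline, legacy_fee))      # no deviations from itself
print(behaviour_diffs(baseline, refactored_fee))  # the two small-amount cases differ
```

The rewrite looks tidier and would pass a casual review, yet it changes the fee for small amounts. With the baseline recorded first, that shows up as a failing check and not as a production incident.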

The last and most important argument: the better the models, the more valuable the gates

Here comes the point that carries the whole thesis.

The models are getting better at a rapid pace. Claude Code now works autonomously over long stretches. In “Measuring AI agent autonomy in practice”, Anthropic documents that between October 2025 and January 2026 the longest work segments almost doubled, with the 99.9th percentile moving from under 25 minutes to over 45 minutes. Agents run across multiple context windows, leave progress logs and work on tasks that used to take days. The auto-approve rate climbs from around 20 per cent for newcomers to over 40 per cent for experienced users.

The naive conclusion would be: soon we won't need specs at all, the AI will do everything. That is the wrong way round.

The more autonomously an agent builds over hours, the more expensive a wrong assumption becomes. An agent that runs in the wrong direction for an hour because the goal was unclear produces an hour of damage.

Anthropic itself shows it: a frontier model like Opus 4.5, given only a vague prompt such as “build a clone of claude.ai”, fails to reach production quality. It attempts too much at once and loses context mid-implementation. What saves it is an initialiser step, structured progress logs and incremental progress. In other words: structure and a gate.

That is the real shift. Once coding becomes cheap and fast, typing is no longer the bottleneck. The bottleneck is the decision about what should be built – and the precise definition of that decision. Andrew Ng puts it succinctly (January 2025): “Writing software, especially prototypes, is becoming cheaper. This will lead to increased demand for people who can decide what to build. AI Product Management has a bright future!” That is exactly the product manager's job.

Cagan argues in the same direction: the product-owner role as a pure backlog administrator is at risk (“a very easy kind of thing for an assistant to make a big dent into”), while the real product creator becomes more valuable. The more the AI builds, the more human judgement counts – about what is worth building, and how precisely that is defined.

Put differently: better models don't make gate-based, spec-driven PM work obsolete. They make it the most important work there is.

The new role of the product manager

The PM becomes an orchestrator. Human discovery at the front, AI execution at the back, the precise spec in between. You no longer manage requirement lists – you decide on outcomes and define them so clearly that an autonomous agent cannot misread them. That is the bridge between Cagan's human-centred model and the AI-native frameworks: SVPG principles for discovery and strategy, then hand-over to OpenSpec (for agile, lean apps) or BMAD (for heavyweight enterprise systems) for execution.

And that brings us back to the beginning. The hype says: let the AI do it. Enterprise reality says: not without a plan. The contradiction dissolves as soon as you divide the work correctly. The AI builds. You decide what – and define it so precisely that the machine doesn't guess. In brownfield, under regulation, with a codebase you can't throw away, that is not the slow option. It is the only one that holds up under load.

Recommendations

  • Start with the define gate now. Introduce an agreed spec for every non-trivial change before an agent builds. It is the cheapest lever with the biggest effect.
  • Choose the framework by context: OpenSpec for lean single features and fast cycles, BMAD for complex enterprise systems with compliance and audit requirements, Spec Kit when you work across tools with many agents.
  • Protect discovery. Never delegate the decision about which problem to solve to an agent. Keep Cagan's four risks (value, usability, feasibility, business viability) as a checklist in the discover phase.
  • Safeguard brownfield changes. Characterisation tests before the AI refactor, one intent per pull request, narrow diffs, a regression harness around critical flows such as billing, auth and pricing. Rule of thumb: no characterisation tests, no AI refactor in that area.
  • Build the audit trail in from the start, not retroactively. In regulated environments the spec-to-code path is part of compliance, not an edge case.
  • Thresholds that change your decision. If your change-failure rate or rework rate rises after introducing AI, that is the signal to tighten the gate, not to abolish it. If it doesn't rise and lead time falls, you can open the fast lane to more classes of change.
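The threshold rule in the last recommendation can be written down as a small decision function. This is a sketch in Python, assuming a team tracks change-failure rate, rework rate and lead time per period. The `DeliveryMetrics` fields, the `tolerance` parameter and the returned labels are hypothetical, and the numbers in the usage lines are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryMetrics:
    """Hypothetical snapshot of team metrics for one period."""
    change_failure_rate: float  # share of changes that fail in production, 0..1
    rework_rate: float          # share of changes reworked shortly after merge, 0..1
    lead_time_days: float       # typical time from commit to production


def gate_adjustment(before: DeliveryMetrics, after: DeliveryMetrics,
                    tolerance: float = 0.0) -> str:
    """Compare the periods before and after AI adoption and suggest a move.

    `tolerance` is an assumed noise margin that a team would set itself.
    """
    quality_worse = (
        after.change_failure_rate > before.change_failure_rate + tolerance
        or after.rework_rate > before.rework_rate + tolerance
    )
    if quality_worse:
        return "tighten gate"        # rising failures or rework: more spec, not less
    if after.lead_time_days < before.lead_time_days:
        return "widen fast lane"     # quality held and delivery got faster
    return "hold"                    # no clear signal either way


# Invented numbers, for illustration only.
before = DeliveryMetrics(change_failure_rate=0.10, rework_rate=0.05, lead_time_days=6.0)
worse = DeliveryMetrics(change_failure_rate=0.14, rework_rate=0.05, lead_time_days=4.0)
better = DeliveryMetrics(change_failure_rate=0.10, rework_rate=0.04, lead_time_days=4.5)

print(gate_adjustment(before, worse))
print(gate_adjustment(before, better))
```

Quality is checked before speed on purpose: a shorter lead time never outweighs a rising failure or rework rate, which mirrors the order of the rule above.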

Scope & caveats

  • The METR study covered 16 developers in large, mature open-source repositories (averaging 22,000+ stars, 1m+ lines). The authors themselves stress that the result doesn't generalise to every context and that newer models may perform differently. It is strong evidence for mature brownfield codebases, not a universal verdict. The models tested (Claude 3.5/3.7) are now several generations old.
  • The DORA and GitClear figures are correlations and industry aggregates, not statements about your specific team. Use them as a warning signal, not a forecast.
  • SDD is not free. Critics such as Gojko Adzic warn that heavyweight spec processes could bring back the rigidity agile methods tried to escape. Some practitioners report that Spec Kit doesn't automatically lead to better code without additional rules and MCP servers. The answer is lean, living specs – not 80-page requirements tomes.
  • All the productivity and quality figures cited come from external sources (METR, GitClear, DORA, Anthropic) and are attributed to them. They are not Teklens results and not promises of specific speed gains.

The takeaway

Discovery stays human, execution becomes the spec, and the Teklens cycle of Discover, Define, Build, Operate gives that division of labour its frame. Introduce the define gate today and you have already pulled the most important lever for stable AI delivery in brownfield.


Simon Scheurer, Mathias Wegmüller, Marc Gasser