[Spec Kit Part 6] Implement & Converge — Execution, Convergence, and Proving You're Done
Turning a well-decomposed task list into real code with /speckit.implement, bridging tasks to GitHub issues with /speckit.taskstoissues, and answering 'are we actually done?' with /speckit.converge — how to trust AI-written code, followed end to end on the dq-monitor example.
In Part 5 we designed dq-monitor's architecture in plan.md, decomposed it into a dependency-ordered task list (tasks.md), and ran /speckit.analyze to filter out contradictions across spec, plan, and tasks. What we hold now is a "verified blueprint." But a blueprint is not a building. Part 6 asks one question — when AI turns these tasks into real code, how do we trust that code? And a harder one: what proves that "we're done" is actually true?
What you'll learn in this post
- How /speckit.implement executes tasks.md in dependency order to produce code, and where the human must step in
- SDD's answer to "how do you trust AI-written code?": traceability from code → task → requirement → constitution
- How /speckit.taskstoissues moves the task list into GitHub issues, bridging SDD with your team's existing workflow (issues, PRs, boards)
- How /speckit.converge assesses the codebase against the artifacts to recover residual work and close the question of "are we actually done?"
- Trust and verification practices, plus the common pitfall of mistaking implement for the finish line
This is Part 6 of the Spec Kit series. It assumes the plan.md and tasks.md for dq-monitor from Part 5 are ready.
1. Implement Is Not "Fire and Forget"
The definition of /speckit.implement is simple: execute all tasks to build the feature according to the plan. The AI agent reads tasks.md top to bottom and turns each task into code in dependency order. T001 builds the data model, the checks engine tasks (T007 onward) build on it, alert routing (T014 onward) layers on top of that, and so on.
First, a common misconception worth breaking. implement is not the "hit enter and go grab coffee" command. It looks like autopilot, but in SDD the core value of implement is that code flows out in human-reviewable units. Because tasks are split small, the AI produces one task at a time, and the human reviews one task at a time. Keeping the verification loop tight — that is the human's job in the implement phase.
"How Do You Trust AI-Written Code?"
The question running through this series reaches its peak here. Vibe coding could not answer it. Even when the code looked plausible, there was no baseline to check whether it satisfied the original intent (the "unverifiability" failure mode from Part 1).
SDD's answer lies not in the quality of the code itself but in traceability.
Every line of code traces back to a task, every task traces back to a requirement, and every requirement sits under a principle in the constitution.
With this chain in place, "do you trust this code?" becomes a verifiable question: which task does this code implement, which requirement does that task satisfy, and does that requirement conform to our principles? Trust comes not from a vague feeling but from a path you can follow.
| Question | Vibe coding | SDD (implement) |
|---|---|---|
| Why is this code here? | "The AI wrote it that way" | Implements T014 → requirement FR-007 |
| Does it behave correctly? | Looks fine, I think | The test named in the task passes |
| Did we miss anything? | No way to know | Tracked via tasks.md checkboxes |
| Did it break a principle? | No baseline to compare | Checked against constitution gates |
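The chain in the table can be made concrete with a small lookup. This is a minimal sketch, not part of Spec Kit: the `Requirement` and `Task` classes, the `trace` helper, the file path, and the principle name "Observability first" are all hypothetical. Only the T014 → FR-007 link comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    req_id: str     # e.g. "FR-007"
    principle: str  # constitution principle it sits under (hypothetical label)


@dataclass(frozen=True)
class Task:
    task_id: str  # e.g. "T014"
    req_id: str   # requirement the task satisfies


def trace(path: str, code_to_task: dict[str, str],
          tasks: dict[str, Task], reqs: dict[str, Requirement]) -> list[str]:
    """Follow code -> task -> requirement -> principle.

    A missing link raises KeyError, which is the point: untraceable
    code is surfaced instead of silently accepted.
    """
    task = tasks[code_to_task[path]]
    req = reqs[task.req_id]
    return [path, task.task_id, req.req_id, req.principle]


# Hypothetical sample data for illustration only.
reqs = {"FR-007": Requirement("FR-007", "Observability first")}
tasks = {"T014": Task("T014", "FR-007")}
code_to_task = {"src/dq_monitor/alerts/router.py": "T014"}

chain = trace("src/dq_monitor/alerts/router.py", code_to_task, tasks, reqs)
print(" -> ".join(chain))
```

A file that no task claims fails the lookup, so "why is this code here?" always has either an answer or an explicit error.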
2. Building dq-monitor Module by Module
Part 5's tasks.md arranged dq-monitor's work in dependency order. When you invoke /speckit.implement, the agent follows that order and stacks the modules up. The actual flow looks roughly like this.
$ /speckit.implement
▶ Loading specs/001-realtime-dq-monitor/tasks.md — 27 open tasks
[Phase 1] Foundation
T001 Data models: DataSource, Metric, Threshold, Alert ........ built → tests pass ✓
T002 Config loader (YAML) + schema validation ................. built → tests pass ✓
⏸ Checkpoint: awaiting human review (Phase 1 diff)
[Phase 2] Checks Engine
T007 Freshness check rule ..................................... built → tests pass ✓
T008 Completeness / null-rate check rule ...................... built → tests pass ✓
T009 Anomaly (z-score) check rule ............................. built → tests pass ✓
T010 Check scheduler (periodic run) ........................... built → tests pass ✓
⏸ Checkpoint: awaiting human review (Phase 2 diff)
[Phase 3] Alert Routing
T014 Alert router (severity → channel mapping) ................ built → tests pass ✓
T015 Slack / Email delivery adapters .......................... built → tests pass ✓
...
Each ⏸ Checkpoint is not an accident. Every time one layer (phase) of the dependency graph finishes, execution pauses so the human can review that layer's diff all at once. If the data model is wrong, you want to catch it before the entire checks engine is built on top of it. Stack higher layers on a flawed foundation, and the cost of fixing it grows with every layer that depends on it.
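The idea of pausing after each layer of the dependency graph can be sketched with Python's standard `graphlib` module. This is an illustration under assumptions, not how /speckit.implement is built: the `deps` map below is a hypothetical, reduced dependency graph, and `layers` is a helper name chosen for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: task -> set of tasks it depends on.
deps = {
    "T001": set(),
    "T002": set(),
    "T007": {"T001", "T002"},
    "T008": {"T001", "T002"},
    "T009": {"T001", "T002"},
    "T014": {"T007", "T008", "T009"},
}


def layers(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into layers; every task in a layer has all dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)  # marking a layer done unlocks the next one
    return result


for number, layer in enumerate(layers(deps), start=1):
    print(f"layer {number}: {layer}  <- review checkpoint after this layer")
```

Each printed layer corresponds to one review checkpoint: nothing in a later layer starts until the earlier layer's diff has been looked at.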
Watching One Task Become Code
The abstract flow only goes so far, so let's look at how a single task actually becomes code and a test. Take T007 — Freshness check rule. Its original entry in tasks.md read:
- [ ] T007 [P] Implement freshness check rule
  - Input: DataSource.last_updated_at, Threshold.max_staleness_minutes
  - Output: CheckResult(status: OK|WARN|FAIL, observed, expected)
  - Must satisfy FR-003 (alert when freshness threshold exceeded)
  - Edge cases: last_updated_at == now, last_updated_at is null
When the agent implements this task, the following code and test emerge. Note that the requirement ID (FR-003) and the edge cases written in the task carry straight through into the code and tests.
# src/dq_monitor/checks/freshness.py
from datetime import datetime, timezone

from dq_monitor.models import CheckResult, CheckStatus, DataSource, Threshold


def check_freshness(source: DataSource, threshold: Threshold, *, now: datetime) -> CheckResult:
    """Compare a data source's freshness against its threshold. (FR-003)"""
    if source.last_updated_at is None:
        # Edge case: no update history is treated as FAIL.
        return CheckResult(
            status=CheckStatus.FAIL,
            observed=None,
            expected=threshold.max_staleness_minutes,
            detail="last_updated_at is null",
        )

    staleness_min = (now - source.last_updated_at).total_seconds() / 60.0
    if staleness_min > threshold.max_staleness_minutes:
        status = CheckStatus.FAIL
    elif staleness_min > threshold.max_staleness_minutes * 0.8:
        status = CheckStatus.WARN  # pre-emptive warning past 80% of threshold
    else:
        status = CheckStatus.OK
    return CheckResult(
        status=status,
        observed=round(staleness_min, 2),
        expected=threshold.max_staleness_minutes,
    )

# tests/checks/test_freshness.py
from datetime import datetime, timedelta, timezone

from dq_monitor.checks.freshness import check_freshness
from dq_monitor.models import CheckStatus, DataSource, Threshold

NOW = datetime(2026, 6, 26, 12, 0, tzinfo=timezone.utc)
THRESHOLD = Threshold(max_staleness_minutes=60)


def _source(updated_at):
    return DataSource(name="orders", last_updated_at=updated_at)


def test_fresh_data_is_ok():
    r = check_freshness(_source(NOW - timedelta(minutes=10)), THRESHOLD, now=NOW)
    assert r.status == CheckStatus.OK


def test_stale_data_fails():  # FR-003 core path
    r = check_freshness(_source(NOW - timedelta(minutes=90)), THRESHOLD, now=NOW)
    assert r.status == CheckStatus.FAIL
    assert r.observed == 90.0


def test_near_threshold_warns():
    r = check_freshness(_source(NOW - timedelta(minutes=55)), THRESHOLD, now=NOW)
    assert r.status == CheckStatus.WARN


def test_null_last_updated_fails():  # edge case spelled out in tasks.md
    r = check_freshness(_source(None), THRESHOLD, now=NOW)
    assert r.status == CheckStatus.FAIL

$ pytest tests/checks/test_freshness.py -q
.... [100%]
4 passed in 0.06s

This small cycle (task → code → test → pass), repeated 27 times, becomes dq-monitor. And the test in each cycle is the gate. If the test goes red, you do not move on to the next task. Even when the AI says "all done," it is not done if it does not clear the gate.
Key point: in the implement phase the human's job is not to re-write code line by line, but to review in task-sized chunks, treat tests as gates, and intervene only on the tasks that drift. The smaller the review unit, the greater the trust.
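The "test as gate" rule can be sketched as a function that ticks a task's checkbox in tasks.md only when its tests pass. This is a hypothetical helper, not a Spec Kit command: `complete_task` and the `run_tests` callable are names invented here, and the `- [ ] T007 ...` line format is taken from the task entry shown earlier.

```python
import re
from typing import Callable


def complete_task(tasks_md: str, task_id: str, run_tests: Callable[[], bool]) -> str:
    """Tick the task's checkbox only if its tests pass; otherwise leave it open."""
    if not run_tests():
        return tasks_md  # red test: the gate stays closed, nothing is marked done
    pattern = re.compile(rf"^- \[ \] ({re.escape(task_id)}\b)", flags=re.MULTILINE)
    return pattern.sub(r"- [x] \1", tasks_md, count=1)


# Sample tasks.md content for illustration.
md = (
    "- [ ] T007 [P] Implement freshness check rule\n"
    "- [ ] T008 Completeness / null-rate check rule\n"
)

green = complete_task(md, "T007", run_tests=lambda: True)   # tests passed
red = complete_task(md, "T008", run_tests=lambda: False)    # tests failed
print(green)
```

In a real setup `run_tests` would likely wrap a test runner, for example by checking that `subprocess.run(["pytest", path]).returncode == 0`. The lambdas here only stand in for that result.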
3. taskstoissues — Bridging SDD into the Team's Daily Workflow
tasks.md is an excellent execution plan, but as long as it lives in a single text file it stays apart from the team's day-to-day. Most teams already slice work into GitHub issues, review via PRs, and track progress on project boards. /speckit.taskstoissues closes that gap — it converts the task list into GitHub issues for tracking.
This is not a simple copy-paste but a bridge between SDD and your existing collaboration tools. Once each task becomes an issue, it gains assignees, labels, milestones, and a board column, and a PR can close it with Closes #N. For teams already working on GitHub, it is an utterly natural connection.
A generated issue looks roughly like this. The key is that the issue body back-references the original task and the spec/plan that task points to.
Issue #14 — [T014] Alert router: severity → channel mapping
Labels: spec-kit, phase-3-alert-routing, task
Milestone: dq-monitor v1
---
## Task
Implement a router that routes check-result severity (WARN/FAIL) to alert channels.
## Source (Traceability)
- Task: `specs/001-realtime-dq-monitor/tasks.md` → T014
- Requirement: `spec.md` → FR-007 (per-severity channel branching)
- Design: `plan.md` → §4.2 alert pipeline
## Definition of Done
- [ ] WARN → #data-quality Slack channel
- [ ] FAIL → #data-quality + on-call email
- [ ] Suppress duplicate alerts within 5 minutes (dedup)
- [ ] Unit tests: 4 severity routing branches
## Dependencies
- Blocked by: #10 (check scheduler), #13 (Alert model)
Issues built this way combine SDD's traceability with GitHub's visibility. Glance at the board and you see "Phase 3 alert routing: 2 of 5 left"; open any issue and it records why the work is needed (which FR it satisfies). Managers get progress and developers get context at the same time.
| Daily tool | What taskstoissues fills in |
|---|---|
| Issue tracker | Each task = a trackable issue (assignee, label, milestone) |
| Pull Request | Closes #14 links code to task |
| Project board | Phase labels become columns / swimlanes |
| Standup | Report "which FR is blocked" at the issue level |
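To show what "the issue body back-references the original task" means mechanically, here is a minimal sketch that renders an issue title, body, and labels from a task record. It does not reproduce what /speckit.taskstoissues actually generates: `TaskRecord` and `render_issue` are hypothetical names, and the sample values are copied from the example issue above.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    task_id: str
    title: str
    requirement: str
    phase_label: str
    done_criteria: list[str] = field(default_factory=list)


def render_issue(task: TaskRecord, tasks_path: str) -> dict:
    """Build an issue title, body, and labels that point back to the task and FR."""
    lines = [
        "## Source (Traceability)",
        f"- Task: `{tasks_path}` → {task.task_id}",
        f"- Requirement: `spec.md` → {task.requirement}",
        "",
        "## Definition of Done",
    ]
    lines += [f"- [ ] {criterion}" for criterion in task.done_criteria]
    return {
        "title": f"[{task.task_id}] {task.title}",
        "body": "\n".join(lines),
        "labels": ["spec-kit", task.phase_label, "task"],
    }


t014 = TaskRecord(
    task_id="T014",
    title="Alert router: severity → channel mapping",
    requirement="FR-007",
    phase_label="phase-3-alert-routing",
    done_criteria=[
        "WARN → #data-quality Slack channel",
        "FAIL → #data-quality + on-call email",
    ],
)
issue = render_issue(t014, "specs/001-realtime-dq-monitor/tasks.md")
print(issue["title"])
print(issue["body"])
```

Keeping the task ID in the title and the FR in the body means the traceability chain survives even if someone only ever reads the issue tracker.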
taskstoissues is an optional step. For a quick solo iteration, tasks.md alone is enough. But when a team collaborates on GitHub and has to report progress outward, this bridge dissolves SDD from a "separate ritual" into "the work we were already doing."
4. Converge — The Last Question That Closes "Are We Actually Done?"
Once implement finishes and every test is green, are we done? Here lies SDD's subtlest and most valuable step. /speckit.converge assesses the codebase against the artifacts (spec, plan, tasks) and appends remaining work as new tasks.
Why is this step needed? Because subtle drift accumulates during implement. As you implement tasks, an edge case from the spec quietly slips out, a non-functional requirement from the plan (e.g., "p95 latency under 200ms") never makes it into code, and a stopgap (# TODO: retry logic) lingers. Every individual task's test passes, yet viewed across the whole spec, there are gaps.
converge asks not "did each task pass?" but "did everything the spec promised actually make it into the code?" And rather than just reporting the gaps and stopping, it recovers them as new tasks in tasks.md. That is how it closes the loop back toward the spec.
$ /speckit.converge
Assessing codebase ↔ artifacts...
spec.md: 12 functional requirements (FR), 4 non-functional (NFR)
plan.md: 6 design decisions
tasks.md: 27 tasks (27 complete)
code: 41 modules, 38 test files
── Convergence Report ──────────────────────────────────
✓ FR-001..FR-012 implemented and tested
✓ NFR-001 (check interval ≤ 1 min) confirmed in scheduler
⚠ 3 drifts found → recovered as new tasks:
[NEW] T028 NFR-002 unmet: no measurement/instrumentation of alert p95 latency
basis: plan.md §5 states "p95 < 200ms", no metric instrumentation in code
[NEW] T029 FR-009 partial: alert dedup applied to Slack only, email missing
basis: spec.md FR-009 "5-min dedup on all channels", not applied to email adapter
[NEW] T030 residual TODO: src/.../retry.py:42 backoff not implemented
basis: plan.md §4.2 designs "exponential backoff retry", incomplete
Added 3 tasks to tasks.md. Continue with /speckit.implement.
────────────────────────────────────────────────────────
The question this report answers is exactly "are we really done?" It answers not with memory or a "feeling" but by comparing the code against the objective baseline of the spec. And the residual tasks it finds (T028 through T030) become inputs to implement again. This small loop runs until the gap count reaches zero.
converge is not a step for "finding bugs" but for "aligning the promise with the result." Each test can pass and the result can still diverge from the spec as a whole — catching that divergence and feeding it back into the spec is how SDD proves completion.
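The difference between "each task passed" and "everything promised is in the code" can be illustrated with a set comparison. This is a simplified sketch and makes no claim about how /speckit.converge works internally: `find_gaps` and `append_tasks` are hypothetical helpers, and the sample data is a reduced version of the report above, so the generated task numbers only loosely mirror it.

```python
def find_gaps(spec_reqs: set[str], covered_reqs: set[str], todos: list[str]) -> list[str]:
    """List requirements with no verified implementation, plus leftover TODOs."""
    gaps = [f"{req} unmet" for req in sorted(spec_reqs - covered_reqs)]
    gaps += [f"residual TODO: {todo}" for todo in todos]
    return gaps


def append_tasks(last_task_number: int, gaps: list[str]) -> list[str]:
    """Turn each gap into a new open task line, numbered after the last task."""
    return [
        f"- [ ] T{number:03d} {gap}"
        for number, gap in enumerate(gaps, start=last_task_number + 1)
    ]


# Reduced sample: two NFRs promised, only one confirmed in code, one TODO left.
spec_reqs = {"NFR-001", "NFR-002"}
covered_reqs = {"NFR-001"}
todos = ["retry backoff not implemented"]

new_tasks = append_tasks(27, find_gaps(spec_reqs, covered_reqs, todos))
for line in new_tasks:
    print(line)
```

All 27 original tasks can be checked off and this comparison still produces work, which is exactly the situation the convergence report describes.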
5. The Implement–Converge Loop at a Glance
Here is the whole picture of how implement, converge, and taskstoissues mesh together. The key is that it is one closed loop. As long as converge finds gaps, we return to implement.
In words, the loop is:
1. implement turns tasks.md tasks into code and tests (task-level review, test gates).
2. Optionally, taskstoissues exports tasks to GitHub issues, wiring them to the team's PRs and board.
3. converge assesses the finished code against spec, plan, and tasks.
4. If there are gaps, they are recovered as new tasks appended to tasks.md → back to step 1.
5. When the gap count hits zero, the work is done with the code proven to satisfy the spec.
This loop redeems, at the end, the promise of the "verifiable spec" from Part 1. Done is not "the moment the AI declares it" but "the moment spec and code are aligned."
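The closed loop can be written down as control flow. The sketch below is an illustration only: `run_until_converged` is a hypothetical driver, the `implement` and `converge` callables stand in for the two commands, and the `max_rounds` guard is an assumption added here so a loop that never converges fails loudly.

```python
from typing import Callable


def run_until_converged(
    open_tasks: list[str],
    implement: Callable[[list[str]], None],
    converge: Callable[[], list[str]],
    max_rounds: int = 5,  # assumed safety limit, not a Spec Kit setting
) -> int:
    """Alternate implement and converge until converge reports no gaps.

    Returns the number of rounds used; raises if gaps remain after max_rounds.
    """
    for round_no in range(1, max_rounds + 1):
        implement(open_tasks)
        open_tasks = converge()  # gaps come back as new tasks
        if not open_tasks:
            return round_no
    raise RuntimeError(f"still {len(open_tasks)} gap(s) after {max_rounds} rounds")


# Stand-ins: the first converge finds three gaps, the second finds none.
implemented: list[str] = []
gaps_by_round = iter([["T028", "T029", "T030"], []])

rounds = run_until_converged(
    ["T001", "T002"],
    implement=implemented.extend,
    converge=lambda: next(gaps_by_round),
)
print(rounds, implemented)
```

The exit condition is the empty gap list, not the end of the first implement pass, which is the distinction the loop exists to enforce.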
6. Trust & Verification Practices — and Pitfalls to Avoid
Running implement and converge well ultimately comes down to a few habits. Where the tooling does not enforce discipline, the human must.
Practices to keep
| Habit | Why it matters |
|---|---|
| Small, reviewable units | Small tasks make small diffs, and small diffs actually get reviewed. One or two tasks per PR. |
| Tests as gates | Not "looks like it passes" but "the test passes." Don't stack the next task on a red test. |
| Maintain traceability | Put the task ID and FR in commit/PR messages so the code → task → requirement path never breaks. |
| Don't skip converge | Even when every implement test passes, it isn't "done" until converge. |
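The "maintain traceability" habit is easy to check mechanically. Here is a minimal sketch of a commit or PR message check, assuming IDs follow the shapes used in this series (T plus three digits, FR- or NFR- plus three digits); the function names and sample messages are hypothetical.

```python
import re

# Assumed ID formats, inferred from the examples in this series.
TASK_ID = re.compile(r"\bT\d{3}\b")
REQ_ID = re.compile(r"\bN?FR-\d{3}\b")


def trace_ids(message: str) -> tuple[list[str], list[str]]:
    """Return the task IDs and requirement IDs mentioned in a message."""
    return TASK_ID.findall(message), REQ_ID.findall(message)


def is_traceable(message: str) -> bool:
    """A message is traceable when it names at least one task and one requirement."""
    task_ids, req_ids = trace_ids(message)
    return bool(task_ids) and bool(req_ids)


print(is_traceable("T014: route alerts by severity (FR-007)"))
print(is_traceable("fix retry logic"))
```

A check like this could run in a commit hook or CI step so that a message without a task and requirement reference is flagged before review.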
Common pitfalls
- Blindly accepting bulk output: when implement dumps 20 tasks at once and you scroll past "because the diff is long" and approve, traceability becomes decoration. Traceability you don't review is not traceability.
- Implementation without tests: no tests means no gate, and no gate means converge has no objective pass criterion to compare against. You slide back to "seems to work."
- Mistaking implement for the end: the most common and most expensive pitfall. Every task checkbox being filled is not the end. Tasks can pass individually while the spec as a whole still has gaps, and converge is the step that closes them. Stop at implement and you quietly slide back into Part 1's "unverifiability."
One line: trust is not "believing the AI" but "making the path verifiable." Small units, test gates, unbroken traceability, and converge — these four are what let AI-written code reach production.
Wrapping up
In Part 6 we turned the blueprint into a building. With /speckit.implement we moved dq-monitor's tasks into code and tests in dependency order — not "fire and forget" but reviewed in task-sized units with tests as gates. With /speckit.taskstoissues we wired those tasks to GitHub issues, PRs, and boards, dissolving SDD into the team's daily flow. And with /speckit.converge we compared code against the spec, recovered residual work, and closed the question of "are we actually done?"
The core never changes — the secret to trusting AI-written code is not a smarter model but a traceable path from code to task, to requirement, to constitution. Done is not a declaration but a proof.
In Part 7, the finale, we bring every step we've spread out so far into one. We'll close the series with an end-to-end case study building dq-monitor from zero to one — constitution through convergence — and a retrospective on the pitfalls and lessons we actually hit along the way.