Tags: spec-kit, spec-driven-development, implementation, github-issues, claude-code, ai

[Spec Kit Part 6] Implement & Converge — Execution, Convergence, and Proving You're Done

Turning a well-decomposed task list into real code with /speckit.implement, bridging tasks to GitHub issues with /speckit.taskstoissues, and answering 'are we actually done?' with /speckit.converge — how to trust AI-written code, followed end to end on the dq-monitor example.

Data Dynamics · June 16, 2026 · 14 min read

In Part 5 we designed dq-monitor's architecture in plan.md, decomposed it into a dependency-ordered task list (tasks.md), and ran /speckit.analyze to filter out contradictions across spec, plan, and tasks. What we hold now is a "verified blueprint." But a blueprint is not a building. Part 6 asks one question — when AI turns these tasks into real code, how do we trust that code? And a harder one: what proves that "we're done" is actually true?

What you'll learn in this post

How /speckit.implement executes tasks.md in dependency order to produce code, and where the human must step in

SDD's answer to "how do you trust AI-written code?" — traceability from code → task → requirement → constitution

How /speckit.taskstoissues moves the task list into GitHub issues, bridging SDD with your team's existing workflow (issues, PRs, boards)

How /speckit.converge assesses the codebase against the artifacts to recover residual work and close the question of "are we actually done?"

Trust and verification practices, plus the common pitfall of mistaking implement for the finish line

This is Part 6 of the Spec Kit series. It assumes the plan.md and tasks.md for dq-monitor from Part 5 are ready.

1. Implement Is Not "Fire and Forget"

The definition of /speckit.implement is simple — execute all tasks to build the feature according to the plan. The AI agent reads tasks.md top to bottom and turns each task into code in dependency order. T001 builds the data model, T002 writes the checks engine that uses it, alert routing layers on top of that, and so on.

First, a common misconception worth dispelling: implement is not a "hit enter and go grab coffee" command. It looks like autopilot, but in SDD the core value of implement is that code flows out in human-reviewable units. Because tasks are split small, the AI produces one task at a time, and the human reviews one task at a time. Keeping that verification loop tight is the human's job in the implement phase.
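
To make "dependency order" concrete, here is a minimal sketch using Python's standard `graphlib` module. The task IDs echo the dq-monitor example, but the dependency edges are assumptions made up for illustration; the real graph lives in tasks.md, and this is not a description of how Spec Kit itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency edges (task -> tasks it depends on).
# The IDs echo the dq-monitor example; the edges are illustrative assumptions.
DEPENDS_ON = {
    "T001": set(),       # data models
    "T002": {"T001"},    # config loader
    "T007": {"T001"},    # freshness check
    "T010": {"T007"},    # check scheduler
    "T014": {"T010"},    # alert router
}


def execution_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(order)
```

Several orders can be valid at once (T002 and T007 are interchangeable here), which is why the sketch only promises that each task comes after the tasks it depends on.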


"How Do You Trust AI-Written Code?"

The question running through this series reaches its peak here. Vibe coding could not answer it. Even when the code looked plausible, there was no baseline to check whether it satisfied the original intent (the "unverifiability" failure mode from Part 1).

SDD's answer lies not in the quality of the code itself but in traceability.

Every line of code traces back to a task, every task traces back to a requirement, and every requirement sits under a principle in the constitution.

With this chain in place, "do you trust this code?" becomes a verifiable question: which task does this code implement, which requirement does that task satisfy, and does that requirement conform to our principles? Trust comes not from a vague feeling but from a path you can follow.

Question	Vibe coding	SDD (implement)
Why is this code here?	"The AI wrote it that way"	Implements `T014` → requirement `FR-007`
Does it behave correctly?	Looks fine, I think	The test named in the task passes
Did we miss anything?	No way to know	Tracked via `tasks.md` checkboxes
Did it break a principle?	No baseline to compare	Checked against constitution gates
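
The chain in the table can be treated as a plain lookup. The sketch below is a hedged illustration only: the file path, the principle text, and the three mapping tables are hypothetical, and in a real project they would be derived from tasks.md, spec.md, and the constitution rather than hard-coded.

```python
# Hypothetical traceability tables, hard-coded here for illustration only.
FILE_TO_TASK = {"src/dq_monitor/alerts/router.py": "T014"}
TASK_TO_REQUIREMENT = {"T014": "FR-007"}
REQUIREMENT_TO_PRINCIPLE = {"FR-007": "Principle: alerts must reach an owner"}


def trace(path: str) -> list[str]:
    """Follow code -> task -> requirement -> principle, stopping at the first gap."""
    chain = [path]
    for table in (FILE_TO_TASK, TASK_TO_REQUIREMENT, REQUIREMENT_TO_PRINCIPLE):
        nxt = table.get(chain[-1])
        if nxt is None:
            break  # broken link: this is what a reviewer should flag
        chain.append(nxt)
    return chain


def is_fully_traced(path: str) -> bool:
    """True only when all four links (file, task, requirement, principle) exist."""
    return len(trace(path)) == 4


print(trace("src/dq_monitor/alerts/router.py"))
print(is_fully_traced("src/dq_monitor/scratch.py"))
```

A file that stops short of four links is code nobody can explain, which is the situation the "Vibe coding" column describes.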

2. Building dq-monitor Module by Module

Part 5's tasks.md arranged dq-monitor's work in dependency order. When you invoke /speckit.implement, the agent follows that order and stacks the modules up. The actual flow looks roughly like this.

$ /speckit.implement
 
▶ Loading specs/001-realtime-dq-monitor/tasks.md — 27 open tasks
 
[Phase 1] Foundation
  T001  Data models: DataSource, Metric, Threshold, Alert ........ built → tests pass ✓
  T002  Config loader (YAML) + schema validation ................. built → tests pass ✓
  ⏸  Checkpoint: awaiting human review (Phase 1 diff)
 
[Phase 2] Checks Engine
  T007  Freshness check rule ..................................... built → tests pass ✓
  T008  Completeness / null-rate check rule ...................... built → tests pass ✓
  T009  Anomaly (z-score) check rule ............................. built → tests pass ✓
  T010  Check scheduler (periodic run) ........................... built → tests pass ✓
  ⏸  Checkpoint: awaiting human review (Phase 2 diff)
 
[Phase 3] Alert Routing
  T014  Alert router (severity → channel mapping) ................ built → tests pass ✓
  T015  Slack / Email delivery adapters .......................... built → tests pass ✓
  ...

The ⏸ checkpoints are deliberate. Each time one layer (phase) of the dependency graph finishes, execution pauses so the human can review that layer's diff in one pass. If the data model is wrong, you want to catch it before the entire checks engine is built on top of it. Every layer stacked on a flawed foundation adds to the cost of fixing it.
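
The pause-per-phase idea can be sketched as a small driver loop. This is an illustration under assumptions, not Spec Kit's implementation: `build` and `approve` are hypothetical callbacks standing in for the agent and the human reviewer, and the phase grouping simply mirrors the sample output.

```python
from typing import Callable, Iterable

# Phases and task IDs echo the sample output; the grouping is illustrative.
PHASES = [
    ("Foundation", ["T001", "T002"]),
    ("Checks Engine", ["T007", "T008", "T009", "T010"]),
    ("Alert Routing", ["T014", "T015"]),
]


def run_with_checkpoints(
    phases: Iterable[tuple[str, list[str]]],
    build: Callable[[str], None],
    approve: Callable[[str], bool],
) -> list[str]:
    """Build tasks phase by phase; stop at the first phase the reviewer rejects."""
    completed: list[str] = []
    for name, tasks in phases:
        for task_id in tasks:
            build(task_id)
        if not approve(name):
            break  # do not stack the next layer on an unapproved one
        completed.append(name)
    return completed


built: list[str] = []
# Simulate a reviewer who rejects the Checks Engine diff.
done = run_with_checkpoints(PHASES, built.append, lambda name: name != "Checks Engine")
print(done, built)
```

In this run the alert-routing tasks are never built, because the layer beneath them was not approved.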

Watching One Task Become Code

The abstract flow only goes so far, so let's look at how a single task actually becomes code and a test. Take T007 — Freshness check rule. Its original entry in tasks.md read:

- [ ] T007 [P] Implement freshness check rule
      - Input: DataSource.last_updated_at, Threshold.max_staleness_minutes
      - Output: CheckResult(status: OK|WARN|FAIL, observed, expected)
      - Must satisfy FR-003 (alert when freshness threshold exceeded)
      - Edge cases: last_updated_at == now, last_updated_at is null

When the agent implements this task, the following code and test emerge. Note that the requirement ID (FR-003) and the edge cases written in the task carry straight through into the code and tests.

# src/dq_monitor/checks/freshness.py
from datetime import datetime
from dq_monitor.models import CheckResult, CheckStatus, DataSource, Threshold
 
 
def check_freshness(source: DataSource, threshold: Threshold, *, now: datetime) -> CheckResult:
    """Compare a data source's freshness against its threshold. (FR-003)"""
    if source.last_updated_at is None:
        # Edge case: no update history is treated as FAIL.
        return CheckResult(
            status=CheckStatus.FAIL,
            observed=None,
            expected=threshold.max_staleness_minutes,
            detail="last_updated_at is null",
        )
 
    staleness_min = (now - source.last_updated_at).total_seconds() / 60.0
    if staleness_min > threshold.max_staleness_minutes:
        status = CheckStatus.FAIL
    elif staleness_min > threshold.max_staleness_minutes * 0.8:
        status = CheckStatus.WARN  # pre-emptive warning past 80% of threshold
    else:
        status = CheckStatus.OK
 
    return CheckResult(
        status=status,
        observed=round(staleness_min, 2),
        expected=threshold.max_staleness_minutes,
    )

# tests/checks/test_freshness.py
from datetime import datetime, timedelta, timezone
from dq_monitor.checks.freshness import check_freshness
from dq_monitor.models import CheckStatus, DataSource, Threshold
 
NOW = datetime(2026, 6, 26, 12, 0, tzinfo=timezone.utc)
THRESHOLD = Threshold(max_staleness_minutes=60)
 
 
def _source(updated_at):
    return DataSource(name="orders", last_updated_at=updated_at)
 
 
def test_fresh_data_is_ok():
    r = check_freshness(_source(NOW - timedelta(minutes=10)), THRESHOLD, now=NOW)
    assert r.status == CheckStatus.OK
 
 
def test_stale_data_fails():  # FR-003 core path
    r = check_freshness(_source(NOW - timedelta(minutes=90)), THRESHOLD, now=NOW)
    assert r.status == CheckStatus.FAIL
    assert r.observed == 90.0
 
 
def test_near_threshold_warns():
    r = check_freshness(_source(NOW - timedelta(minutes=55)), THRESHOLD, now=NOW)
    assert r.status == CheckStatus.WARN
 
 
def test_null_last_updated_fails():  # edge case spelled out in tasks.md
    r = check_freshness(_source(None), THRESHOLD, now=NOW)
    assert r.status == CheckStatus.FAIL

$ pytest tests/checks/test_freshness.py -q
....                                                      [100%]
4 passed in 0.06s
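
Both files import from `dq_monitor.models`, which the post does not show. As a hedged guess at what T001 might have produced, the sketch below defines the four names with just the fields the freshness code and tests use. The field types, the enum values, and the choice of frozen dataclasses are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class CheckStatus(Enum):
    OK = "OK"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass(frozen=True)
class DataSource:
    name: str
    last_updated_at: Optional[datetime]  # None means "no update history"


@dataclass(frozen=True)
class Threshold:
    max_staleness_minutes: float


@dataclass(frozen=True)
class CheckResult:
    status: CheckStatus
    observed: Optional[float]  # staleness in minutes, or None when unknown
    expected: float            # the threshold the observation is compared to
    detail: str = ""           # optional human-readable explanation
```

With these in place, `check_freshness` accepts a `DataSource` and a `Threshold` exactly as the tests construct them, and `detail` can be omitted on the non-null path.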

This small cycle — task → code → test → pass — repeated 27 times becomes dq-monitor. And the test in each cycle is the gate. If the test goes red, you do not move on to the next task. Even when the AI says "all done," it is not done if it does not clear the gate.

Key point: in the implement phase the human's job is not to re-write code line by line, but to review in task-sized chunks, treat tests as gates, and intervene only on the tasks that drift. The smaller the review unit, the greater the trust.
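
The "tests as gates" rule can be sketched as a loop that stops at the first red test. This is an illustration, not part of Spec Kit: the pairing of task IDs with test paths is a hypothetical convention, and `pytest_gate` assumes pytest is installed in the current interpreter.

```python
import subprocess
import sys
from typing import Callable, Sequence


def pytest_gate(test_path: str) -> bool:
    """Run pytest on one path; the gate is green only if the exit code is 0."""
    result = subprocess.run([sys.executable, "-m", "pytest", test_path, "-q"])
    return result.returncode == 0


def run_until_red(
    tasks: Sequence[tuple[str, str]],  # (task_id, test_path) pairs
    gate: Callable[[str], bool] = pytest_gate,
) -> list[str]:
    """Return the task IDs that cleared their gate, stopping at the first failure."""
    cleared: list[str] = []
    for task_id, test_path in tasks:
        if not gate(test_path):
            break  # red test: do not move on to the next task
        cleared.append(task_id)
    return cleared
```

Passing the gate in as a parameter keeps the loop testable without spawning pytest, and makes the point that the gate is an objective check rather than the agent's own claim of completion.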

3. taskstoissues — Bridging SDD into the Team's Daily Workflow

tasks.md is an excellent execution plan, but as long as it lives in a single text file it stays apart from the team's day-to-day. Most teams already slice work into GitHub issues, review via PRs, and track progress on project boards. /speckit.taskstoissues closes that gap — it converts the task list into GitHub issues for tracking.

This is not a simple copy-paste but a bridge between SDD and your existing collaboration tools. Once each task becomes an issue, it can carry assignees, labels, a milestone, and a board column, and a PR can close it with Closes #N. For teams already working on GitHub, it fits the existing flow without extra ceremony.

A generated issue looks roughly like this. The key is that the issue body back-references the original task and the spec/plan that task points to.

Issue #14 — [T014] Alert router: severity → channel mapping
 
Labels: spec-kit, phase-3-alert-routing, task
Milestone: dq-monitor v1
---
## Task
Implement a router that routes check-result severity (WARN/FAIL) to alert channels.
 
## Source (Traceability)
- Task: `specs/001-realtime-dq-monitor/tasks.md` → T014
- Requirement: `spec.md` → FR-007 (per-severity channel branching)
- Design: `plan.md` → §4.2 alert pipeline
 
## Definition of Done
- [ ] WARN → #data-quality Slack channel
- [ ] FAIL → #data-quality + on-call email
- [ ] Suppress duplicate alerts within 5 minutes (dedup)
- [ ] Unit tests: 4 severity routing branches
 
## Dependencies
- Blocked by: #10 (check scheduler), #13 (Alert model)

Issues built this way combine SDD's traceability with GitHub's visibility. Glance at the board and you see "Phase 3 alert routing: 2 of 5 left"; open any issue and it records why the work is needed (which FR it satisfies). Managers get progress, developers get context — at the same time.
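
To show what "back-referencing" amounts to, here is a rough sketch that renders an issue title and body from one task. It is not what /speckit.taskstoissues does internally and it makes no GitHub API call; the `Task` shape and the `render_issue` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical in-memory shape for one tasks.md entry.
    task_id: str
    title: str
    requirement: str
    done_criteria: list[str] = field(default_factory=list)


def render_issue(task: Task, tasks_path: str) -> tuple[str, str]:
    """Return (title, body) for an issue that points back at its task and requirement."""
    title = f"[{task.task_id}] {task.title}"
    lines = [
        "## Source (Traceability)",
        f"- Task: `{tasks_path}` → {task.task_id}",
        f"- Requirement: `spec.md` → {task.requirement}",
        "",
        "## Definition of Done",
        *[f"- [ ] {item}" for item in task.done_criteria],
    ]
    return title, "\n".join(lines)


task = Task(
    "T014",
    "Alert router: severity → channel mapping",
    "FR-007",
    ["WARN → #data-quality Slack channel", "FAIL → #data-quality + on-call email"],
)
title, body = render_issue(task, "specs/001-realtime-dq-monitor/tasks.md")
print(title)
print(body)
```

The design point is that the traceability section is generated from the task, not typed by hand, so the issue cannot silently lose its link to the requirement.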

Daily tool	What taskstoissues fills in
Issue tracker	Each task = a trackable issue (assignee, label, milestone)
Pull Request	`Closes #14` links code to task
Project board	Phase labels become columns / swimlanes
Standup	Report "which FR is blocked" at the issue level

taskstoissues is an optional step. For a quick solo iteration, tasks.md alone is enough. But when a team collaborates on GitHub and has to report progress outward, this bridge dissolves SDD from a "separate ritual" into "the work we were already doing."

4. Converge — The Last Question That Closes "Are We Actually Done?"

Once implement finishes and every test is green, are we done? Here lies SDD's subtlest and most valuable step. /speckit.converge assesses the codebase against the artifacts (spec, plan, tasks) and appends remaining work as new tasks.

Why is this step needed? Because subtle drift accumulates during implement. As you implement tasks, an edge case from the spec quietly slips out, a non-functional requirement from the plan (e.g., "p95 latency under 200ms") never makes it into code, and a stopgap (# TODO: retry logic) lingers. Every individual task's test passes, yet viewed across the whole spec, there are gaps.

converge asks not "did each task pass?" but "did everything the spec promised actually make it into the code?" And rather than just reporting the gaps and stopping, it recovers them as new tasks in tasks.md. That is how it closes the loop back toward the spec.

$ /speckit.converge
 
Assessing codebase ↔ artifacts...
  spec.md: 12 functional requirements (FR), 4 non-functional (NFR)
  plan.md: 6 design decisions
  tasks.md: 27 tasks (27 complete)
  code: 41 modules, 38 test files
 
── Convergence Report ──────────────────────────────────
✓ FR-001..FR-008, FR-010..FR-012  implemented and tested
✓ NFR-001 (check interval ≤ 1 min)  confirmed in scheduler
⚠ 3 drifts found → recovered as new tasks:
 
  [NEW] T028  NFR-002 unmet: no measurement/instrumentation of alert p95 latency
         basis: plan.md §5 states "p95 < 200ms", no metric instrumentation in code
  [NEW] T029  FR-009 partial: alert dedup applied to Slack only, email missing
         basis: spec.md FR-009 "5-min dedup on all channels", not applied to email adapter
  [NEW] T030  residual TODO: src/.../retry.py:42 backoff not implemented
         basis: plan.md §4.2 designs "exponential backoff retry", incomplete
 
Added 3 tasks to tasks.md. Continue with /speckit.implement.
────────────────────────────────────────────────────────

The question this report answers is exactly "are we really done?" It answers not with memory or a "feeling" but by comparing the code against the objective baseline of the spec. And the residual tasks it finds (T028–T030) become inputs to implement again. This small loop runs until the gap count reaches zero.
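
A much cruder cousin of this comparison can be sketched in a few lines: collect the requirement IDs the spec promises and subtract the IDs that code or tests mention. This is only an illustration of the "promise versus result" idea, not the assessment the command performs. It assumes the convention seen in the freshness example, where docstrings and comments cite IDs such as FR-003, and the wording of the generated task lines is made up.

```python
import re

REQ_ID = re.compile(r"\b(?:FR|NFR)-\d{3}\b")


def find_gaps(spec_text: str, code_texts: dict[str, str]) -> list[str]:
    """Return requirement IDs that appear in the spec but in no code or test file."""
    promised = set(REQ_ID.findall(spec_text))
    referenced: set[str] = set()
    for text in code_texts.values():
        referenced.update(REQ_ID.findall(text))
    return sorted(promised - referenced)


def gaps_to_tasks(gaps: list[str], next_number: int) -> list[str]:
    """Turn each gap into a new unchecked tasks.md line."""
    return [
        f"- [ ] T{next_number + i:03d} Close gap: {req} has no implementation reference"
        for i, req in enumerate(gaps)
    ]


spec = "FR-003 alert when freshness threshold exceeded. NFR-002 alert p95 latency."
code = {"freshness.py": '"""Compare freshness against its threshold. (FR-003)"""'}
gaps = find_gaps(spec, code)
print(gaps_to_tasks(gaps, next_number=28))
```

A mention check like this cannot detect a partial implementation such as the FR-009 case in the report; it only catches requirements that nothing claims to implement.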

converge is not a step for "finding bugs" but for "aligning the promise with the result." Each test can pass and the result can still diverge from the spec as a whole — catching that divergence and feeding it back into the spec is how SDD proves completion.

5. The Implement–Converge Loop at a Glance

Here is the whole picture of how implement, converge, and taskstoissues mesh together. The key is that it is one closed loop. As long as converge finds gaps, we return to implement.


In words, the loop is:

1. implement turns tasks.md tasks into code and tests (task-level review, test gates).
2. Optionally, taskstoissues exports tasks to GitHub issues, wiring them to the team's PRs and board.
3. converge assesses the finished code against spec, plan, and tasks.
4. If there are gaps, they are recovered as new tasks appended to tasks.md → back to step 1.
5. When the gap count hits zero, the work is done with the code proven to satisfy the spec.
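
The steps above can be sketched as a single loop. The `implement` and `converge` callbacks are hypothetical stand-ins for the two commands, and the `max_rounds` guard is an assumption added so the sketch cannot spin forever; the post itself only says the loop runs until the gap count is zero.

```python
from typing import Callable


def implement_converge_loop(
    implement: Callable[[list[str]], None],
    converge: Callable[[], list[str]],
    open_tasks: list[str],
    max_rounds: int = 10,
) -> int:
    """Alternate implement and converge until converge reports no gaps.

    Returns the number of rounds used; raises if the loop does not settle.
    """
    for round_no in range(1, max_rounds + 1):
        implement(open_tasks)
        open_tasks = converge()  # gaps come back as new tasks
        if not open_tasks:
            return round_no      # gap count is zero: done
    raise RuntimeError(f"still {len(open_tasks)} gaps after {max_rounds} rounds")


# Stub run: the first converge finds three gaps, the second finds none.
gap_rounds = [["T028", "T029", "T030"], []]
implemented: list[str] = []
rounds = implement_converge_loop(
    implemented.extend, lambda: gap_rounds.pop(0), ["T001"]
)
print(rounds, implemented)
```

The exit condition is the converge result, not the implement result, which is the whole point of the loop.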

This loop redeems, at the end, the promise of the "verifiable spec" from Part 1. Done is not "the moment the AI declares it" but "the moment spec and code are aligned."

6. Trust & Verification Practices — and Pitfalls to Avoid

Running implement and converge well ultimately comes down to a few habits. Where the tooling does not enforce discipline, the human must.

Practices to keep

Habit	Why it matters
Small, reviewable units	Small tasks make small diffs, and small diffs actually get reviewed. One or two tasks per PR.
Tests as gates	Not "looks like it passes" but "the test passes." Don't stack the next task on a red test.
Maintain traceability	Put the task ID and FR in commit/PR messages so the code → task → requirement path never breaks.
Don't skip converge	Even when every implement test passes, it isn't "done" until converge.
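
The traceability habit is easy to check mechanically. Below is a minimal sketch of a commit or PR message check; the ID formats (`T` plus three digits, `FR-` or `NFR-` plus three digits) are inferred from the examples in this post, and wiring it into a commit hook or CI job is left as an assumption.

```python
import re

TASK_ID = re.compile(r"\bT\d{3}\b")
REQ_ID = re.compile(r"\b(?:FR|NFR)-\d{3}\b")


def missing_trace_ids(message: str) -> list[str]:
    """Return which kinds of ID a commit or PR message is missing."""
    missing = []
    if not TASK_ID.search(message):
        missing.append("task ID (e.g. T014)")
    if not REQ_ID.search(message):
        missing.append("requirement ID (e.g. FR-007)")
    return missing


print(missing_trace_ids("T014: route alerts by severity (FR-007)\n\nCloses #14"))
print(missing_trace_ids("fix router"))
```

An empty list means the message keeps the code → task → requirement path intact; anything else names what to add before merging.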

Common pitfalls

Blindly accepting bulk output — when implement dumps 20 tasks at once and you scroll past "because the diff is long" and approve, traceability becomes decoration. Traceability you don't review is not traceability.
Implementation without tests — no tests means no gate, and no gate means converge has no objective pass criterion to compare against. You slide back to "seems to work."
Mistaking implement for the end — the most common and most expensive pitfall. Every task checkbox being filled is not the end. Tasks can pass individually while the spec as a whole still has gaps, and converge is the step that closes them. Stop at implement and you quietly slide back into Part 1's "unverifiability."

One line: trust is not "believing the AI" but "making the path verifiable." Small units, test gates, unbroken traceability, and converge — these four are what let AI-written code reach production.

Wrapping up

In Part 6 we turned the blueprint into a building. With /speckit.implement we moved dq-monitor's tasks into code and tests in dependency order — not "fire and forget" but reviewed in task-sized units with tests as gates. With /speckit.taskstoissues we wired those tasks to GitHub issues, PRs, and boards, dissolving SDD into the team's daily flow. And with /speckit.converge we compared code against the spec, recovered residual work, and closed the question of "are we actually done?"

The core never changes — the secret to trusting AI-written code is not a smarter model but a traceable path from code to task, to requirement, to constitution. Done is not a declaration but a proof.

In Part 7, the finale, we bring every step covered so far together. We'll close the series with an end-to-end case study that builds dq-monitor from zero to one, constitution through convergence, and a retrospective on the pitfalls and lessons we actually hit along the way.
