[Spec Kit Part 7] In Practice — Building a Data Quality Monitoring Service with SDD (Retrospective)
An end-to-end case study of building a real-time data quality monitoring service (dq-monitor) with SDD, from constitution to converge. Honest numbers, the pitfalls we hit, and how to decide when to switch SDD on or off — closing the 7-part series.
Over six installments we walked, one step at a time, from concept (Part 1) through installing the tooling (Part 2), writing a constitution (Part 3), specify and clarify (Part 4), plan and task breakdown (Part 5), and implement and converge (Part 6). But explaining the phases in isolation leaves the most pressing question unanswered: "So what does it actually feel like to run the whole thing end to end, from an empty repo to a working service? Where does the time go, what do you lose and gain, and where do you fall down?" This final post is that integrated retrospective. We retrace, without varnish, the 0→1 journey of building dq-monitor, the real-time data quality monitoring service we've built across the series, with SDD.
What you'll learn in this post
- A one-view map of what concrete artifact each phase from constitution → converge produced for dq-monitor
- How time split between "spec/clarify vs implement," and what the upfront spec cost bought you later (honest numbers)
- The 5 pitfalls we actually hit and how we fixed them — the most valuable section here
- A decision table for when to switch SDD on vs when it's overkill
- An adoption checklist you can apply to your next feature this week
This is Part 7 — the finale — of the Spec Kit series. The retrospective stands on its own even if you skipped the earlier six, but for the details of each phase, read it alongside the relevant part.
Before we begin: The numbers in this post — times, ratios, counts — are organized, illustrative values distilled from one real build. They are not benchmarks; read them as "this is the shape of the distribution," not absolute figures. Your team, domain, and model version will shift them.
1. The whole journey on one page
Start with the big picture. The table below gathers, for each SDD phase, what it produced in dq-monitor and which part of the series covered it. This single table is effectively the series table of contents and the skeleton of this retrospective.
| Phase | Command | One-line meaning | Artifact produced for dq-monitor | Covered in |
|---|---|---|---|---|
| Constitution | /speckit.constitution | Governing principles | .specify/memory/constitution.md — data accuracy, observability, testing, performance | Part 3 |
| Specification | /speckit.specify | What & why (requirements, user stories) | specs/001-dq-monitor/spec.md — freshness, validity, anomaly monitoring + alerting | Part 4 |
| Clarification | /speckit.clarify | Fill gaps with sequential questions | specs/001-dq-monitor/clarifications.md — thresholds, dedup, SLA decisions | Part 4 |
| Planning | /speckit.plan | How (tech design, architecture) | plan.md + data-model.md + contracts/ + research.md | Part 5 |
| Tasks | /speckit.tasks | Ordered, dependency-aware breakdown | tasks.md — T001~T0NN, order & parallel markers | Part 5 |
| Analysis | /speckit.analyze | Cross-validate spec/plan/tasks | Consistency report — gaps/contradictions and fixes | Part 5 |
| Implementation | /speckit.implement | Execute tasks to produce code | The actual dq-monitor codebase | Part 6 |
| Convergence | /speckit.converge | Assess code vs artifacts, recover remaining work | Augmented tasks.md + completion check | Part 6 |
The crucial point: every one of these artifacts is a plain-text file inside the repo (the concrete reality of the "no agent lock-in" we stressed in Part 1). We built dq-monitor with Claude Code, but you could carry the same specs/001-dq-monitor/ folder to a different agent and the assets would still be alive and intact.
2. From empty repo to working service — a condensed walkthrough
Now the same journey in chronological order, following what each /speckit.* actually emitted, with short excerpts. We won't paste whole files again (that's the job of Parts 4–6). Here we only look at "what you had in hand when this phase ended."
2.1 The line after git init — /speckit.constitution
The first thing we did in the empty repo was not code but principles. Since dq-monitor handles data, the constitution's first article was "data accuracy takes precedence over availability." That is, rather than let suspect data through, we block and alert.
# dq-monitor Constitution (excerpt)
## Principle 1. Data accuracy first
- Uncertain check results are marked UNKNOWN, not PASS.
## Principle 2. Observability
- Every check run leaves structured logs + metrics.
## Principle 3. Testing
- Every check rule ships with healthy/anomalous fixtures.
## Principle 4. Performance
- A single check batch completes within 5 minutes for a 10M-row dataset.
These four lines of principle get repeatedly cited later in planning and implementation. "Performance: 5 minutes" becomes the rationale for the batch architecture in plan, and "mark UNKNOWN" lands directly in the status enum of data-model.
2.2 What & why — /speckit.specify
Next, without writing a single line of technology, we captured what and why. There were three core user stories.
| ID | User story (summary) |
|---|---|
| US-1 | As a data engineer, I get alerted when table freshness exceeds the SLA, so I immediately know about pipeline delays |
| US-2 | As a data engineer, I detect null/unique/range rule violations on key columns, to prevent downstream contamination |
| US-3 | As an analyst, I detect statistical anomalies in daily row counts, to catch missing/duplicate loads early |
The artifact spec.md deliberately has no "Postgres" and no "Kafka." It has only what (freshness/validity/anomaly monitoring + alerting) and why (preventing downstream contamination). Technology choices belong to the next phase.
2.3 Filling the gaps — /speckit.clarify
The moment specify finished, we ran clarify, and the decisions the spec had quietly left blank surfaced as questions. Some of the gaps we filled by answering sequential questions:
- Who sets the freshness threshold? → Per-table config (YAML), defaulting to 24 hours if unset.
- Alert every time the same violation repeats? → No. Dedup by (check, target) key for 60 minutes, plus one recovery alert on resolution.
- What's the definition of an "anomaly"? → Deviation beyond mean±3σ over a 28-day window; hold judgment (UNKNOWN) until at least 14 days of data have accumulated.
Without these three lines, the AI at implementation time would have "plausibly" hardcoded thresholds, fired alerts every minute, and declared anomalies off two days of data. Clarify isn't about reducing the AI's freedom — it's about returning decisions that humans must own back to humans.
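The first decision is small enough to sketch. The following is a minimal illustration, not the project's actual code: it assumes the per-table YAML has already been loaded into a plain dict, and the names `DEFAULT_FRESHNESS_HOURS` and `freshness_status` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# From clarify: tables without an explicit threshold default to 24 hours.
DEFAULT_FRESHNESS_HOURS = 24


def freshness_status(
    table: str,
    last_updated: Optional[datetime],
    now: datetime,
    thresholds_hours: dict[str, float],
) -> str:
    """Return PASS, FAIL, or UNKNOWN for one table's freshness check."""
    if last_updated is None:
        # Principle 1: an uncertain result is UNKNOWN, never PASS.
        return "UNKNOWN"
    limit = timedelta(hours=thresholds_hours.get(table, DEFAULT_FRESHNESS_HOURS))
    return "PASS" if now - last_updated <= limit else "FAIL"


# Usage: "orders" has no entry, so the 24-hour default applies.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(freshness_status("orders", now - timedelta(hours=25), now, {}))
```

The point of the sketch is that the default lives in one named constant that traces back to a clarify answer, rather than in a number the AI picked on its own.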
2.4 How — /speckit.plan
Only now does technology appear. plan took the constitution and spec as input and designed the architecture, emerging not as one file but as several artifacts.
| Artifact | Contents (excerpt) |
|---|---|
| plan.md | Scheduler → check workers → result store → alert dispatcher. Python + APScheduler, results in Postgres. |
| data-model.md | Check, CheckRun(status: PASS/FAIL/UNKNOWN), Incident, Notification entities |
| contracts/ | Check-rule plugin interface, alert-channel interface (Slack/Webhook) |
| research.md | Rationale for tech choices ("why APScheduler") and alternatives compared |
Here the constitution does its work. The UNKNOWN in CheckRun.status is no accident — it's Principle 1 translated into the data model.
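As a rough sketch of what that translation might look like in Python: the enum values come from data-model.md as quoted above, while the `CheckRun` field names and the `resolve_status` helper are assumptions made for illustration, since the full data model is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class CheckStatus(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"  # Principle 1: uncertain results are never PASS


@dataclass(frozen=True)
class CheckRun:
    # Hypothetical fields; only the status values are taken from the post.
    check_id: str
    target: str
    started_at: datetime
    status: CheckStatus


def resolve_status(passed: Optional[bool]) -> CheckStatus:
    """Map a rule outcome to a status; None means the rule could not decide."""
    if passed is None:
        return CheckStatus.UNKNOWN
    return CheckStatus.PASS if passed else CheckStatus.FAIL
```

Making UNKNOWN a first-class value means a rule that cannot decide has somewhere honest to land, instead of being coerced into a boolean.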
2.5 Breaking into tasks — /speckit.tasks
tasks broke the plan into a checkable, dependency-ordered task list.
# tasks.md (excerpt)
- [ ] T001 Project scaffolding / dependencies / lint & test setup
- [ ] T002 [P] data-model entities + migration
- [ ] T003 [P] check-rule plugin interface (from contracts)
- [ ] T004 Freshness check rule + fixtures (US-1)
- [ ] T005 Validity check rule (null/unique/range) + fixtures (US-2)
- [ ] T006 Anomaly check rule (28d/3σ) + fixtures (US-3)
- [ ] T007 Alert dispatcher + 60-min dedup
- [ ] T008 Scheduler integration + end-to-end smoke test
[P] marks parallelizable work, and each task tracks which user story (US-n) it satisfies. That traceability later turns "why does this code exist?" into the instant answer "T005 → US-2."
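Because tasks.md is plain text, that traceability is easy to query with a few lines of code. The sketch below assumes the line format shown in the excerpt (checkbox, task ID, optional [P], title, optional "(US-n)"); the real file may carry more structure, and `parse_tasks` is a hypothetical helper, not part of Spec Kit.

```python
import re
from dataclasses import dataclass

# Matches lines like "- [ ] T004 Freshness check rule + fixtures (US-1)".
TASK_LINE = re.compile(
    r"^- \[(?P<done>[ xX])\] (?P<id>T\d+)(?P<par> \[P\])? (?P<title>.+)$"
)
STORY_REF = re.compile(r"\(US-(\d+)\)")


@dataclass
class Task:
    task_id: str
    done: bool
    parallel: bool
    title: str
    stories: list[str]


def parse_tasks(text: str) -> list[Task]:
    """Parse checkbox task lines; lines that do not match are skipped."""
    tasks = []
    for line in text.splitlines():
        m = TASK_LINE.match(line.strip())
        if m is None:
            continue
        title = m.group("title")
        tasks.append(Task(
            task_id=m.group("id"),
            done=m.group("done").lower() == "x",
            parallel=m.group("par") is not None,
            title=title,
            stories=[f"US-{n}" for n in STORY_REF.findall(title)],
        ))
    return tasks
```

With the tasks parsed, "which task covers US-2?" becomes a one-line filter over `stories`.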
2.6 The checkpoint before merging — /speckit.analyze
Right before implementation, we cross-validated spec, plan, and tasks with analyze. What it actually caught:
- The spec's US-3 (anomalies) required "hold judgment until at least 14 days accumulated," but T006 in tasks.md was missing that hold logic. → Split T006 and add a "cold-start guard" subtask.
- plan named two channels, Slack and Webhook, but tasks had no Webhook work. → Make the channel abstraction explicit in T007.
Those two finds are analyze's payback. Defects that would have meant rework if discovered after implementation were headed off by editing tasks.md, before a line of code was written.
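To make the cold-start guard concrete, here is a minimal sketch of the anomaly rule as clarified (28-day window, mean±3σ, UNKNOWN below 14 days). It uses only the standard library; the choice of population standard deviation and the function name `anomaly_status` are assumptions, and the post's actual implementation is not shown.

```python
from statistics import mean, pstdev

WINDOW_DAYS = 28       # from clarify: rolling window
MIN_HISTORY_DAYS = 14  # from clarify: hold judgment below this
SIGMA_LIMIT = 3.0


def anomaly_status(history: list[int], today: int) -> str:
    """Classify today's row count against the trailing window of daily counts."""
    window = history[-WINDOW_DAYS:]
    if len(window) < MIN_HISTORY_DAYS:
        # Cold-start guard: too little data to judge, so UNKNOWN rather than PASS.
        return "UNKNOWN"
    mu = mean(window)
    sigma = pstdev(window)  # assumption: population standard deviation
    return "FAIL" if abs(today - mu) > SIGMA_LIMIT * sigma else "PASS"
```

Note that the guard is the first branch: without it, the function would happily compute a mean and σ over two days of data, which is exactly the behavior clarify ruled out.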
2.7 The actual code — /speckit.implement
Here real code appears for the first time. implement executed tasks.md top to bottom. The key point: this phase wasn't "write it all in one shot" — it proceeded task by task. When T002 finished, the migration ran; when T004 finished, the freshness-check fixture tests went green. Each task automatically enforced Principle 3 ("every rule ships with fixtures").
2.8 The final reconciliation — /speckit.converge
When implementation "seemed done," we ran converge. The key is that this isn't a simple "are we done?" check — it's the phase that assesses the codebase against the artifacts and recovers missing work back into tasks. What converge recovered:
- Alert dedup was implemented, but the recovery (resolved) alert (the one-time recovery alert agreed in clarify) was missing → added as a new task.
- The "check-result retention" policy mentioned in research.md wasn't in the code → turned into a task.
Without converge, those two would have stayed as debt for "someone to discover later."
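The first of those gaps is easy to picture in code. Below is a minimal in-memory sketch of 60-minute dedup with a single recovery alert. The class name `AlertDeduper`, the return values, and the choice to measure the window from the last alert actually sent are assumptions, not the project's implementation.

```python
from datetime import datetime, timedelta
from typing import Optional

DEDUP_WINDOW = timedelta(minutes=60)  # from clarify


class AlertDeduper:
    """Suppress repeat alerts per (check, target) and emit one recovery alert."""

    def __init__(self) -> None:
        # (check, target) -> time of the last violation alert actually sent
        self._last_sent: dict[tuple[str, str], datetime] = {}

    def on_result(
        self, check: str, target: str, failed: bool, now: datetime
    ) -> Optional[str]:
        """Return "ALERT", "RECOVERED", or None when nothing should be sent."""
        key = (check, target)
        if failed:
            last = self._last_sent.get(key)
            if last is not None and now - last < DEDUP_WINDOW:
                return None  # same violation inside the 60-minute window
            self._last_sent[key] = now
            return "ALERT"
        if key in self._last_sent:
            # First healthy result after an open violation: one recovery alert.
            del self._last_sent[key]
            return "RECOVERED"
        return None
```

The recovery branch is the part that was missing: dedup alone passes every "does it suppress repeats?" test, which is why the gap only showed up when the code was compared against the clarify decision.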
3. Honest numbers and trade-offs
Now the most sensitive question. So where did the time go? Did the upfront cost spent on the spec pay for itself? Again, the numbers below are illustrative values distilled from one build.
3.1 Time split by phase (illustrative)
| Phase | Rough share | Character |
|---|---|---|
| constitution | 5% | Written once, reused across the series |
| specify | 12% | Human time organizing thought |
| clarify | 13% | Time answering questions (looks high, but buys back later rework) |
| plan | 15% | AI-driven, human review |
| tasks | 8% | Nearly automatic |
| analyze | 7% | Short but high recovery rate |
| implement | 35% | The actual coding |
| converge | 5% | Closing recovery |
What stands out is that specify, clarify, and analyze combined took about a third (12% + 13% + 7% = 32%). With vibe coding you could skip that whole third and go straight to implement. That is SDD's upfront cost. Honestly: on small work, that upfront is simply a loss.
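For readers who want to redo this arithmetic with their own numbers, the table reduces to a few lines. The shares are the illustrative values from the table, not measurements to reuse.

```python
# Illustrative shares from the table in 3.1, in percent.
shares = {
    "constitution": 5, "specify": 12, "clarify": 13, "plan": 15,
    "tasks": 8, "analyze": 7, "implement": 35, "converge": 5,
}

# Sanity check: the shares should cover the whole build.
assert sum(shares.values()) == 100

# The "upfront" gates a vibe-coding run would skip.
upfront = shares["specify"] + shares["clarify"] + shares["analyze"]

# Everything that happens before the first line of code is written.
before_code = sum(v for k, v in shares.items() if k not in ("implement", "converge"))

print(upfront)      # 32
print(before_code)  # 60
```

Swapping in your own percentages after a pilot cycle gives you the same two figures for your team.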
3.2 What that upfront bought downstream
In exchange, that upfront was recovered three ways downstream.
- Less rework: analyze caught 2 and converge caught 2 defects "before/just after writing code." Found at the integration-test stage instead, the same defects would have cost several times more in debugging and re-implementation.
- Review speed: because all code traces T00n → US-n, reviewers rarely had to round-trip on "why is this needed."
- Zero re-entry cost: come back days later and reading spec.md alone restores context (the structural cure for Part 1's "context loss").
3.3 Same feature, two ways
| Item | Vibe coding | SDD |
|---|---|---|
| Time to first working | Fast | Slow (spec upfront) |
| Stability on the second request | Fragile | Stable (a spec baseline exists) |
| When defects are found | Integration/production (late) | At analyze and converge (named checkpoints) |
| Reviewability | "By feel" | Comparable against spec |
| Re-entry days later | Context rebuild needed | Restored by reading spec.md |
| Team collaboration | Verbal agreement | File = contract |
| Throwaway script | Wins | Overhead only |
| Multi-touch production feature | Debt accumulates | Wins |
The conclusion is simple. SDD has clear overhead, and that overhead only pays for itself when a "second request" is coming. If the second request never comes, SDD is overkill. That judgment is the heart of Section 5.
4. The pitfalls we hit (the most valuable part)
A failure story is worth more than a smooth success story. Here are the five places we actually fell down while building dq-monitor, and how we got back up — framed as lessons.
Pitfall 1 — Technology leaking into the spec
In the first specify draft we wrote US-1 as "poll the updated_at of the Postgres table to check freshness." See it? "How" (Postgres, polling updated_at) crept into the slot that should hold "what."
- Symptom: when a spec presupposes a specific implementation, the plan phase loses design freedom, and changing technology later forces you to edit the spec too.
- Fix: revert US-1 to "detect when table freshness exceeds the SLA," and push "how do we measure freshness (updated_at? metadata?)" down into clarify and plan.
- Lesson: when you see proper nouns (products, libraries, column names) in a spec, be suspicious. Those are usually decisions that belong down in plan.
Pitfall 2 — Skipping clarify
Once, saying "this is obvious," we sent a small follow-up feature (adding an alert channel) straight to plan without clarify. The result: implement invented an alert-failure retry policy on its own (infinite retry → risk of an alert storm in production).
- Symptom: "it's obvious" is a human illusion. Decisions self-evident to a human (retry count, backoff) are blanks to the AI, and the AI fills blanks plausibly.
- Fix: lift the retry policy to a clarify item and codify it as "max 3 retries, exponential backoff, then dead-letter."
- Lesson: clarify isn't "for complex features"; it's "for features where humans unconsciously assume decisions." That's nearly every feature.
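The clarified policy fits in one small function. This is a sketch under stated assumptions: the 1-second base delay is not specified in the post, the dead-letter store is modeled as a plain list, and `send_with_retry` is a hypothetical name. Real code would catch channel-specific errors rather than bare `Exception`.

```python
import time
from typing import Callable

MAX_RETRIES = 3           # from the clarified policy
BASE_DELAY_SECONDS = 1.0  # assumption: the base delay is not stated in the post


def send_with_retry(
    send: Callable[[dict], None],
    alert: dict,
    dead_letter: list[dict],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try once, retry up to MAX_RETRIES times with exponential backoff, then dead-letter."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            send(alert)
            return True
        except Exception:
            if attempt == MAX_RETRIES:
                break  # retries exhausted
            sleep(BASE_DELAY_SECONDS * 2 ** attempt)  # 1s, 2s, 4s
    dead_letter.append(alert)
    return False
```

The loop is bounded by construction, which is the property the invented infinite-retry version lacked. Passing `sleep` as a parameter keeps the function testable without real delays.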
Pitfall 3 — Tasks too coarse
The early tasks.md had a single task (T0xx) called "implement the check engine." When implement took it, it built freshness, validity, and anomaly as one lump, and the intermediate verification points vanished. When one part broke, tracing where it went wrong was hard.
- Symptom: a coarse task = a big one-shot = unverifiable. The essence of SDD, "verify each step," collapses at the level of task granularity.
- Fix: split that one task into freshness/validity/anomaly + fixtures each (today's T004~T006). Make each task go green independently.
- Lesson: a task's upper bound is "reviewable in one sitting." If one task touches multiple user stories, split it.
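The "multiple user stories" rule of thumb can be checked mechanically. The sketch below assumes tasks carry their story references inline as "US-n", as in the tasks.md excerpt; the annotation on the coarse task in the usage example is hypothetical, since the original task line is not shown.

```python
import re

STORY_REF = re.compile(r"US-\d+")


def tasks_to_split(task_lines: list[str]) -> list[str]:
    """Return task lines that reference more than one distinct user story."""
    flagged = []
    for line in task_lines:
        if len(set(STORY_REF.findall(line))) > 1:
            flagged.append(line)
    return flagged


# Usage: the coarse task is flagged, the narrow one is not.
lines = [
    "- [ ] T0xx Implement the check engine (US-1, US-2, US-3)",  # hypothetical annotation
    "- [ ] T004 Freshness check rule + fixtures (US-1)",
]
print(tasks_to_split(lines))
```

A check like this does not measure "reviewable in one sitting," but it does catch the specific failure described here before implement picks the task up.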
Pitfall 4 — Accepting implement output without converge
In dq-monitor's first cycle we accepted implement's "done" at face value. Running converge revealed that the two items in §2.8 (recovery alert, retention) were silently missing. "Code that looks like it works" and "code that satisfies the spec" are not the same.
- Symptom: implement reports only what it did. It does not report what it didn't do. To fill that gap a human would have to re-read all the artifacts, and converge automates exactly that.
- Fix: make converge a mandatory step right after implement. Take what converge recovers as new tasks and run one more cycle.
- Lesson: implement's "done" is a proposal, not a declaration. It isn't finished until you reconcile against the spec with converge.
Pitfall 5 — A constitution nobody enforced
The constitution we set up so elegantly in Part 3 was barely cited by early implement. Even with a "performance: 5 minutes" principle, the first anomaly check was written as a row-by-row loop and was slow. A constitution that exists only as a file and isn't wired into the prompts is decoration.
- Symptom: if the constitution doesn't actually enter implement's context, the AI doesn't know those principles.
- Fix: in plan and tasks, state which constitution principle each task satisfies/risks (e.g., a note on T006: "Principle 4: vectorization required"). Make analyze also check constitution violations via a checklist.
- Lesson: a constitution must live as "cited at every step," not as "written down once." A principle that isn't enforced is not a principle.
5. When to switch SDD on vs when it's overkill
For this series to introduce the tool honestly, it must also clearly say "when not to use it." Here we revisit Part 1's message in light of the case study.
| Situation | SDD? | Why |
|---|---|---|
| Multi-touch production feature | Yes | The "second request" is certain. Spec cost is recovered. |
| Feature with ambiguous requirements | Yes | clarify surfaces humans' hidden assumptions. |
| Code a team works on together | Yes | The file becomes a contract, preventing verbal agreements from evaporating. |
| Days-to-weeks of work | Yes | On re-entry, spec.md restores context. |
| Throwaway script | No | No second request. The upfront is pure loss. |
| 10-line bug fix | No | Reading the code beats reading a spec. |
| Earliest exploratory prototyping | No / partial | When you don't yet know what to build, a spec ties your hands. |
The one core principle: quality gates are dials, not tolls. SDD isn't switching constitution, clarify, analyze, and checklist on by obligation every time — SDD done well is the judgment to switch gates on and off to match the weight of the work. For multi-touch, multi-person, ambiguous work like dq-monitor, turn all gates on; for a 10-line patch, turn them all off.
6. An adoption checklist you can use this week
A practical checklist for anyone who wants to pilot SDD on their next single feature. Don't do it all at once — switch them on from the top.
- Pick a candidate feature: one where a second request is coming (not a throwaway script).
- Lay SDD scaffolding into the project with the specify CLI (see Part 2).
- Write 3–5 principles with /speckit.constitution. Keep them short. Long ones go unread.
- With /speckit.specify, write only what & why, leaving out technology. If you see proper nouns, push them down to plan (Pitfall 1).
- Do not skip /speckit.clarify. Be suspicious of the "this is obvious" feeling (Pitfall 2).
- Design the architecture with /speckit.plan, and wire each design decision to a constitution principle (Pitfall 5).
- Break down with /speckit.tasks, and check that each task is reviewable in one sitting (Pitfall 3).
- Cross-validate spec, plan, and tasks with /speckit.analyze before implement.
- Run /speckit.implement task by task, confirming green tests at the end of each task.
- Run /speckit.converge as a mandatory step to recover missing work. Don't trust "seems done" (Pitfall 4).
- After one cycle, write down the time spent and the rework recovered. That's your basis for the next feature's gate-dial judgment.
7. Closing the series — the arc of seven parts
Let's rewind all seven installments into one arc.
| Part | Topic | One-line summary | That moment in dq-monitor |
|---|---|---|---|
| Part 1 | Why SDD | Three failure modes of vibe coding; spec = truth | Problem definition |
| Part 2 | Getting started | specify install, init, Claude Code wiring | Empty-repo setup |
| Part 3 | Constitution | Principles for a data team | Accuracy/observability/testing/performance principles |
| Part 4 | Specify & Clarify | What & why + remove ambiguity | US-1~3, threshold & dedup decisions |
| Part 5 | Plan & Tasks | Architecture, tasks, consistency | plan/data-model/contracts, tasks, analyze |
| Part 6 | Implement & Converge | Implementation, issue linking, completion check | Codebase + recovered remainder |
| Part 7 (this post) | Practice & retrospective | 0→1 case study, pitfalls, judgment | The full integrated retrospective |
Once more, the core thesis
The single sentence running through the series stayed the same from start to finish.
The source of truth is the spec, not the code. That spec is the one contract humans and AI read together, verify together, and revise together.
The reason vibe coding collapses on the "second request" was never that the AI is dumb — it's that there was nowhere recording what is correct. SDD puts the spec in that empty slot. The entire dq-monitor journey was a proof of this one sentence. analyze catching defects, converge recovering the remainder, context restored days later by reading spec.md alone — all of it "because a comparison baseline called the spec existed."
Onward — scaling to teams (a teaser for what's next)
This series got as far as one developer building one feature with SDD. The real challenge is what comes next: a whole team running SDD in a consistent way. Spec Kit has extension mechanisms for exactly that.
- Extensions: add new slash commands to slot in your team's own phases.
- Presets: customize workflows to lock in defaults like "our team always switches these gates on."
- Bundles: package extensions, presets, and workflows to provision teams by role — a data-engineer bundle, a front-end bundle, and so on.
And all of this is possible because of that fact from Part 1: all artifacts are plain-text files, so there's no agent lock-in (compatible with 30+ agents). Even if the team changes tools, the assets — spec, plan, tasks — come right along.
This "team-scale SDD operations" is a whole series in itself. We'll be back another time with the story of packaging a data team's standard SDD workflow into presets and bundles. Thank you for staying through all seven parts. Now — start your next feature with a single line of /speckit.specify.