
[Spec Kit Part 7] In Practice — Building a Data Quality Monitoring Service with SDD (Retrospective)

An end-to-end case study of building a real-time data quality monitoring service (dq-monitor) with SDD, from constitution to converge. Honest numbers, the pitfalls we hit, and how to decide when to switch SDD on or off — closing the 7-part series.

Data Dynamics · June 17, 2026 · 19 min read

Over six installments we walked from concept (Part 1) through installing the tooling (Part 2), writing a constitution (Part 3), specify and clarify (Part 4), plan and task breakdown (Part 5), and implement and converge (Part 6), one step at a time. But explaining the phases in isolation leaves the most pressing question unanswered: "What does it actually feel like to run the whole thing end to end, from an empty repo to a working service? Where does the time go, what do you lose and gain, and where do you fall down?" This final post is that integrated retrospective. We retrace, without varnish, the 0→1 journey of building dq-monitor, the real-time data quality monitoring service we've built across the series, with SDD.

What you'll learn in this post

A one-view map of what concrete artifact each phase from constitution → converge produced for dq-monitor

How time split between "spec/clarify vs implement," and what the upfront spec cost bought you later (honest numbers)

The 5 pitfalls we actually hit and how we fixed them — the most valuable section here

A decision table for when to switch SDD on vs when it's overkill

An adoption checklist you can apply to your next feature this week

This is Part 7 — the finale — of the Spec Kit series. The retrospective stands on its own even if you skipped the earlier six, but for the details of each phase, read it alongside the relevant part.

Before we begin: The numbers in this post — times, ratios, counts — are organized, illustrative values distilled from one real build. They are not benchmarks; read them as "this is the shape of the distribution," not absolute figures. Your team, domain, and model version will shift them.

1. The whole journey on one page

Start with the big picture. The table below gathers, for each SDD phase, what it produced in dq-monitor and which part of the series covered it. This single table is effectively the series table of contents and the skeleton of this retrospective.

Phase	Command	One-line meaning	Artifact produced for `dq-monitor`	Covered in
Constitution	`/speckit.constitution`	Governing principles	`.specify/memory/constitution.md` — data accuracy, observability, testing, performance	Part 3
Specification	`/speckit.specify`	What & why (requirements, user stories)	`specs/001-dq-monitor/spec.md` — freshness, validity, anomaly monitoring + alerting	Part 4
Clarification	`/speckit.clarify`	Fill gaps with sequential questions	`specs/001-dq-monitor/clarifications.md` — thresholds, dedup, SLA decisions	Part 4
Planning	`/speckit.plan`	How (tech design, architecture)	`plan.md` + `data-model.md` + `contracts/` + `research.md`	Part 5
Tasks	`/speckit.tasks`	Ordered, dependency-aware breakdown	`tasks.md` — T001~T0NN, order & parallel markers	Part 5
Analysis	`/speckit.analyze`	Cross-validate spec/plan/tasks	Consistency report — gaps/contradictions and fixes	Part 5
Implementation	`/speckit.implement`	Execute tasks to produce code	The actual `dq-monitor` codebase	Part 6
Convergence	`/speckit.converge`	Assess code vs artifacts, recover remaining work	Augmented `tasks.md` + completion check	Part 6


The crucial point: every one of these artifacts is a plain-text file inside the repo (the concrete reality of the "no agent lock-in" we stressed in Part 1). We built dq-monitor with Claude Code, but you could carry the same specs/001-dq-monitor/ folder to a different agent and the assets would still be alive and intact.

2. From empty repo to working service — a condensed walkthrough

Now the same journey in chronological order, following what each /speckit.* actually emitted, with short excerpts. We won't paste whole files again (that's the job of Parts 4–6). Here we only look at "what you had in hand when this phase ended."

2.1 The line after `git init` — `/speckit.constitution`

The first thing we did in the empty repo was not code but principles. Since dq-monitor handles data, the constitution's first article was "data accuracy takes precedence over availability." That is, rather than let suspect data through, we block and alert.

# dq-monitor Constitution (excerpt)
## Principle 1. Data accuracy first
- Uncertain check results are marked UNKNOWN, not PASS.
## Principle 2. Observability
- Every check run leaves structured logs + metrics.
## Principle 3. Testing
- Every check rule ships with healthy/anomalous fixtures.
## Principle 4. Performance
- A single check batch completes within 5 minutes for a 10M-row dataset.

These four principles are cited repeatedly later, in planning and implementation. "Performance: 5 minutes" becomes the rationale for the batch architecture in plan.md, and "mark UNKNOWN" lands directly in the status enum of data-model.md.

2.2 What & why — `/speckit.specify`

Next, without writing a single line of technology, we captured what and why. There were three core user stories.

ID	User story (summary)
US-1	As a data engineer, I get alerted when table freshness exceeds the SLA, so I immediately know about pipeline delays
US-2	As a data engineer, I detect null/unique/range rule violations on key columns, to prevent downstream contamination
US-3	As an analyst, I detect statistical anomalies in daily row counts, to catch missing/duplicate loads early

The artifact spec.md deliberately has no "Postgres" and no "Kafka." It has only what (freshness/validity/anomaly monitoring + alerting) and why (preventing downstream contamination). Technology choices belong to the next phase.

2.3 Filling the gaps — `/speckit.clarify`

The moment specify finished, we ran clarify, and the decisions the spec had quietly left blank surfaced as questions. Some of the gaps we filled by answering sequential questions:

Who sets the freshness threshold? → Per-table config (YAML), defaulting to 24 hours if unset.
Alert every time the same violation repeats? → No. Dedup by (check, target) key for 60 minutes, plus one recovery alert on resolution.
What's the definition of an "anomaly"? → Deviation beyond mean±3σ over a 28-day window; hold judgment (UNKNOWN) until at least 14 days of data have accumulated.

Without these three lines, the AI at implementation time would have "plausibly" hardcoded thresholds, fired alerts every minute, and declared anomalies off two days of data. Clarify isn't about reducing the AI's freedom — it's about returning decisions that humans must own back to humans.
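The first clarify decision (per-table threshold, 24-hour default) can be sketched as below. This is a minimal illustration, not the dq-monitor code: the `TABLE_CONFIG` dict stands in for the loaded YAML, and the `freshness_hours` key, the table names, and both function names are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_FRESHNESS = timedelta(hours=24)  # default agreed in clarify

# Hypothetical per-table config, as it might look after loading the YAML file.
TABLE_CONFIG = {
    "orders": {"freshness_hours": 2},
    "customers": {},  # no threshold set: falls back to the 24-hour default
}


def freshness_threshold(table: str, config: dict) -> timedelta:
    """Return the table's configured threshold, or the 24-hour default."""
    hours = config.get(table, {}).get("freshness_hours")
    return timedelta(hours=hours) if hours is not None else DEFAULT_FRESHNESS


def check_freshness(table: str, last_loaded_at: Optional[datetime],
                    now: datetime, config: dict) -> str:
    """Return "PASS", "FAIL", or "UNKNOWN" for one table."""
    if last_loaded_at is None:
        # Principle 1: an uncertain result is UNKNOWN, never PASS.
        return "UNKNOWN"
    age = now - last_loaded_at
    return "PASS" if age <= freshness_threshold(table, config) else "FAIL"


now = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
three_hours_ago = now - timedelta(hours=3)
print(check_freshness("orders", three_hours_ago, now, TABLE_CONFIG))     # FAIL
print(check_freshness("customers", three_hours_ago, now, TABLE_CONFIG))  # PASS
print(check_freshness("customers", None, now, TABLE_CONFIG))             # UNKNOWN
```

The same three-hour-old load fails for `orders` (2-hour threshold) and passes for `customers` (default), which is exactly the kind of behavior a hardcoded threshold would have hidden.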

2.4 How — `/speckit.plan`

Only now does technology appear. plan took the constitution and spec as input and designed the architecture, emerging not as one file but as several artifacts.

Artifact	Contents (excerpt)
`plan.md`	Scheduler → check workers → result store → alert dispatcher. Python + APScheduler, results in Postgres.
`data-model.md`	`Check`, `CheckRun(status: PASS/FAIL/UNKNOWN)`, `Incident`, `Notification` entities
`contracts/`	Check-rule plugin interface, alert-channel interface (Slack/Webhook)
`research.md`	Rationale for tech choices ("why APScheduler") and alternatives compared

Here the constitution does its work. The UNKNOWN in CheckRun.status is no accident — it's Principle 1 translated into the data model.
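To make that translation concrete, here is a rough sketch of how the status enum and a check-rule plugin interface might look. Only the entity name `CheckRun` and the three status values come from data-model.md; the field names, the `CheckRule` protocol, the `NotNullRule` example, and the choice to return UNKNOWN for an empty column are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Protocol, Sequence


class CheckStatus(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"  # Principle 1: uncertain results are not PASS


@dataclass(frozen=True)
class CheckRun:
    check: str
    target: str
    status: CheckStatus
    detail: str = ""


class CheckRule(Protocol):
    """Hypothetical plugin interface: one rule evaluates one target."""

    name: str

    def run(self, target: str, values: Sequence[Optional[object]]) -> CheckRun:
        ...


class NotNullRule:
    """A validity rule in the spirit of US-2: fail if any value is null."""

    name = "not_null"

    def run(self, target: str, values: Sequence[Optional[object]]) -> CheckRun:
        if len(values) == 0:
            # Assumption: with no rows to judge, the result is uncertain.
            return CheckRun(self.name, target, CheckStatus.UNKNOWN, "no rows")
        nulls = sum(1 for v in values if v is None)
        status = CheckStatus.FAIL if nulls else CheckStatus.PASS
        return CheckRun(self.name, target, status, f"{nulls} null value(s)")


rule: CheckRule = NotNullRule()
print(rule.run("orders.customer_id", [1, 2, None]).status)  # CheckStatus.FAIL
print(rule.run("orders.customer_id", []).status)            # CheckStatus.UNKNOWN
```

Because `UNKNOWN` is a first-class value rather than a missing one, a rule cannot quietly report success when it had nothing to judge.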

2.5 Breaking into tasks — `/speckit.tasks`

tasks broke the plan into a checkable, dependency-ordered task list.

# tasks.md (excerpt)
- [ ] T001 Project scaffolding / dependencies / lint & test setup
- [ ] T002 [P] data-model entities + migration
- [ ] T003 [P] check-rule plugin interface (from contracts)
- [ ] T004 Freshness check rule + fixtures (US-1)
- [ ] T005 Validity check rule (null/unique/range) + fixtures (US-2)
- [ ] T006 Anomaly check rule (28d/3σ) + fixtures (US-3)
- [ ] T007 Alert dispatcher + 60-min dedup
- [ ] T008 Scheduler integration + end-to-end smoke test

[P] marks parallelizable work, and each task tracks which user story (US-n) it satisfies. That traceability later turns "why does this code exist?" into the instant answer "T005 → US-2."
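Since tasks.md is plain text, that traceability is easy to extract mechanically. The sketch below parses lines in the format shown in the excerpt above; the regular expressions and the `Task` tuple are our own illustration, not something Spec Kit ships, and they assume the story reference appears as `(US-n)` in the task title.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as "- [ ] T002 [P] data-model entities + migration".
TASK_LINE = re.compile(
    r"^- \[(?P<done>[ xX])\] (?P<id>T\d+) (?P<parallel>\[P\] )?(?P<title>.+)$"
)
STORY_REF = re.compile(r"\((US-\d+)\)")


class Task(NamedTuple):
    id: str
    done: bool
    parallel: bool
    story: Optional[str]
    title: str


def parse_tasks(text: str) -> list[Task]:
    """Parse checkbox task lines; lines in any other shape are skipped."""
    tasks = []
    for line in text.splitlines():
        m = TASK_LINE.match(line.strip())
        if not m:
            continue
        story = STORY_REF.search(m["title"])
        tasks.append(Task(
            id=m["id"],
            done=m["done"] != " ",
            parallel=bool(m["parallel"]),
            story=story.group(1) if story else None,
            title=m["title"],
        ))
    return tasks


TASKS_MD = """\
- [ ] T002 [P] data-model entities + migration
- [ ] T005 Validity check rule (null/unique/range) + fixtures (US-2)
- [ ] T006 Anomaly check rule (28d/3σ) + fixtures (US-3)
"""

tasks = parse_tasks(TASKS_MD)
trace = {t.id: t.story for t in tasks if t.story}
print(trace)                                # {'T005': 'US-2', 'T006': 'US-3'}
print([t.id for t in tasks if t.parallel])  # ['T002']
```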

2.6 The checkpoint before merging — `/speckit.analyze`

Right before implementation, we cross-validated spec, plan, and tasks with analyze. What it actually caught:

The spec's US-3 (anomalies) required "hold judgment until at least 14 days accumulated," but T006 in tasks.md was missing that hold logic. → Split T006 and add a "cold-start guard" subtask.
plan named two channels, Slack and Webhook, but tasks had no Webhook work. → Make the channel abstraction explicit in T007.

Those two finds are analyze's payback. Defects that would have meant rework if discovered after implementation were headed off with a spec edit, before a line of code.
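The cold-start guard that analyze forced back into T006 might look like the following. It is a sketch under stated assumptions: the 28-day window, the 14-day minimum, and the 3σ rule come from the clarify decisions, while the function name, the use of the population standard deviation (`statistics.pstdev`), and the sample data are ours.

```python
from statistics import mean, pstdev

WINDOW_DAYS = 28       # from clarify: 28-day window
MIN_HISTORY_DAYS = 14  # from clarify: hold judgment below 14 days of data
SIGMA = 3.0            # from clarify: mean ± 3σ


def check_row_count(history: list[int], today: int) -> str:
    """history: daily row counts, oldest first, excluding today."""
    window = history[-WINDOW_DAYS:]
    if len(window) < MIN_HISTORY_DAYS:
        return "UNKNOWN"  # cold-start guard: not enough data to judge
    mu = mean(window)
    sd = pstdev(window)  # assumption: population standard deviation
    return "FAIL" if abs(today - mu) > SIGMA * sd else "PASS"


history = [1000, 1010, 990, 1005, 995] * 4  # 20 days of illustrative counts
print(check_row_count(history, 1004))       # PASS
print(check_row_count(history, 500))        # FAIL
print(check_row_count(history[:5], 500))    # UNKNOWN
```

Without the guard, the last call would have declared an anomaly off five days of data, which is the failure mode the spec explicitly ruled out.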

2.7 The actual code — `/speckit.implement`

Here real code appears for the first time. implement executed tasks.md top to bottom. The key point: this phase wasn't "write it all in one shot" — it proceeded task by task. When T002 finished, the migration ran; when T004 finished, the freshness-check fixture tests went green. Each task automatically enforced Principle 3 ("every rule ships with fixtures").

2.8 The final reconciliation — `/speckit.converge`

When implementation "seemed done," we ran converge. The key is that this isn't a simple "are we done?" check — it's the phase that assesses the codebase against the artifacts and recovers missing work back into tasks. What converge recovered:

Alert dedup was implemented, but the recovery (resolved) alert (the one-time recovery alert agreed in clarify) was missing → added as a new task.
The "check-result retention" policy mentioned in research.md wasn't in the code → turned into a task.

Without converge, those two would have stayed as debt for "someone to discover later."
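For reference, the dedup behavior with the recovered alert added could be sketched like this. The 60-minute window, the (check, target) key, and the single recovery alert come from clarify; the `AlertDeduper` class, its in-memory dict, re-alerting once the window has expired, and leaving an incident open on UNKNOWN are assumptions of this example.

```python
from datetime import datetime, timedelta
from typing import Optional

DEDUP_WINDOW = timedelta(minutes=60)  # from clarify


class AlertDeduper:
    """In-memory sketch; a real service would persist this state."""

    def __init__(self) -> None:
        # (check, target) -> time the last violation alert was sent
        self._open: dict[tuple[str, str], datetime] = {}

    def decide(self, check: str, target: str, status: str,
               now: datetime) -> Optional[str]:
        """Return "ALERT", "RECOVERED", or None (send nothing)."""
        key = (check, target)
        if status == "FAIL":
            last = self._open.get(key)
            if last is not None and now - last < DEDUP_WINDOW:
                return None  # same violation inside the dedup window
            self._open[key] = now
            return "ALERT"
        if status == "PASS" and key in self._open:
            del self._open[key]
            return "RECOVERED"  # sent once, when the incident resolves
        return None  # PASS with nothing open, or UNKNOWN (assumed: no change)


d = AlertDeduper()
t0 = datetime(2026, 6, 17, 9, 0)
print(d.decide("freshness", "orders", "FAIL", t0))                          # ALERT
print(d.decide("freshness", "orders", "FAIL", t0 + timedelta(minutes=10)))  # None
print(d.decide("freshness", "orders", "PASS", t0 + timedelta(minutes=20)))  # RECOVERED
print(d.decide("freshness", "orders", "PASS", t0 + timedelta(minutes=30)))  # None
```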

3. Honest numbers and trade-offs

Now the most sensitive question. So where did the time go? Did the upfront cost spent on the spec pay for itself? Again, the numbers below are illustrative values distilled from one build.

3.1 Time split by phase (illustrative)

Phase	Rough share	Character
constitution	5%	Written once, reused across the series
specify	12%	Human time organizing thought
clarify	13%	Time answering questions (looks high, but buys back later rework)
plan	15%	AI-driven, human review
tasks	8%	Nearly automatic
analyze	7%	Short but high recovery rate
implement	35%	The actual coding
converge	5%	Closing recovery

What stands out is that specify, clarify, and analyze combined took about a third of the time (12% + 13% + 7% = 32%). With vibe coding you could skip that whole third and go straight to implement. That is SDD's upfront cost. Honestly: on small work, that upfront cost is simply a loss.
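The arithmetic behind that "about a third" is easy to re-check against the table. The shares below are the illustrative values from section 3.1, copied as-is; only the variable names are ours.

```python
# Illustrative time shares from the table in section 3.1 (percent).
shares = {
    "constitution": 5, "specify": 12, "clarify": 13, "plan": 15,
    "tasks": 8, "analyze": 7, "implement": 35, "converge": 5,
}

total = sum(shares.values())
upfront = shares["specify"] + shares["clarify"] + shares["analyze"]

print(total)    # 100
print(upfront)  # 32
```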

3.2 What that upfront bought downstream

In exchange, that upfront cost was recovered three ways downstream.

Less rework: analyze caught 2 defects before any code was written and converge caught 2 more just after. Found at the integration-test stage instead, the same defects would likely have cost considerably more in debugging and re-implementation.
Review speed: because all code traces T00n → US-n, reviewers rarely had to round-trip on "why is this needed."
Low re-entry cost: come back days later and reading spec.md alone restores context (the structural cure for Part 1's "context loss").

3.3 Same feature, two ways

Item	Vibe coding	SDD
Time to first working	Fast	Slow (spec upfront)
Stability on the second request	Fragile	Stable (a spec baseline exists)
When defects are found	Integration/production (late)	At named gates: analyze, converge (early)
Reviewability	"By feel"	Comparable against spec
Re-entry days later	Context rebuild needed	Restored by reading spec.md
Team collaboration	Verbal agreement	File = contract
Throwaway script	Wins	Overhead only
Multi-touch production feature	Debt accumulates	Wins

The conclusion is simple. SDD has clear overhead, and that overhead only pays for itself when a "second request" is coming. If the second request never comes, SDD is overkill. That judgment is the heart of Section 5.

4. The pitfalls we hit (the most valuable part)

A failure story is worth more than a smooth success story. Here are the five places we actually fell down while building dq-monitor, and how we got back up — framed as lessons.

Pitfall 1 — Technology leaking into the spec

In the first specify draft we wrote US-1 as "poll the updated_at of the Postgres table to check freshness." See it? "How" (Postgres, polling updated_at) crept into the slot that should hold "what."

Symptom: when a spec presupposes a specific implementation, the plan phase loses design freedom, and changing technology later forces you to edit the spec too.
Fix: revert US-1 to "detect when table freshness exceeds the SLA," and push "how do we measure freshness (updated_at? metadata?)" down into clarify and plan.
Lesson: when you see proper nouns (products, libraries, column names) in a spec, be suspicious. Those are usually decisions that belong down in plan.
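That suspicion about proper nouns can be partly automated. The sketch below scans spec text for a watchlist of technology terms; the watchlist itself and the `find_leaks` helper are hypothetical, and a real team would maintain its own list and treat hits as prompts for review rather than hard failures.

```python
import re

# Hypothetical watchlist; a real team would maintain its own.
TECH_TERMS = ["Postgres", "Kafka", "APScheduler", "updated_at"]
PATTERN = re.compile("|".join(re.escape(t) for t in TECH_TERMS), re.IGNORECASE)


def find_leaks(spec_text: str) -> list[tuple[int, str]]:
    """Return (line number, matched term) for each technology term found."""
    leaks = []
    for lineno, line in enumerate(spec_text.splitlines(), start=1):
        for m in PATTERN.finditer(line):
            leaks.append((lineno, m.group(0)))
    return leaks


draft = """\
US-1: Poll the updated_at of the Postgres table to check freshness.
US-1 (revised): Detect when table freshness exceeds the SLA.
"""

print(find_leaks(draft))  # [(1, 'updated_at'), (1, 'Postgres')]
```

The first draft of US-1 trips the check twice; the revised wording passes clean.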

Pitfall 2 — Skipping clarify

Once, saying "this is obvious," we sent a small follow-up feature (adding an alert channel) straight to plan without clarify. The result: implement invented an alert-failure retry policy on its own (infinite retry → risk of an alert storm in production).

Symptom: "it's obvious" is a human illusion. Decisions self-evident to a human (retry count, backoff) are blanks to the AI, and the AI fills blanks plausibly.
Fix: lift the retry policy to a clarify item and codify it as "max 3 retries, exponential backoff, then dead-letter."
Lesson: clarify isn't "for complex features"; it's "for features where humans unconsciously assume decisions." That's nearly every feature.
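Once written down, the codified policy is small. Here is one way it might be implemented, as a sketch only: "max 3 retries" is read as one initial attempt plus three retries, the 1-second base delay is an assumption, and the dead-letter queue is just a list. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time
from typing import Any, Callable

MAX_RETRIES = 3           # from clarify
BASE_DELAY_SECONDS = 1.0  # assumption: the source does not fix a base delay


def send_with_retry(send: Callable[[Any], None], payload: Any,
                    dead_letter: list,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try once, retry up to MAX_RETRIES times, then dead-letter the payload."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            send(payload)
            return True
        except Exception:  # broad on purpose in this sketch
            if attempt == MAX_RETRIES:
                break
            sleep(BASE_DELAY_SECONDS * 2 ** attempt)  # 1s, 2s, 4s
    dead_letter.append(payload)
    return False


def always_fails(payload: Any) -> None:
    raise ConnectionError("channel down")


delays: list = []
dead: list = []
ok = send_with_retry(always_fails, {"check": "freshness"}, dead,
                     sleep=delays.append)
print(ok, delays, dead)  # False [1.0, 2.0, 4.0] [{'check': 'freshness'}]
```

The loop is bounded by construction, so the alert-storm scenario from the invented infinite retry cannot occur.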

Pitfall 3 — Tasks too coarse

The early tasks.md had a single task (T0xx) called "implement the check engine." When implement took it, it built freshness, validity, and anomaly as one lump, and the intermediate verification points vanished. When one part broke, tracing where it went wrong was hard.

Symptom: a coarse task = a big one-shot = unverifiable. The essence of SDD, "verify each step," collapses at the level of task granularity.
Fix: split that one task into freshness, validity, and anomaly tasks, each with its own fixtures (today's T004~T006). Make each task go green independently.
Lesson: a task's upper bound is "reviewable in one sitting." If one task touches multiple user stories, split it.

Pitfall 4 — Accepting implement output without converge

In dq-monitor's first cycle we accepted implement's "done" at face value. Running converge revealed that the two items in §2.8 (recovery alert, retention) were silently missing. "Code that looks like it works" and "code that satisfies the spec" are not the same.

Symptom: implement reports only what it did. It does not report what it didn't do. To fill that gap a human would have to re-read all the artifacts, and converge automates exactly that.
Fix: make converge a mandatory step right after implement. Take what converge recovers as new tasks and run one more cycle.
Lesson: implement's "done" is a proposal, not a declaration. It isn't finished until you reconcile against the spec with converge.

Pitfall 5 — A constitution nobody enforced

The constitution we set up so elegantly in Part 3 was barely cited by early implement. Even with a "performance: 5 minutes" principle, the first anomaly check was written as a row-by-row loop and was slow. A constitution that exists only as a file and isn't wired into the prompts is decoration.

Symptom: if the constitution doesn't actually enter implement's context, the AI doesn't know those principles.
Fix: in plan and tasks, state which constitution principle each task satisfies or risks (e.g., a note on T006: "Principle 4: vectorization required"). Have analyze also check for constitution violations via a checklist.
Lesson: a constitution must live as "cited at every step," not as "written down once." A principle that isn't enforced is not a principle.

5. When to switch SDD on vs when it's overkill

For this series to introduce the tool honestly, it must also say clearly when not to use it. Here we revisit Part 1's message in light of the case study.

Situation	SDD?	Why
Multi-touch production feature	Yes	The "second request" is certain. Spec cost is recovered.
Feature with ambiguous requirements	Yes	clarify surfaces humans' hidden assumptions.
Code a team works on together	Yes	The file becomes a contract, preventing verbal agreements from evaporating.
Days-to-weeks of work	Yes	On re-entry, spec.md restores context.
Throwaway script	No	No second request. The upfront cost is pure loss.
10-line bug fix	No	Reading the code beats reading a spec.
Earliest exploratory prototyping	No / partial	When you don't yet know what to build, a spec ties your hands.

The one core principle: quality gates are dials, not tolls. SDD doesn't mean switching on constitution, clarify, analyze, and checklist out of obligation every time. Done well, it is the judgment to switch gates on and off to match the weight of the work. For multi-touch, multi-person, ambiguous work like dq-monitor, turn all gates on; for a 10-line patch, turn them all off.
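As a toy illustration of the dial idea, the sketch below maps three properties of a piece of work to a set of gates. Only the two endpoints come from this post (all gates on for multi-touch, multi-person, ambiguous work; all off for a small patch). The in-between mapping, the `Work` dataclass, and the `gates_for` function are assumptions meant to show the shape of the decision, not a rule from Spec Kit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    multi_touch: bool     # will there be a "second request"?
    shared_by_team: bool  # do several people work on this code?
    ambiguous: bool       # are the requirements unclear?


def gates_for(work: Work) -> tuple[str, ...]:
    """One possible mapping from the weight of the work to gates."""
    gates = []
    if work.shared_by_team:
        gates.append("constitution")  # assumed: shared principles matter most here
    if work.ambiguous:
        gates.append("clarify")       # clarify surfaces hidden assumptions
    if work.multi_touch:
        gates += ["analyze", "checklist"]  # assumed: pays off on repeat edits
    return tuple(gates)


dq_monitor = Work(multi_touch=True, shared_by_team=True, ambiguous=True)
ten_line_fix = Work(multi_touch=False, shared_by_team=False, ambiguous=False)

print(gates_for(dq_monitor))    # ('constitution', 'clarify', 'analyze', 'checklist')
print(gates_for(ten_line_fix))  # ()
```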

6. An adoption checklist you can use this week

A practical checklist for anyone who wants to pilot SDD on their next single feature. Don't do it all at once — switch them on from the top.

7. Closing the series — the arc of seven parts

Let's rewind all seven installments into one arc.

Part	Topic	One-line summary	That moment in `dq-monitor`
Part 1	Why SDD	Three failure modes of vibe coding; spec = truth	Problem definition
Part 2	Getting started	`specify` install, init, Claude Code wiring	Empty-repo setup
Part 3	Constitution	Principles for a data team	Accuracy/observability/testing/performance principles
Part 4	Specify & Clarify	What & why + remove ambiguity	US-1~3, threshold & dedup decisions
Part 5	Plan & Tasks	Architecture, tasks, consistency	plan/data-model/contracts, tasks, analyze
Part 6	Implement & Converge	Implementation, issue linking, completion check	Codebase + recovered remainder
Part 7 (this post)	Practice & retrospective	0→1 case study, pitfalls, judgment	The full integrated retrospective

Once more, the core thesis

The single sentence running through the series stayed the same from start to finish.

The source of truth is the spec, not the code. That spec is the one contract humans and AI read together, verify together, and revise together.

The reason vibe coding collapses on the "second request" was never that the AI is dumb — it's that there was nowhere recording what is correct. SDD puts the spec in that empty slot. The entire dq-monitor journey was a proof of this one sentence. analyze catching defects, converge recovering the remainder, context restored days later by reading spec.md alone — all of it "because a comparison baseline called the spec existed."

Onward — scaling to teams (a teaser for what's next)

This series got as far as one developer building one feature with SDD. The real challenge is what comes next: a whole team running SDD in a consistent way. Spec Kit has extension mechanisms for exactly that.

Extensions: add new slash commands to slot in your team's own phases.
Presets: customize workflows to lock in defaults like "our team always switches these gates on."
Bundles: package extensions, presets, and workflows to provision teams by role — a data-engineer bundle, a front-end bundle, and so on.

And all of this is possible because of that fact from Part 1: all artifacts are plain-text files, so there's no agent lock-in (compatible with 30+ agents). Even if the team changes tools, the assets — spec, plan, tasks — come right along.

This "team-scale SDD operations" is a whole series in itself. We'll be back another time with the story of packaging a data team's standard SDD workflow into presets and bundles. Thank you for staying through all seven parts. Now — start your next feature with a single line of /speckit.specify.

References

GitHub Spec Kit repository

Spec Kit official docs

Spec Kit Quickstart

Diving Into Spec-Driven Development With GitHub Spec Kit (Microsoft for Developers)