[Spec Kit Part 5] Plan & Tasks — Technical Design and Task Breakdown
The spec pinned down the what and why. Now /speckit.plan designs the how (tech stack and architecture), /speckit.tasks turns it into a dependency-ordered task list, and /speckit.analyze cross-validates consistency across spec, plan, and tasks.
In Part 4 we wrote the spec for dq-monitor (a real-time data quality monitoring service) and filled the gaps with /speckit.clarify. Now spec.md clearly states what we are building and why, deliberately without any tech stack. But a spec alone produces no code. There is a wide river between the requirement "monitor freshness" and the implementation "a Kafka consumer polls lag every 5 seconds." This post is about the two bridges that cross that river — /speckit.plan (technical design) and /speckit.tasks (task breakdown) — and /speckit.analyze (the consistency gate) that checks the bridges are sound once you have crossed.
What you'll learn in this post
- How /speckit.plan reads spec.md + constitution.md to design the tech stack and architecture
- The supporting artifacts the plan phase emits: data-model.md, contracts/, research.md, quickstart.md
- How /speckit.tasks breaks the plan into a dependency-ordered task list (tasks.md)
- How /speckit.analyze is a cross-validation gate that catches contradictions, gaps, and untraceable work across spec, plan, and tasks
- Common pitfalls: planning before clarifying, tasks too coarse to verify, skipping analyze
This is Part 5 of the Spec Kit series. It assumes the spec was completed in Part 4, and the next Part 6, Implement & Converge, turns this task list into actual code.
1. Where Plan sits in the workflow — from "what" to "how"
Recall the Spec Kit flow: Constitution → Specify → Clarify → Plan → Tasks → Analyze → Implement → Converge. We now stand right in the middle, at the inflection point that crosses from the "what/why" territory into the "how" territory.
SDD (spec-driven development) draws this boundary deliberately.
| Spec territory | Plan territory |
|---|---|
| What and why (requirements, user stories) | How (tech stack, architecture) |
| Technology-neutral — mentions no DB, no language | Technology decisions — names PostgreSQL, Python, Kafka |
| Reviewable by non-developers | A design document reviewed by engineers |
| Changing it doesn't shake the whole implementation | Changing it cascades into tasks and code |
Why split at all? If you hard-code "use PostgreSQL" into the spec, then later when you decide "actually we have so much time-series data that TimescaleDB fits better," you have to tear up the requirements document too. Separating the what from the how lets you change the how freely while the what stays fixed. This is the structural reason SDD can treat the spec as the source of truth.
2. /speckit.plan — drawing out the technical design
/speckit.plan reads two inputs: spec.md (what/why) and constitution.md (the principles that run through the project). It then writes a technical implementation plan that satisfies both to specs/001-dq-monitor/plan.md.
This is where the tech stack first appears. The boxes intentionally left blank during the spec phase now get filled in.
2.1 A realistic plan prompt
A planning-phase prompt works best not as "design whatever you like" but as guardrails that state constraints and preferences explicitly. The less a decision is constrained, the more likely the agent is to fill it with a plausible but baseless choice.
/speckit.plan
dq-monitor is a backend service that monitors the freshness, consistency,
and anomalies of data pipelines. UI is out of scope this round; we build
only up to the REST API and alerting.
Tech constraints/preferences:
- Language: Python 3.12 (team standard, data-library ecosystem)
- Input: pipeline execution events arrive on a Kafka topic
- State/metadata: PostgreSQL (check definitions, run history, alert records)
- Alerting: support a single Slack Incoming Webhook first, extensible later
- Deployment: a single container image, external deps are only Kafka & PostgreSQL
- Per the constitution's "observability first" principle, expose
structured logging and metrics
Architecture must follow the constitution's module-boundary principle, and
the anomaly-detection approach should start simple/explainable (leave the
rationale in a research artifact).
The key part is the explicit invocation of the constitution. If the constitution from Part 3 holds principles like "observability first," "module boundaries," and "start with explainable simplicity," the plan must translate those principles into design decisions. The agent reads them automatically, but naming them once more in the prompt raises consistency.
2.2 What plan.md decides — the dq-monitor stack
The plan.md the agent produces carries decisions together with their rationale. A stack list without rationale becomes a myth no one can later change. For dq-monitor, decisions like these would emerge.
| Area | Choice | Rationale (summary) |
|---|---|---|
| Language/runtime | Python 3.12 | Team standard, data validation/stats library ecosystem, constitution's "favor team familiarity" |
| Event input | Kafka (confluent-kafka) | Pipeline events already flow on Kafka, at-least-once consumption is enough |
| State store | PostgreSQL 16 | The relational model for checks/history/alerts is natural and needs transactions |
| Check engine | In-process rule evaluator | No separate worker needed at early scale, isolated by module boundary so it can be extracted later |
| Anomaly detection | Rolling statistics (z-score / IQR) | Explainable, easy to debug, constitution's "start with explainable simplicity" |
| Alerting | Slack Incoming Webhook | Start with a single channel, abstracted behind a Notifier interface for extension |
| API | FastAPI + Uvicorn | Auto-generated OpenAPI keeps it in sync with contracts |
| Observability | Structured (JSON) logging + /metrics (Prometheus) | Constitution's "observability first" |
Record simplicity as a decision, not an excuse. Choosing "z-score instead of ML for anomaly detection" is not laziness; it is a decision. Leave the rationale (explainability, operational cost, lack of early data) in research.md, and six months later the question "why didn't you use ML?" is answered by a document, not by the code.
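To show how little code the "rolling statistics" choice implies, here is a minimal sketch of both detectors using only the Python standard library. The function names, the `k = 1.5` fence multiplier, the z-score threshold of 3.0, and the sample row counts are illustrative assumptions, not values taken from the dq-monitor plan.

```python
from statistics import mean, pstdev, quantiles


def is_anomaly_iqr(window: list[float], value: float, k: float = 1.5) -> bool:
    """Flag value if it falls outside [Q1 - k*IQR, Q3 + k*IQR] of the window."""
    q1, _, q3 = quantiles(window, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


def is_anomaly_zscore(window: list[float], value: float, threshold: float = 3.0) -> bool:
    """Flag value if it lies more than `threshold` standard deviations from the mean."""
    mu = mean(window)
    sigma = pstdev(window)
    if sigma == 0:
        # A constant window has no spread; any different value is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical rolling window of recent row counts for one pipeline.
recent_row_counts = [100, 102, 98, 101, 99, 103, 97, 100]

print(is_anomaly_iqr(recent_row_counts, 100))     # False
print(is_anomaly_iqr(recent_row_counts, 250))     # True
print(is_anomaly_zscore(recent_row_counts, 250))  # True
```

Both functions fit on one screen and can be explained in a sentence each, which is the property the "start with explainable simplicity" principle asks for.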
2.3 plan.md excerpt — architecture overview
The body of plan.md typically includes a component diagram and data flow like this.
## Architecture Overview
dq-monitor is a single service composed of 4 internal modules.
1. **Ingest** — Kafka consumer. Receives pipeline execution events,
normalizes them, and hands them to the Check Engine.
2. **Check Engine** — looks up Check definitions matching the event and
evaluates rules (freshness/consistency/anomaly) to produce CheckResults.
3. **Alert Router** — takes FAIL CheckResults and routes them per the alert
policy to a Notifier (currently Slack). Applies a dedup window.
4. **API** — REST layer exposing check CRUD and result/alert queries.
All modules share PostgreSQL as the state store, and inter-module calls go
only through explicit interfaces (constitution: module boundaries).
3. Supporting artifacts of the plan phase
/speckit.plan does not emit plan.md alone. It generates, under specs/001-dq-monitor/, the supporting artifacts that unfold the plan into a verifiable form. These artifacts are exactly the material that lets the next phase (tasks) extract work without guessing.
specs/001-dq-monitor/
├── spec.md # (Part 4) what/why
├── clarifications.md # (Part 4) clarification Q&A
├── plan.md # technical design — how
├── data-model.md # entities & schema
├── contracts/
│ └── api-spec.json # REST contract
├── research.md # technology investigation & trade-offs
└── quickstart.md # setup & validation procedure
3.1 data-model.md — entities and schema
The nouns of the spec ("pipeline," "check," "alert") become tables here for the first time.
## Entities
| Entity | Description | Key fields |
|---|---|---|
| Pipeline | A monitored data pipeline | id, name, owner, sla_minutes |
| Check | A quality rule attached to a pipeline | id, pipeline_id, type, params(jsonb), enabled |
| CheckResult | The result of one check evaluation | id, check_id, status, observed(jsonb), evaluated_at |
| Alert | An alert derived from a FAIL result | id, check_result_id, channel, sent_at, dedup_key |
- Check.type ∈ { freshness, consistency, anomaly }
- CheckResult.status ∈ { PASS, FAIL, ERROR }
- Alert.dedup_key = hash(check_id, day-bucket) — suppress to 1 alert/day per check
Pinning part of the schema as DDL removes ambiguity in the task phase.
CREATE TABLE check_result (
id BIGSERIAL PRIMARY KEY,
check_id BIGINT NOT NULL REFERENCES check_def(id),
status TEXT NOT NULL CHECK (status IN ('PASS','FAIL','ERROR')),
observed JSONB NOT NULL,
evaluated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_check_result_check_time ON check_result (check_id, evaluated_at DESC);
3.2 contracts/api-spec.json — the REST contract
We pin the shape of the API before the code. This contract later becomes the reference point for the test tasks (a contract violation = a failing test).
{
"openapi": "3.1.0",
"info": { "title": "dq-monitor API", "version": "0.1.0" },
"paths": {
"/checks": {
"post": {
"summary": "Create a check definition",
"requestBody": {
"required": true,
"content": {
"application/json": {
"schema": { "$ref": "#/components/schemas/CheckCreate" }
}
}
},
"responses": {
"201": { "description": "Created" },
"422": { "description": "Validation error" }
}
}
},
"/alerts": {
"get": {
"summary": "List alerts",
"parameters": [
{ "name": "since", "in": "query", "schema": { "type": "string", "format": "date-time" } },
{ "name": "status", "in": "query", "schema": { "type": "string", "enum": ["FAIL", "ERROR"] } }
],
"responses": { "200": { "description": "Array of alerts" } }
}
}
},
"components": {
"schemas": {
"CheckCreate": {
"type": "object",
"required": ["pipeline_id", "type", "params"],
"properties": {
"pipeline_id": { "type": "integer" },
"type": { "type": "string", "enum": ["freshness", "consistency", "anomaly"] },
"params": { "type": "object" }
}
}
}
}
}
3.3 research.md — technology investigation and trade-offs
This is where the rationale for the plan's "simple statistics-based anomaly detection" decision lives. The point is to record the alternatives and why they were rejected.
## Choosing the anomaly-detection approach
### Candidates
| Approach | Pros | Cons |
|---|---|---|
| Rolling z-score | Simple to implement/explain, works immediately | Sensitive to the distribution assumption (normality) |
| IQR (quartiles) | Robust to outliers, weak distribution assumption | Needs window-size tuning |
| ML (e.g., Isolation Forest) | Captures complex patterns | Burden of training data, ops cost, explainability |
### Decision
Default to **IQR-based** initially, with z-score offered as an option.
- Matches the constitution principle "start with explainable simplicity"
- Insufficient training data in the first operational phase
- Leave extension room by accepting `method` via Check.type=anomaly params
### Open questions
- Default window size (e.g., last 50) to be revisited with quickstart sample data.
3.4 quickstart.md — setup and validation procedure
This records how to bring the service up when it is built per this plan, and what confirms it works correctly. It also serves as a rehearsal of the acceptance criteria for when implementation is done.
## Quickstart
### 1. Start dependent services
docker compose up -d kafka postgres
### 2. Migrate & run the service
make migrate
make run # FastAPI(:8000) + Kafka consumer together
### 3. Smoke validation
# create a check
curl -X POST localhost:8000/checks -H 'Content-Type: application/json' -d '{"pipeline_id":1,"type":"freshness","params":{"sla_minutes":30}}'
# inject a stale event → FAIL → confirm a Slack alert arrives
make seed-stale-event
curl 'localhost:8000/alerts?status=FAIL' # the alert just raised should appear
4. /speckit.tasks — breaking the plan into tasks
Once the plan and supporting artifacts are in place, /speckit.tasks reads them and produces an actionable, ordered, dependency-aware task list as tasks.md.
A good task breakdown has two properties.
- Verifiable granularity. Each task must let you clearly judge "done or not." "Implement the check engine" is too big. "freshness rule evaluation function + unit test" is about right.
- Dependency-aware ordering. You can't build the API without the data model, nor route alerts without the check engine. Tasks must follow this causal order.
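Dependency-aware ordering is a topological sort over the task graph. The sketch below uses Python's standard `graphlib` module. The task IDs mirror this post's dq-monitor example, but the exact edges are an illustrative assumption; this is not how /speckit.tasks itself computes the order.

```python
from graphlib import TopologicalSorter

# Task -> set of prerequisite tasks (illustrative edges, assumed for this sketch).
DEPENDS_ON: dict[str, set[str]] = {
    "T001": set(),
    "T003": {"T001"},
    "T004": {"T003"},
    "T005": {"T004"},
    "T006": {"T004"},
    "T008": {"T005", "T006"},
    "T011": {"T008"},
}


def parallel_batches(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into ordered batches; tasks within one batch can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose prerequisites are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(DEPENDS_ON)
for number, batch in enumerate(batches):
    print(number, batch)
```

With these edges, T005 and T006 land in the same batch, which is the situation a parallel marker on a task list describes: their shared prerequisite is done and neither depends on the other.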
4.1 tasks.md excerpt — build order
The natural build order for dq-monitor is scaffolding → data model → check engine → input pipeline → alert routing → API → tests. [P] marks tasks that can run in parallel once their prerequisites are done.
# Tasks: dq-monitor (001)
## Phase 0 — Scaffolding
- [ ] T001 Project structure, deps, docker compose (Kafka/PostgreSQL) setup
- [ ] T002 Structured logging, config loader, /metrics endpoint skeleton (constitution: observability)
## Phase 1 — Data model (depends: T001)
- [ ] T003 SQL migrations from data-model.md (pipeline, check_def, check_result, alert)
- [ ] T004 ORM/repository layer + migration-applied test
## Phase 2 — Check engine (depends: T004)
- [ ] T005 [P] freshness rule evaluator + unit tests
- [ ] T006 [P] consistency rule evaluator + unit tests
- [ ] T007 [P] anomaly (IQR) rule evaluator + unit tests (per research.md)
- [ ] T008 CheckEngine dispatcher: event → matching check → persist CheckResult
## Phase 3 — Input pipeline (depends: T008)
- [ ] T009 Kafka consumer: receive/normalize events, call CheckEngine (at-least-once)
## Phase 4 — Alert routing (depends: T008)
- [ ] T010 Notifier interface + SlackWebhookNotifier implementation
- [ ] T011 AlertRouter: route FAIL results + dedup_key suppression
## Phase 5 — API (depends: T004, contracts/api-spec.json)
- [ ] T012 [P] POST /checks (CheckCreate validation → 201/422)
- [ ] T013 [P] GET /alerts (since/status filters)
## Phase 6 — Integration tests (depends: T009, T011, T012, T013)
- [ ] T014 End-to-end: inject stale event → FAIL → Slack alert (quickstart scenario)
- [ ] T015 Contract test: validate response schemas against api-spec.json
You can also unfold this as a GFM table; making traceability explicit eases the next phase (analyze).
| ID | Task | Depends | Traces to (requirement/artifact) |
|---|---|---|---|
| T005 | freshness evaluator | T004 | FR-2 (freshness monitoring), data-model |
| T007 | anomaly (IQR) evaluator | T004 | FR-4 (anomaly monitoring), research.md |
| T011 | AlertRouter + dedup | T008 | FR-6 (alerting), data-model: Alert.dedup_key |
| T015 | contract test | T012, T013 | contracts/api-spec.json |
Traceability is the point. Every task must be traceable back to some requirement or artifact. A task that traces nowhere is "work no one knows the reason for," and a requirement with no task is "a promise that won't be implemented." The analyze step in the next section catches exactly these two.
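The two failure modes reduce to two set operations. Here is a minimal sketch over hand-written data. The requirement IDs, the made-up task `T099`, and both function names are hypothetical, and traces to artifacts such as research.md are left out for brevity; a real check would parse spec.md and tasks.md instead.

```python
# Hypothetical inputs; in practice these would be extracted from spec.md and tasks.md.
REQUIREMENTS: set[str] = {"FR-2", "FR-4", "FR-6", "FR-7"}

TASK_TRACES: dict[str, set[str]] = {
    "T005": {"FR-2"},
    "T007": {"FR-4"},
    "T011": {"FR-6"},
    "T099": set(),  # made-up task that traces to nothing
}


def untraceable_tasks(task_traces: dict[str, set[str]]) -> list[str]:
    """Tasks that point at no requirement: work no one knows the reason for."""
    return sorted(task for task, reqs in task_traces.items() if not reqs)


def uncovered_requirements(
    requirements: set[str], task_traces: dict[str, set[str]]
) -> list[str]:
    """Requirements that no task points at: promises that won't be implemented."""
    covered = set().union(*task_traces.values())
    return sorted(requirements - covered)


print(untraceable_tasks(TASK_TRACES))                     # ['T099']
print(uncovered_requirements(REQUIREMENTS, TASK_TRACES))  # ['FR-7']
```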
5. /speckit.analyze — cross-validating consistency and coverage
The task list is out, but you must not jump to implementation yet. /speckit.analyze is a quality gate that cross-validates the three documents — spec, plan, and tasks. Always run it after tasks, before implement — discovering gaps after you have started writing code costs several times more.
analyze catches three representative problem types.
| Problem type | Meaning | Example |
|---|---|---|
| Coverage gap | No task corresponds to a requirement | The spec's "alert on ERROR too" requirement is in no task |
| Untraceable task | A task that traces back to no requirement | An "email alert" task appears that's in neither plan nor spec |
| Contradiction | Plan conflicts with spec (or tasks) | Spec says "1 alert/day," plan says "alert on every FAIL" |
5.1 Example analyze report
# Analyze Report — 001-dq-monitor
## ✅ Consistent (summary)
- FR-1/2/3/4 (pipeline/3 check types) → covered by T003–T008
- Constitution "observability first" → reflected in T002
## ⚠️ Issues found
| # | Severity | Type | Detail |
|---|---|---|---|
| 1 | HIGH | Coverage gap | No task corresponds to spec FR-7 "notify operators on CheckResult.status=ERROR too." AlertRouter (T011) routes only FAIL. |
| 2 | MED | Contradiction | Spec says "1 alert/day per check" (dedup=day-bucket), but plan.md's Alert Router section describes a "dedup window" of 5 minutes. The dedup definitions disagree. |
| 3 | LOW | Untraceable task | T013's GET /alerts has a `status` filter, but the spec states no alert-filter requirement (nice-to-have vs out-of-scope call needed). |
## Recommendations
- Issue 1: extend T011 to "route FAIL+ERROR" or add a new task.
- Issue 2: unify the dedup criterion across spec/plan/data-model (day-bucket recommended).
- Issue 3: add an alert-query requirement to the spec or drop the filter from the task.
Issue 2 shows the real value of analyze. data-model.md (dedup_key = day-bucket), plan.md (5-minute window), and spec (1 alert/day) were subtly out of step. analyze reveals in one pass the kind of inconsistency that is hard for a human to catch reading three documents back and forth. Had you gone into implementation without this step, you would have hit "wait, what was the criterion?" only after coding the entire dedup logic.
analyze does not fix. It reveals. How to resolve a found issue (which document to treat as truth) is the human's call. Usually you defer to the spec as the source of truth and bring the other documents back in line with it.
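To see why the dedup mismatch in Issue 2 matters, here is a small sketch that applies both definitions to the same pair of FAIL events. The function names, the SHA-256 hash, the UTC day boundary, and the timestamps are assumptions made for illustration; none of the three documents fixes them.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def day_bucket_key(check_id: int, fired_at: datetime) -> str:
    """dedup_key = hash(check_id, day-bucket): one key per check per UTC day (assumed)."""
    bucket = fired_at.astimezone(timezone.utc).date().isoformat()
    return hashlib.sha256(f"{check_id}:{bucket}".encode()).hexdigest()


def suppressed_by_day_bucket(check_id: int, previous: datetime, current: datetime) -> bool:
    """data-model.md / spec reading: a second alert on the same day is suppressed."""
    return day_bucket_key(check_id, previous) == day_bucket_key(check_id, current)


def suppressed_by_window(
    previous: datetime, current: datetime, window: timedelta = timedelta(minutes=5)
) -> bool:
    """plan.md reading: a second alert is suppressed only inside the dedup window."""
    return current - previous < window


# Two FAIL results for the same check, ten minutes apart.
first = datetime(2025, 1, 10, 9, 0, tzinfo=timezone.utc)
second = first + timedelta(minutes=10)

print(suppressed_by_day_bucket(42, first, second))  # True: still the same day
print(suppressed_by_window(first, second))          # False: the 5-minute window has passed
```

The same second event is silenced under one definition and sent under the other, so whichever document the implementer happened to read would decide how noisy the alert channel becomes.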
6. The whole flow at a glance
Here is how each phase so far turns its inputs into outputs, and where analyze acts as a gate.
The key is the direction of the arrows. Everything starts from spec.md + constitution.md and grows steadily more concrete, and analyze sends back upward the misalignments newly introduced during that concretization. Only after passing the gate do you descend to implementation.
7. Common pitfalls
Using a tool honestly means knowing its pitfalls too. Here are three we see repeatedly in the plan/tasks phase.
Pitfall 1 — Planning before clarifying
The most expensive mistake. If you run /speckit.plan with ambiguity still in the spec, the agent fills the blanks however it likes as it designs. The freshness threshold ("30 minutes or 24 hours?") is undecided, yet the plan fixes the architecture on an arbitrary value, tasks pile on top, and code is built on top of those. A tower built on a wrong assumption does more damage the taller it stands when it falls. This is exactly why Part 4's /speckit.clarify comes before plan.
Pitfall 2 — Tasks too coarse to verify
If tasks.md contains one-line monsters like "build the check engine" or "implement the API," those tasks are impossible to judge done. Tasks you can't judge done make progress lie ("we're 80% there" stays 80% forever), and analyze loses the unit at which to check traceability. The right task size is "one person finishes in half a day to a day, and can show it's done with a test."
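As a sense of scale, a task like "freshness rule evaluator + unit tests" might boil down to something this small. This is a sketch under assumptions: the function name, its signature, and the choice that an event exactly at the SLA boundary still passes are all illustrative, not taken from the dq-monitor plan.

```python
from datetime import datetime, timedelta, timezone


def evaluate_freshness(last_event_at: datetime, now: datetime, sla_minutes: int) -> str:
    """Return "PASS" if the latest event is within the SLA, otherwise "FAIL"."""
    age = now - last_event_at
    # Assumption: an event exactly sla_minutes old still counts as fresh.
    return "PASS" if age <= timedelta(minutes=sla_minutes) else "FAIL"


def test_evaluate_freshness() -> None:
    now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
    # 10 minutes old against a 30-minute SLA: fresh.
    assert evaluate_freshness(now - timedelta(minutes=10), now, sla_minutes=30) == "PASS"
    # 45 minutes old against a 30-minute SLA: stale.
    assert evaluate_freshness(now - timedelta(minutes=45), now, sla_minutes=30) == "FAIL"


test_evaluate_freshness()
```

When the test passes, the task is done; there is no "80%" state to report.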
Pitfall 3 — Skipping analyze
The temptation is strong: "tasks are out, let's just implement." But skip analyze and you discover problems like the dedup mismatch above in the middle of implementation. By then you already have code built on a wrong assumption, and you must fix both documents and code. analyze takes 30 seconds to a few minutes, but the rework it prevents takes hours. The gate is insurance, not a toll.
| Pitfall | Symptom | Remedy |
|---|---|---|
| Plan before clarify | plan gets baseless arbitrary values baked in | Clarify first, state constraints in the plan prompt |
| Coarse tasks | Progress lies, completion can't be judged | Size tasks so a test can prove the end |
| Skip analyze | Inter-document contradictions found mid-implementation | Make analyze a mandatory gate right after tasks |
Wrapping up
In this post we turned the spec for dq-monitor into an actionable blueprint and task list. /speckit.plan drew out the tech stack and architecture, and data-model.md, contracts/, research.md, and quickstart.md unfolded that design into a verifiable form. /speckit.tasks broke it into dependency-ordered tasks, and /speckit.analyze revealed the misalignments among the three documents before implementation.
We now hold a numbered, traceable, consistency-validated task list. No more agonizing over "where do I start coding." In the next Part 6, Implement & Converge, /speckit.implement turns these tasks into actual code in order, ties into GitHub issues, and /speckit.converge reconciles artifacts against the codebase to recover any missed work. That's the moment the spec finally becomes a running service.