[Spec Kit Part 4] Specify & Clarify — Writing the Spec and Removing Ambiguity

Write a spec that focuses purely on the what and why with /speckit.specify, fill the gaps with sequential questioning via /speckit.clarify, and gate quality with /speckit.checklist — all on the dq-monitor example.

Data Dynamics · June 14, 2026 · 19 min read

In Part 3 we established the constitution for dq-monitor (a real-time data quality monitoring service). Cross-cutting principles like code quality, testing, and observability are now baked into .specify/memory/constitution.md. Now it's time to write, on top of those principles, what this service must do and why it's needed. And this is exactly where the most common mistake begins — you're asked to write a spec, and you start by writing the tech stack. This post is about writing a spec properly, and then stripping ambiguity out of it before you enter planning.

What you'll learn in this post

  • The golden rule of /speckit.specify: focus on the what and why (requirements, user stories) and defer the how (tech stack), and why mixing them is harmful
  • A realistic spec (spec.md) for dq-monitor — user stories, functional requirements, and testable acceptance criteria
  • How /speckit.clarify's sequential, coverage-based questioning sharpens a vague requirement into a precise one
  • Building a custom checklist with /speckit.checklist to validate the quality of the spec itself
  • The specify → clarify → checklist loop feeding into planning, plus common pitfalls

This is Part 4 of the Spec Kit series. The constitution we set up in Part 3 acts as a sturdy guardrail for this phase.


1. The golden rule of /speckit.specify — focus only on the what and why

The mistake people make most often when writing a spec is to write down the solution already forming in their head. "Store metrics in PostgreSQL, stream events through Kafka, show them in a React dashboard" — that isn't a spec, it's a design. And writing the design at this stage breaks two things.

First, it closes the solution space too early. The moment you nail "stream events through Kafka" into the spec, the possibility that all you actually needed was to poll once every few dozen seconds never gets considered. If you commit to how before agreeing on what, simpler or more fitting alternatives vanish from view.

Second, it blurs the basis for verification. The whole point of a spec is to create a reference against which you can judge "does this code satisfy the intent?" But once "uses Kafka" enters the spec, that's an implementation, not an intent. Later you either get meaningless reviews ("you didn't use Kafka, so it violates the spec") or, worse, the real intent ("events must be processed within 5 seconds") goes missing from the spec entirely.

So Spec Kit's /speckit.specify demands just one thing.

Write only the What and the Why. Defer the How to the /speckit.plan phase.

| Belongs in the spec (What/Why) | Keep out of the spec (How) |
| --- | --- |
| User stories ("As a data engineer, I want to…") | Framework / library choices |
| Functional requirements (freshness checks, anomaly alerts…) | Database type / schema DDL |
| Non-functional requirements (latency budgets, availability targets) | API endpoint signatures / class structure |
| Testable acceptance criteria | Deployment topology / infrastructure config |
| Domain term definitions | Naming specific middleware (queues, caches, message buses) |

This discipline can feel restrictive. But recall the core insight of SDD: it doesn't generate in one shot. The spec phase is about fixing intent into a form a human can review, not about writing code. Technical decisions get their own phase (/speckit.plan) with the weight they deserve.


2. Running /speckit.specify

In the dq-monitor project — already set up with a constitution from Part 3 — call /speckit.specify in Claude Code and throw your intent at it in natural language. The key is to phrase the prompt in the language of what and why.

/speckit.specify I want to build a service that monitors, in real time, the
health of the data pipelines a data engineer operates. There are three core ideas.
(1) Each dataset's "freshness": detect when a dataset has gone longer than its expected cadence without an update.
(2) "Schema/integrity": detect structural anomalies like missing columns, type changes, or null-ratio spikes.
(3) "Anomaly alerting": when the signals above cross a threshold, send an alert to the right channel.
A data engineer must be able to configure per-dataset thresholds and notification
channels (e.g. email, Slack) themselves.
Don't pick a tech stack yet. Write the spec focusing only on the what and why.

When the command finishes, Spec Kit creates a per-feature artifact folder and generates the spec file inside it.

specs/
└── 001-dq-monitor/
    └── spec.md          # ← /speckit.specify output (created now)

The 001- prefix in the folder name is the feature number. A single repository can handle multiple features under SDD, so each feature gets its own numbered folder (002-..., 003-...). Within the same folder, later artifacts pile up: clarifications.md, plan.md, tasks.md, data-model.md, contracts/, research.md, quickstart.md.

The generated spec.md won't be perfect on the first pass. That's normal. specify produces a draft, and the gaps in that draft are filled by /speckit.clarify in the next section. First, let's see what that draft looks like.


3. An example spec for dq-monitor (spec.md)

Below is a realistic example of specs/001-dq-monitor/spec.md, generated from the prompt above and tidied up. Note that, by design, it contains not a single line of tech stack. Freshness, integrity, thresholds, and alerts are all described purely in the language of "what/why."

# Feature Specification: Real-Time Data Quality Monitoring (dq-monitor)
 
## 1. Overview
 
A service that continuously monitors the health of the datasets and pipelines a
data engineer operates, detects anomalous signals via thresholds, and sends
alerts to configured channels. The goal: "the data team knows about an incident
before its consumers do."
 
## 2. Glossary
 
- **Dataset**: A logical unit of data under monitoring (table, topic, file path, etc.).
- **Freshness**: Time elapsed since a dataset was last updated.
- **Expected Cadence**: The interval at which a dataset is expected to update (e.g. hourly, daily at 02:00).
- **Integrity**: The degree to which a dataset's schema and value distribution follow an agreed contract.
- **Channel**: A destination an alert is delivered to (e.g. email, messenger).
 
## 3. User Stories
 
- **US-1** As a data engineer, I want to monitor each dataset's last update time,
  so that I can respond before metrics go stale when a pipeline stalls.
- **US-2** As a data engineer, I want to detect schema changes (column add/drop/type
  change) and value anomalies (e.g. null-ratio spikes), so that I can stop
  breakage before it reaches downstream consumers.
- **US-3** As a data engineer, I want to set thresholds per dataset, because the
  normal range differs from one dataset to another.
- **US-4** As a data engineer, I want to be alerted through my chosen channels when
  an anomaly is detected, so that I don't have to keep watching a dashboard.
- **US-5** As an on-call engineer, I want repeated alerts for the same anomaly not
  to flood me, so that I can focus on genuinely new problems.
 
## 4. Functional Requirements
 
### FR-1 Freshness Checks
- FR-1.1 The system periodically collects each dataset's last update time.
- FR-1.2 If elapsed time since the last update exceeds the dataset's expected
  cadence plus a grace period, it is judged a "freshness anomaly."
- FR-1.3 Expected cadence and grace period must be configurable per dataset.
 
### FR-2 Schema & Integrity Checks
- FR-2.1 The system compares a dataset's current schema against an agreed baseline schema.
- FR-2.2 Column additions, removals, and type changes are detected as structural anomalies.
- FR-2.3 Distribution metrics for key columns (null ratio, distinct count, etc.) are
  collected, and exceeding configured limits is judged an integrity anomaly.
 
### FR-3 Anomaly Thresholds
- FR-3.1 Each check judges normal/anomalous according to the dataset's configured threshold.
- FR-3.2 Thresholds support both absolute values (e.g. null ratio above 5%) and
  relative values (e.g. 3x the trailing 7-day average).
- FR-3.3 When no threshold is set, a system default applies, and the default is documented.
 
### FR-4 Alert Routing
- FR-4.1 When an anomaly is judged, an alert is sent to the channels configured for that dataset.
- FR-4.2 An alert includes the dataset, anomaly type, detection time, observed value, and threshold.
- FR-4.3 Alerts for the same dataset and same anomaly type are deduplicated within a
  configured time window and sent only once.
- FR-4.4 When an anomaly clears, a "resolved" alert is sent.
 
### FR-5 Configuration
- FR-5.1 A data engineer can configure the dataset list and, per dataset, the checks,
  thresholds, and channels.
- FR-5.2 Configuration changes must take effect without restarting the service.
- FR-5.3 Invalid configuration (e.g. referencing a non-existent channel) is rejected
  before being applied, with a reason reported.
 
## 5. Non-Functional Requirements
 
- NFR-1 Latency from anomaly detection to alert dispatch is within the check interval + 1 minute.
- NFR-2 When monitoring 100 datasets concurrently, one round of checks completes within 5 minutes.
- NFR-3 On alert dispatch failure, retry at least 3 times; if it still fails, record it.
- NFR-4 Every judgment must be traceable: which input led to which conclusion.
 
## 6. Acceptance Criteria
 
- AC-1 (US-1) For a dataset with a 1-hour expected cadence that goes 90 minutes without
  an update, under a 10-minute grace setting, exactly 1 freshness anomaly is judged.
- AC-2 (US-2) When a column not in the baseline schema is added, a structural anomaly is
  judged and the alert includes the added column name.
- AC-3 (US-3) For the same observed null ratio of 10%, dataset A (threshold 5%) is judged
  anomalous and dataset B (threshold 20%) is judged normal.
- AC-4 (US-4) On an anomaly judgment, alerts are sent to all channels configured for that
  dataset, and not to unconfigured channels.
- AC-5 (US-5) Even if the same anomaly is detected 5 times within the dedupe window
  (e.g. 30 minutes), the alert is sent only once.
- AC-6 (FR-4.4) When an anomalous state returns to normal, a resolved alert is sent once.
 
## 7. Out of Scope
 
- Automatic root-cause analysis of anomalies.
- Automatic correction or backfill of the data itself.
- Retroactive monitoring of historical data (the initial version starts from "now").
 
## 8. Open Questions  ← resolved in the clarify phase
 
- Q1 Is "expected cadence" set manually per dataset, or inferred from past update history?
- Q2 What is the default value for the alert dedupe window?
- Q3 What is the range of supported channels (email only? messenger included)?
- Q4 Is a check failure (data source unreachable) an "anomaly" or a separate "error"?

Two things to notice in this spec.

  1. The acceptance criteria are testable. Instead of a vague sentence like "detects freshness well," it states inputs and expected outputs: "1-hour cadence, 10-minute grace, 90 minutes without update → 1 anomaly." This translates directly into tests later in /speckit.tasks.
  2. There's an Open Questions section at the end. specify doesn't pretend to know what it doesn't. It honestly leaves the gaps, and those gaps get filled by /speckit.clarify in the next section.
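
To see how directly AC-1 turns into a test, here is a minimal Python sketch. The `Dataset` class and the `freshness_anomalies` function are hypothetical names invented for illustration (Spec Kit does not generate them), and the strict "greater than" comparison is one reading of the word "exceeds" in FR-1.2.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Dataset:
    """Hypothetical shape of a monitored dataset; not part of the spec."""
    name: str
    last_update: datetime
    cadence: timedelta  # expected cadence (FR-1.3)
    grace: timedelta    # grace period (FR-1.3)


def freshness_anomalies(datasets: list[Dataset], now: datetime) -> list[str]:
    """FR-1.2: a dataset is anomalous when elapsed time exceeds cadence + grace."""
    return [d.name for d in datasets if now - d.last_update > d.cadence + d.grace]


# AC-1: 1-hour cadence, 10-minute grace, 90 minutes without an update.
now = datetime(2026, 6, 14, 12, 0)
stale = Dataset("orders", now - timedelta(minutes=90),
                timedelta(hours=1), timedelta(minutes=10))
# Boundary case: exactly 70 minutes elapsed does not "exceed" the budget.
on_time = Dataset("users", now - timedelta(minutes=70),
                  timedelta(hours=1), timedelta(minutes=10))

assert freshness_anomalies([stale, on_time], now) == ["orders"]
```

Because the criterion names its inputs and its expected output, the test needs no interpretation: the numbers in the assertion are the numbers in the spec.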

4. /speckit.clarify — strip ambiguity before planning

A spec draft almost always has gaps: assumptions a human deemed "obvious" and left out, sentences that read two ways, thresholds missing their numbers. Carry those gaps into /speckit.plan and the AI will fill them in however it likes during planning. Those improvised decisions then flow all the way into code, and nobody can account for them later when someone asks "wait, who decided it should work like this?"

/speckit.clarify solves this before you enter planning. Here's how it works.

  • Sequential: It doesn't drop a bomb of questions at once; it takes one answer at a time and adjusts the next question.
  • Coverage-based: It sweeps the spec by area (data model, error handling, edge conditions, non-functional requirements, etc.) and interrogates the least-specified areas first.
  • Recorded: Your answers are written as decisions into specs/001-dq-monitor/clarifications.md, becoming the official record that later phases reference.
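
As a mental model only, the "sequential" and "coverage-based" properties can be sketched in a few lines of Python. This is a toy, not Spec Kit's implementation: the numeric coverage scores, the area names in the dictionary, and the `next_area` function are all assumptions made up for the illustration.

```python
from typing import Optional


def next_area(coverage: dict[str, float]) -> Optional[str]:
    """Pick the least-specified area that is not yet fully covered."""
    open_areas = {area: score for area, score in coverage.items() if score < 1.0}
    if not open_areas:
        return None
    return min(open_areas, key=open_areas.get)


# Hypothetical coverage scores per spec area (1.0 = nothing left to ask).
coverage = {
    "data model": 0.8,
    "error handling": 0.2,
    "edge conditions": 0.5,
    "non-functional": 1.0,
}

asked = []
while (area := next_area(coverage)) is not None:
    asked.append(area)     # one question at a time
    coverage[area] = 1.0   # pretend a single answer closes the area

assert asked == ["error handling", "edge conditions", "data model"]
```

The point of the loop is the re-ranking after each answer: the next question depends on what is still least specified, not on a list fixed up front.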

Example Q&A flow

/speckit.clarify
 
[Clarify] I've reviewed the spec. I'll question the ambiguous or under-specified areas in order.
 
Q1 (area: Configuration / FR-1.3) Is "expected cadence" set explicitly per dataset
   by the data engineer, or auto-inferred from past update history?
   - (A) Always manual
   - (B) Auto-inferred, manual override allowed
   - (C) Both supported, default is manual
 
> B
 
  ✓ Recorded: Expected cadence is inferred from past update history, with a per-dataset
    manual override allowed. Inference requires at least N prior updates; if history is
    insufficient, manual configuration is required. (FR-1.3 updated)
 
Q2 (area: Alerts / FR-4.3) What is the default value for the alert dedupe window?
   And can this value be overridden per dataset?
 
> Default 30 minutes, overridable per dataset
 
  ✓ Recorded: dedupe default window = 30 minutes, per-dataset override allowed. (FR-4.3 updated)
 
Q3 (area: Alert channels / FR-4.1) What range of channels will the initial version support?
   - (A) Email only
   - (B) Email + messenger (webhook)
   - (C) Arbitrary webhook (generic)
 
> B
 
  ✓ Recorded: Initial channels are email and messenger webhook. Generic webhooks are out of scope. (FR-4.1 updated)
 
Q4 (area: Edge conditions) If accessing the data source itself fails (e.g. permission,
   network), is that treated as an "integrity anomaly" or a separate "check error"?
 
> A separate check error. Keep it distinct from anomalies so it can route to a different channel.
 
  ✓ Recorded: A check failure is treated as a "check error" category, distinct from data
    anomalies. Error alerts can be routed separately from anomaly alerts. (new FR-6 added)
 
[Clarify] 4 key ambiguities resolved. Recorded them in clarifications.md and updated the
Open Questions in spec.md.

Vague → precise: before / after

Let's compress clarify's effect into a single requirement.

| | Content |
| --- | --- |
| Before (vague) | FR-1.3 "Expected cadence and grace period must be configurable per dataset." It is unclear who sets it, whether inference exists, or what happens with no history. |
| After (precise) | FR-1.3 "Expected cadence is auto-inferred from at least N prior updates, with a per-dataset manual override allowed. If history is below N, no inference is made and manual configuration is required. Grace period is always manual, defaulting to 0." |

The After translates straight into data-model and tests. The N parameter the inference needs, the override field, the "insufficient history" branch — the planning phase doesn't have to guess any of it. One sentence answered in clarify pre-pins ten decisions that would otherwise wobble in plan.
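
To make that concrete, here is a rough Python sketch of what the precise FR-1.3 pins down. Everything named here is hypothetical: `CadenceConfig`, `expected_cadence`, and the value 5 for N are placeholders, and using the median gap between updates is an assumed inference method, since the spec only says the cadence is "inferred."

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional, Sequence

MIN_HISTORY = 5  # placeholder for "N"; the spec leaves the actual value open


@dataclass(frozen=True)
class CadenceConfig:
    manual_override: Optional[timedelta] = None  # the override field
    grace: timedelta = timedelta(0)              # always manual, defaults to 0


def expected_cadence(update_times: Sequence[datetime], config: CadenceConfig,
                     min_history: int = MIN_HISTORY) -> timedelta:
    """Resolve the expected cadence: override first, then inference."""
    if config.manual_override is not None:
        return config.manual_override
    # The "insufficient history" branch: no inference, manual config required.
    if len(update_times) < max(min_history, 2):
        raise ValueError("insufficient history: manual configuration required")
    ordered = sorted(update_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps)  # assumed inference method: median gap between updates


base = datetime(2026, 6, 14, 0, 0)
hourly = [base + timedelta(hours=i) for i in range(6)]
assert expected_cadence(hourly, CadenceConfig()) == timedelta(hours=1)
```

Each branch in the function maps to a clause of the After sentence, which is why the planning phase has nothing left to guess.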

Skipping clarify doesn't make the cost disappear — it defers it to the planning phase. The AI improvises to fill the gaps in plan, that improvisation propagates into tasks and code, and the bill arrives at the most expensive place of all (post-implementation review).


5. /speckit.checklist — gate the quality of the spec itself

If clarify is the phase that "fills what's missing," /speckit.checklist is the phase that "checks whether what's filled is enough." The command reads the spec and the clarification results and generates a custom checklist that validates requirements and clarity. The key is that this checklist inspects the spec, not the code — it's a quality gate asking "is this spec solid enough to hand to planning?"

/speckit.checklist Generate a quality checklist that validates whether this spec
is ready to move into production planning. Focus on requirement completeness,
testability, and the absence of ambiguity.

Example generated checklist (saved under specs/001-dq-monitor/):

# Spec Quality Checklist: dq-monitor (001)
 
## A. Requirement Completeness
- [ ] Every user story has a corresponding functional requirement
- [ ] All 5 core areas — freshness, integrity, anomaly, alerting, configuration — are covered
- [ ] Behavior for check failure (data source unreachable) is defined
- [ ] Resolved-alert behavior is defined
 
## B. Clarity / Ambiguity Removal
- [ ] The way "expected cadence" is decided (manual/auto/hybrid) is specified
- [ ] Every threshold has a unit and a default value
- [ ] The dedupe window's default and per-dataset overridability are specified
- [ ] The range of supported alert channels is closed (no open-ended "etc.")
- [ ] All Open Questions are resolved and the section is empty
 
## C. Testability
- [ ] Every acceptance criterion is stated as inputs and expected outputs
- [ ] Acceptance criteria contain measurable numbers (time, ratio, count)
- [ ] Expected behavior is defined for edge conditions (insufficient history, no threshold set)
 
## D. Non-Functional / Operational
- [ ] The detection→alert latency budget is defined numerically
- [ ] A performance target at the 100-dataset scale exists
- [ ] An alert-failure retry policy is defined
- [ ] The constitution's observability principle (traceability) is reflected as an NFR
 
## E. Scope
- [ ] Out-of-scope items are stated to block over-implementation
- [ ] No out-of-scope item contradicts any user story

Only when every item is checked does the spec earn the right to move into planning. Any unchecked item? That's a signal to go back to specify or clarify.

Constitution · clarify · checklist as one quality gate

The three tools aren't independent — they're a single set of gates you turn on for ambiguous or production-grade work.

| Gate | What it guarantees | When to turn it on |
| --- | --- | --- |
| Constitution (Part 3) | Every phase follows the same principles (quality, testing, observability) | Team / long-running projects |
| clarify | No improvisable gaps remain in the spec | When the requirements have ambiguity |
| checklist | The spec is solid enough to hand to planning | Features headed for production |

As Part 1 stressed, these gates are optional quality gates, not mandatory tolls. Turn them off for a weekend prototype; turn them on for a production feature that lives for years. Judging when to open and close the gates by the weight of the work is precisely the skill of using SDD well.


6. The specify → clarify → checklist loop

Before moving on to planning, the three commands form a single loop that hardens the spec.

The loop's exit condition is clear. Only when every checklist item is checked and Open Questions is empty do you move on to plan. Until then, you go back and forth between specify ↔ clarify, refining the spec. This round-tripping may look tedious, but each piece of ambiguity you shave off here defuses a bomb that would otherwise detonate during planning and implementation.
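
If you want to check the exit condition mechanically, a small script is enough. The sketch below is an assumption-laden helper, not a Spec Kit feature: it assumes the checklist uses Markdown task-list syntax (`- [ ]` / `- [x]`) and that spec.md keeps its open questions as bullets under a heading containing "Open Questions," as in the example spec. The function names are made up.

```python
import re


def unchecked_items(checklist_md: str) -> list[str]:
    """Return the text of every unchecked '- [ ]' item."""
    return re.findall(r"^[ \t]*- \[ \] (.+)$", checklist_md, flags=re.MULTILINE)


def open_questions(spec_md: str) -> list[str]:
    """Return bullet items listed under the 'Open Questions' heading."""
    items, in_section = [], False
    for line in spec_md.splitlines():
        if line.startswith("## "):
            in_section = "Open Questions" in line
        elif in_section and line.strip().startswith("- "):
            items.append(line.strip()[2:])
    return items


def ready_for_plan(checklist_md: str, spec_md: str) -> bool:
    """Exit condition: no unchecked items and no open questions."""
    return not unchecked_items(checklist_md) and not open_questions(spec_md)


checklist = "- [x] Every user story has a requirement\n- [ ] Every threshold has a unit\n"
spec = "## 7. Out of Scope\n- Backfill\n\n## 8. Open Questions\n- Q2 Default dedupe window?\n"

assert unchecked_items(checklist) == ["Every threshold has a unit"]
assert open_questions(spec) == ["Q2 Default dedupe window?"]
assert ready_for_plan(checklist, spec) is False
```

A check like this could run before /speckit.plan as a reminder, though the judgment of whether an item is truly satisfied still belongs to a human reviewer.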


7. Common pitfalls

Three failures repeatedly observed in the spec phase, with prescriptions.

Pitfall 1 — Leaking tech decisions into the spec

The most common. Sentences like "cache thresholds in Redis" or "stream events through a Kafka topic" sneak into the spec. This violates the golden rule of section 1 head-on.

  • Symptom: Product names, frameworks, databases, or middleware appear in the spec.
  • Prescription: Re-translate that sentence into "what is needed." "Stream through Kafka" → "events must be processed within 5 seconds of occurring." Move the tech decision to plan.
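
The symptom is easy to scan for. Here is a minimal Python sketch of such a lint, assuming a hand-maintained denylist; the `TECH_TERMS` tuple and the `leaked_tech_terms` function are hypothetical, and a plain word match like this will produce false positives and misses, so treat a hit as a prompt to re-read the sentence.

```python
import re

# Hypothetical denylist; extend it with whatever your team tends to leak.
TECH_TERMS = ("Kafka", "Redis", "PostgreSQL")

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in TECH_TERMS) + r")\b",
    re.IGNORECASE,
)


def leaked_tech_terms(spec_md: str) -> list[tuple[int, str]]:
    """Return (line number, matched term) for each denylisted term in the spec."""
    hits = []
    for lineno, line in enumerate(spec_md.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


spec = (
    "FR-9 Cache thresholds in Redis.\n"
    "FR-10 Events must be processed within 5 seconds of occurring.\n"
)
assert leaked_tech_terms(spec) == [(1, "Redis")]
```

The second requirement passes because it states what is needed; the first is flagged because it names how.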

Pitfall 2 — Skipping clarify and paying for it in planning

The temptation of "the spec is roughly done, let's go straight to plan." But as we saw in section 4, unfilled gaps don't disappear — they're deferred to the planning phase. The AI improvises to fill them, and that improvisation propagates into tasks and code.

  • Symptom: Decisions that appear nowhere in the spec show up in the plan/tasks output ("wait, who decided this retry policy?").
  • Prescription: If the work has meaningful ambiguity, gate it with clarify. The cost of clarify is almost always lower than the cost of rework after plan.

Pitfall 3 — Untestable acceptance criteria

Criteria like "detects freshness well" or "alerts go out appropriately" can't be verified, because there's nothing to decide pass/fail by.

  • Symptom: Acceptance criteria carry only adjectives like "well," "appropriately," "fast," with no numbers, inputs, or expected outputs.
  • Prescription: Rewrite every acceptance criterion as (given input) → (expected output). "Fast" → "within the check interval + 1 minute." Only in this form does it translate directly into a test in /speckit.tasks.
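
The symptom can be checked with a simple word scan. The sketch below is a heuristic under stated assumptions: the `VAGUE_WORDS` list only contains the adjectives named above, and `vague_words_in` is a made-up helper. An empty result does not prove a criterion is testable; it only shows that none of the known vague words are present.

```python
import re

# The adjectives called out in the symptom; add your own as you meet them.
VAGUE_WORDS = ("well", "appropriately", "fast")


def vague_words_in(criterion: str) -> list[str]:
    """Return the vague words found in one acceptance criterion, sorted."""
    words = set(re.findall(r"[a-z]+", criterion.lower()))
    return sorted(words & set(VAGUE_WORDS))


assert vague_words_in("Detects freshness well") == ["well"]
assert vague_words_in("Alerts go out appropriately") == ["appropriately"]
assert vague_words_in("Alert is sent within the check interval + 1 minute") == []
```

Running this over the Acceptance Criteria section before /speckit.checklist is a cheap first pass; the checklist then covers what a word scan cannot.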

| Anti-pattern | Rewrite |
| --- | --- |
| "Detects anomalies well" | "With threshold 5% and observed 7%, exactly 1 anomaly is judged" |
| "Alerts are sent appropriately" | "On a judgment, sent to all configured channels, not to unconfigured ones" |
| "Performance is sufficient" | "One round of checks over 100 datasets completes within 5 minutes" |

Wrapping up

The job of the spec phase is not to write code but to fix the what and why into a form a human can review. You write a tech-stack-free draft with /speckit.specify, fill the gaps before planning with /speckit.clarify, and gate the spec's solidity with /speckit.checklist. A spec that has passed these three phases becomes a firm contract the next phase can work against without guessing.

In Part 5 we finally move on to the How. We'll design dq-monitor's architecture and tech stack with /speckit.plan, break that plan into executable, dependency-ordered work with /speckit.tasks, and then cross-check that spec, plan, and tasks don't contradict each other with /speckit.analyze. The spec we shaved sharp in Part 4 is what makes all of that shine.
