Understanding Prometheus Metrics — metric name, labels, instance/job, and gauge vs counter
A step-by-step guide to why Prometheus metrics feel confusing. We break down the anatomy of a single metric line, naming rules, labels and cardinality, where instance and job actually come from, the difference between counters and gauges, and how to set values from a client library — all with code.
When you first work with Prometheus, you tend to get stuck in the same places. "Why does this metric name end in _total?", "Where did the instance and job labels come from — I never created them?", "What's the difference between a counter and a gauge, and how do I set the value?"
This post untangles that confusion from the ground up. We start with what a single metric line looks like, then cover naming and label rules, the true origin of instance/job, the meaning of each metric type, and finally how to set values with real code.
1. The anatomy of a metric line
The data Prometheus collects is, in the end, a line of text. An exporter exposes the format below (the exposition format) on a /metrics endpoint, and Prometheus periodically scrapes it.
# HELP http_requests_total Cumulative number of HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="POST",code="200"} 1027
http_requests_total{method="POST",code="400"} 3
Breaking down the first data line:
http_requests_total {method="POST",code="200"} 1027 [timestamp]
└── metric name ──┘ └──────── labels ────────┘ value (optional, usually omitted)
- metric name: what you measure (http_requests_total)
- labels: the dimensions you slice by (method, code)
- value: the current number (float64). The timestamp is usually omitted, and Prometheus stamps the scrape time.
# HELP is a human-readable description, # TYPE declares the metric type. Both appear once per metric.
Key point: a single metric_name{label="value"} value line is one time series. Change the label combination and you get an entirely separate time series; that fact is the starting point for the cardinality discussion below.
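To make the "one line, one time series" idea concrete, here is a minimal sketch that splits a sample line into its parts and builds a series identity from the name plus the full label set. The names parse_sample and series_key are hypothetical helpers, and the regular expressions are deliberately simplified (no escaped quotes or braces inside label values), so treat this as an illustration rather than a real exposition-format parser.

```python
import re

# Simplified shape of a sample line: name, optional {labels}, value, optional timestamp.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (name, labels, value)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


def series_key(name, labels):
    """The metric name plus the complete label set identifies one time series."""
    return (name, tuple(sorted(labels.items())))


lines = [
    'http_requests_total{method="POST",code="200"} 1027',
    'http_requests_total{method="POST",code="400"} 3',
]
keys = {series_key(*parse_sample(line)[:2]) for line in lines}
print(len(keys))  # 2: same metric name, two different label sets
```

Both lines share a metric name, yet they produce two distinct keys because the code label differs.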
2. metric name — naming rules
You can name metrics freely, but following the conventions makes your PromQL queries and dashboards far cleaner.
- snake_case, ASCII letters/digits/underscores only. (Don't use the colon : in your own metrics; it's reserved for recording rules.)
- Use a namespace (prefix): put the application/subsystem up front, like http_requests_total, process_cpu_seconds_total, node_memory_MemFree_bytes.
- Use base units. Seconds, not milliseconds (_seconds); bytes, not megabytes (_bytes); ratios as 0–1. The unit goes at the end of the name (before _total on counters).
- Counters take the _total suffix, e.g. app_errors_total.
- _count, _sum, _bucket are reserved suffixes that histograms/summaries generate automatically; don't use them yourself.
The goal is that the name alone reveals "what, in which unit, of which type." request_latency_seconds is good; latencyMs is bad.
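The conventions above are mechanical enough to check in code. The sketch below is a hypothetical lint helper (lint_metric_name is not part of any Prometheus library), and its list of non-base unit suffixes is an assumption chosen for illustration, not an exhaustive rule set.

```python
import re

# Letters, digits, underscores only. The colon is left out on purpose:
# it is reserved for recording rules.
VALID_CHARS = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")
# Suffixes that histograms and summaries generate on their own.
RESERVED_SUFFIXES = ("_count", "_sum", "_bucket")
# Illustrative (not exhaustive) list of non-base unit suffixes.
NON_BASE_UNIT = re.compile(r"(_ms|Ms|_millis|_milliseconds|_kb|_mb|_megabytes|_percent)$")


def lint_metric_name(name, metric_type):
    """Return a list of convention problems for a metric name (empty if none found)."""
    problems = []
    if not VALID_CHARS.match(name):
        problems.append("use only ASCII letters, digits, and underscores")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter names should end in _total")
    if name.endswith(RESERVED_SUFFIXES):
        problems.append("suffix is reserved for histograms/summaries")
    if NON_BASE_UNIT.search(name):
        problems.append("use base units (seconds, bytes, 0-1 ratios)")
    return problems


print(lint_metric_name("request_latency_seconds", "histogram"))  # []
print(lint_metric_name("latencyMs", "gauge"))
print(lint_metric_name("app_errors", "counter"))
```

A check like this could run in a unit test over every metric your service registers, so naming drift is caught before it reaches dashboards.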
3. labels — dimensions, and the cardinality trap
Labels are key-value pairs that slice the same metric into multiple dimensions. You split one http_requests_total by method, code, endpoint, and so on.
http_requests_total{method="GET", endpoint="/api/users", code="200"} 8123
http_requests_total{method="POST", endpoint="/api/users", code="201"} 412
http_requests_total{method="GET", endpoint="/api/orders",code="500"} 7
The most common accident here is cardinality explosion. Since every combination of label values is a separate time series, attaching a label whose set of values grows without bound will blow up Prometheus memory.
- Bad labels: user_id, email, request_id, full_url (with query string), timestamps. These have effectively infinite values.
- Good labels: method (GET/POST/…), status_code, region, endpoint (a templated path). These have a small, finite set of values.
Rule of thumb: design so that the total number of label combinations for one metric stays under a few thousand. Always ask, "will this value still be finite a few days from now?"
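The rule of thumb is easy to sanity-check with arithmetic: the worst-case number of series for one metric is the product of the number of distinct values of each label. The sketch below uses made-up label values (the route names and the 100,000 users are assumptions for illustration) to show how a single unbounded label changes the picture.

```python
from math import prod


def series_count(label_values):
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "endpoint": [f"/api/route{i}" for i in range(25)],  # hypothetical templated paths
    "code": ["200", "201", "400", "404", "500"],
}
print(series_count(bounded))  # 4 * 25 * 5 = 500

# Add one unbounded label and the same metric multiplies by its value count.
unbounded = dict(bounded, user_id=[str(i) for i in range(100_000)])
print(series_count(unbounded))  # 500 * 100_000 = 50000000
```

In practice not every combination occurs, so the real count is usually lower than the product, but the product is the number to budget against.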
For reference, labels starting with __ (__name__, __address__, etc.) are internal reserved labels in Prometheus. In fact, the metric name itself is internally the __name__ label.
4. Where instance and job come from (the most confusing part)
The metrics an exporter exposes have neither instance nor job. Yet when you query in Prometheus you see this:
http_requests_total{job="my-app", instance="10.0.0.1:8000", method="GET", code="200"} 8123
job and instance are "target labels" that Prometheus, not the exporter, attaches at scrape time. They come from the scrape configuration.
# prometheus.yml
scrape_configs:
  - job_name: "my-app"          # → adds job="my-app" to every metric
    static_configs:
      - targets:
          - "10.0.0.1:8000"     # → instance="10.0.0.1:8000"
          - "10.0.0.2:8000"     # → instance="10.0.0.2:8000"
- job: the job_name from the scrape config becomes the label as-is. "Which group of services did this metric come from."
- instance: by default the target address (host:port, internal label __address__). "Exactly which process/host within that group."
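That's why you don't set instance/job in exporter code: they would collide with the target labels, and by default Prometheus keeps its own values and renames the exposed ones to exported_instance/exported_job. If you want instance to be a hostname instead, use relabeling.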
relabel_configs:
  - source_labels: [__address__]
    target_label: instance
    replacement: "web-01"
Prometheus also auto-generates an up metric per target: 1 if the scrape succeeded, 0 if it failed. It too carries the job/instance labels.
up{job="my-app", instance="10.0.0.1:8000"} 1
Summary: the metric name and labels like method/code are set by "you" (the exporter); job/instance/up are attached by "Prometheus." Hold that boundary and half the confusion disappears.
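The boundary between exporter labels and target labels can be modeled in a few lines. This is a simplified sketch of the label merge at scrape time, not Prometheus's actual code: attach_target_labels is a hypothetical function, and it ignores relabeling entirely. It does reflect the documented honor_labels behavior, where a colliding scraped label is stored as exported_<name> unless honor_labels is true.

```python
def attach_target_labels(scraped, job, address, honor_labels=False):
    """Model the label set Prometheus stores for one scraped sample."""
    target = {"job": job, "instance": address}
    if honor_labels:
        # Scraped labels win; target labels only fill in what is missing.
        return {**target, **scraped}
    stored = {}
    for key, value in scraped.items():
        # A scraped label that collides with a target label is kept as exported_<name>.
        stored[f"exported_{key}" if key in target else key] = value
    stored.update(target)
    return stored


# What the exporter exposed: only its own labels.
print(attach_target_labels({"method": "GET", "code": "200"}, "my-app", "10.0.0.1:8000"))

# An exporter that (unwisely) sets job itself.
print(attach_target_labels({"job": "from-code"}, "my-app", "10.0.0.1:8000"))
```

The second call yields exported_job="from-code" next to job="my-app", which is the usual sign that application code is setting a label that belongs to the scrape configuration.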
5. Metric types — counter vs gauge (vs histogram/summary)
How a value should be interpreted is determined by its type. Start with the two you'll use most.
Counter — monotonically increasing (cumulative)
- Only ever increases (resets to 0 on process restart). The absolute value means little; what matters is the rate of change.
- Examples: http_requests_total, errors_total, bytes_sent_total
- Queries almost always wrap it in rate()/increase(). "How many per second right now?"
Gauge — a current value that goes up and down
- A snapshot value that rises and falls, meaning the state at that instant.
- Examples: node_memory_MemAvailable_bytes, queue_length, temperature_celsius, inprogress_requests
- Query the value directly, or with avg_over_time(), max_over_time(), etc.
The key distinguishing question: "can this number go down?" If it can, it's a gauge; if it only accumulates upward, it's a counter. Queue length is a gauge; the cumulative count of items that entered the queue is a counter.
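The "can it go down?" question also explains why the two types are read differently. The sketch below is a simplified model with hypothetical helper names: for a counter, any drop between consecutive samples is treated as a restart from 0, while a gauge's change is just the plain difference. Real rate()/increase() additionally extrapolate to the edges of the query window, which this sketch leaves out.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples; a drop means a reset to 0."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The process restarted: the counter began again at 0 and reached curr.
            total += curr
    return total


def gauge_change(samples):
    """A gauge is a snapshot, so its change is simply last minus first."""
    return samples[-1] - samples[0]


requests_total = [10, 15, 20, 3, 8]      # restart between 20 and 3
print(counter_increase(requests_total))  # 5 + 5 + 3 + 5 = 18.0

queue_length = [4, 9, 2]
print(gauge_change(queue_length))        # -2
```

Applying reset handling to the gauge would misread the drop from 9 to 2 as a restart, which is one way to see why rate() on a gauge gives a number without meaning.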
Histogram / Summary — distributions (latency, etc.)
Use these when you care about a "distribution," like response times. A Histogram accumulates observations into buckets (le, less-than-or-equal) and automatically exposes _bucket / _sum / _count series. Quantiles are computed at query time with histogram_quantile().
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 290
http_request_duration_seconds_bucket{le="+Inf"} 300
http_request_duration_seconds_sum 48.2
http_request_duration_seconds_count 300
A Summary computes quantiles on the client side (with the downside that they can't be aggregated). In most cases a histogram is recommended because it can be aggregated server-side.
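To see what "quantiles computed at query time" means for a histogram, here is a rough Python sketch of the estimate applied to the bucket values from the example above. It assumes observations are spread evenly inside each bucket and that the lowest bucket starts at 0. The function is a simplified stand-in written for this post, not the PromQL implementation, and it skips rate() and several edge cases.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound, ending with +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


buckets = [(0.1, 240), (0.5, 290), (math.inf, 300)]
print(round(histogram_quantile(0.95, buckets), 4))  # 0.46
print(round(histogram_quantile(0.50, buckets), 4))  # 0.0625
print(histogram_quantile(0.99, buckets))            # 0.5
```

The 0.99 result shows why bucket boundaries matter: once the quantile lands in the +Inf bucket, the estimate cannot go above the largest finite bound you configured.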
6. How to set values (client libraries)
You can emit the text directly, but normally you create metric objects with an official client library and call methods. The allowed operations differ by type — that's another spot where people get confused at first.
from prometheus_client import Counter, Gauge, Histogram

# Counter: increment only (.inc); no set/dec
REQUESTS = Counter(
    "app_requests_total", "Requests handled", ["method", "endpoint"]
)
REQUESTS.labels(method="GET", endpoint="/api").inc()   # +1
REQUESTS.labels(method="GET", endpoint="/api").inc(3)  # +3

# Gauge: set/inc/dec freely
INPROGRESS = Gauge("app_inprogress_requests", "Requests in progress")
INPROGRESS.set(0)
INPROGRESS.inc()                  # +1
INPROGRESS.dec()                  # -1
INPROGRESS.set(42)                # set an absolute value
INPROGRESS.set_to_current_time()  # current time (unix ts)

# Histogram: feed observations (.observe)
LATENCY = Histogram(
    "app_request_duration_seconds", "Request processing time",
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)
LATENCY.observe(0.23)
Key points:
- Counter offers only .inc()/.inc(n). You don't "set" the value (it's cumulative).
- Gauge supports .set()/.inc()/.dec(). It's a state value.
- Histogram takes samples with .observe(value), and buckets/sum/count update automatically.
For metrics with labels, always pick the label values first with .labels(...), then operate. For timing, decorators/context managers are handy.
# Automatically observe a function's execution time
@LATENCY.time()
def handle():
    ...

# Or per block
with LATENCY.time():
    do_work()

# Track in-progress concurrency on a Gauge
with INPROGRESS.track_inprogress():
    do_work()
7. Exposing it yourself (a minimal exporter)
Here's the smallest example that exposes the metrics above over HTTP. start_http_server brings up the /metrics endpoint.
import random, time
from prometheus_client import start_http_server, Counter, Gauge

REQUESTS = Counter("app_requests_total", "Requests", ["code"])
INPROGRESS = Gauge("app_inprogress_requests", "Requests in progress")

if __name__ == "__main__":
    start_http_server(8000)  # http://localhost:8000/metrics
    while True:
        INPROGRESS.inc()
        time.sleep(random.random())  # simulate a request in flight
        REQUESTS.labels(code="200").inc()
        INPROGRESS.dec()
After it's up:
curl -s localhost:8000/metrics | grep app_
Register this port 8000 as a target in the scrape_configs shown earlier, and Prometheus will scrape it and attach job/instance as it stores the data.
8. Reading it with PromQL — querying by type
How you read a metric depends on its type.
# Counter: look at the per-second rate, not the absolute value
rate(app_requests_total[5m])
# Aggregate by endpoint
sum by (endpoint) (rate(app_requests_total[5m]))
# Error rate = ratio of 5xx
sum(rate(app_requests_total{code=~"5.."}[5m]))
/ sum(rate(app_requests_total[5m]))
# Gauge: take the value directly
app_inprogress_requests
# Histogram: 95th percentile latency
histogram_quantile(
  0.95,
  sum by (le) (rate(app_request_duration_seconds_bucket[5m]))
)
If you graph a counter's raw value without rate(), you'll just see an "ever-climbing sawtooth"; rate()/increase() is almost always the answer.
9. Common points of confusion
- Why the _total suffix? It's the conventional suffix for counters. It signals "cumulative total."
- Do I need to set instance/job in my code? No. Prometheus attaches them at scrape time.
- My counter went down. It reset to 0 on a process restart. rate() automatically compensates for that reset.
- Can I use rate() on a gauge? It's meaningless. Read gauges with the raw value or *_over_time().
- Why seconds and bytes for units? It's the ecosystem standard. Exposing ms/MB won't line up with other dashboards and rules.
- I want to put an ID in a label. Stop. That's the shortcut to a cardinality explosion. High cardinality belongs in logs/traces.
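For the last point, the usual way out is to keep the ID out of the label by reducing the raw URL to a templated path before it is used as a label value. The sketch below is one possible approach with a hypothetical helper, endpoint_label. The choice of ":id" as the placeholder and the rule "numeric or UUID segments are IDs" are assumptions for illustration. If your web framework exposes the matched route template, that is usually a better source.

```python
import re
from urllib.parse import urlsplit

# Path segments treated as IDs in this sketch: all digits, or a UUID.
ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def endpoint_label(url):
    """Turn a raw URL into a low-cardinality endpoint label value."""
    path = urlsplit(url).path  # drops the query string and fragment
    parts = [":id" if ID_SEGMENT.match(part) else part for part in path.split("/")]
    return "/".join(parts) or "/"


print(endpoint_label("/api/users/8123?expand=orders"))  # /api/users/:id
print(endpoint_label("/api/orders"))                    # /api/orders
```

The user ID itself can still be recorded in a log line or a trace attribute, where high cardinality does not cost you a time series per value.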
Wrapping up
A Prometheus metric ultimately reduces to a single line: metric_name{labels} value. Add (1) the naming/unit conventions, (2) labels and cardinality, (3) the fact that job/instance are attached by Prometheus, (4) the semantic difference between counter/gauge/histogram, and (5) how to set values per type — get those five and the initial confusion mostly disappears.
As a next step, connect the same exporter to a Grafana dashboard and bundle a counter-based error rate, gauge-based saturation, and histogram-based latency percentiles onto one screen. "Naming metrics well" is ultimately 80% of good observability.