
[Kafka Performance 1] Measure, Don't Guess — Benchmarking and Producer/Consumer Parameters

Kafka performance tuning starts with measurement, not guesswork. Establish a baseline with kafka-producer-perf-test and kafka-consumer-perf-test, adopt a percentile-based methodology that changes one variable at a time, and learn the producer and consumer parameters that drive throughput and latency.

Data Dynamics · June 20, 2026 · 12 min read

When someone says "I bumped linger.ms to 20 and it got faster," the first thing I want to ask is: "Faster compared to what? Throughput, or p99 latency? Were the message sizes the same as production?" Performance tuning without a baseline is ultimately guessing. Sometimes you get lucky; most of the time you improve one metric while quietly degrading another. This is Part 1 of a three-part Kafka performance mini-series, and it's about the discipline of measuring first, then changing exactly one variable at a time.

What you'll learn in this post

  • Why measurement must come before tuning — throughput, latency, and durability are a trade space, not a single dial
  • How to establish a baseline with kafka-producer-perf-test.sh / kafka-consumer-perf-test.sh and read their output
  • A benchmarking methodology that measures p99/p999 percentiles (not averages) and changes one variable at a time
  • The producer parameters that move throughput and latency (batch.size, linger.ms, acks, and more)
  • The consumer fetch parameters that are surprisingly easy to overlook (fetch.min.bytes, fetch.max.wait.ms, max.poll.records, and more)

1. Why Measure First

Performance Is Not a Single Dial

When you first approach performance tuning, it's tempting to picture a single goal: "make it faster." But in Kafka, "fast" hides three distinct axes.

  • Throughput: records or bytes processed per second (records/sec, MB/sec)
  • Latency: the time for a message to travel from producer to consumer, especially the tail (p99/p999)
  • Durability: how well messages survive failures (acks, replication, min.insync.replicas)

These three form a trade space. Larger batches and a higher linger.ms raise throughput but increase per-message latency. Setting acks=all for durability forces the producer to wait for ISR replication, increasing latency. Pull one lever and another moves. That's why "I tuned it and it got faster" must always come with the question "at the cost of what?"


The Cost of Guessing

Without a baseline, you fall into these traps.

  • Confirmation bias: you want the parameter you changed to have worked, so you read a slight improvement in the average as "an improvement."
  • Confounding variables: change two parameters at once and you'll never know which one mattered.
  • Unrealistic workloads: a value tuned with 100-byte messages can behave in exactly the opposite way under production's 10KB messages.

The point of measuring isn't to find "the correct settings" — it's to causally isolate the effect of a change. That requires a reproducible baseline and the right tooling.


2. Benchmarking Tooling

Kafka ships performance CLIs in the distribution. You can establish a baseline with no extra installation.

kafka-producer-perf-test.sh

Measures the throughput and latency the producer can deliver. The key options:

| Option | Meaning |
|---|---|
| --num-records | Total number of records to send |
| --record-size | Record size in bytes (match production) |
| --throughput | Target throughput cap (records/sec). -1 means unlimited (saturation test) |
| --producer-props | Producer settings such as acks, batch.size, linger.ms, compression.type |
| --print-metrics | Print producer internal metrics on exit |

Example — send 5 million 1KB records with no rate cap to measure producer saturation throughput.

kafka-producer-perf-test.sh \
  --topic perf-test \
  --num-records 5000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props \
    bootstrap.servers=broker1:9092,broker2:9092,broker3:9092 \
    acks=all \
    batch.size=16384 \
    linger.ms=5 \
    compression.type=lz4

Sample output (trailing summary line):

5000000 records sent, 412371.4 records/sec (402.71 MB/sec),
  18.43 ms avg latency, 412.00 ms max latency,
  12 ms 50th, 35 ms 95th, 78 ms 99th, 142 ms 99.9th.

What matters here is not the average latency (18.43 ms) but the percentiles. If p50 is 12 ms while p99 is 78 ms and p999 is 142 ms, most requests are fast but 1% are 6x slower or more. In production, what users feel is that tail.
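
To make that reading repeatable across runs, the summary line can be parsed and the tail ratios computed. This is a minimal sketch that assumes the summary format shown above; `parse_summary` is a hypothetical helper, not part of the Kafka distribution.

```python
import re

# Summary line as printed above, joined into one string.
SUMMARY = (
    "5000000 records sent, 412371.4 records/sec (402.71 MB/sec), "
    "18.43 ms avg latency, 412.00 ms max latency, "
    "12 ms 50th, 35 ms 95th, 78 ms 99th, 142 ms 99.9th."
)

def parse_summary(line: str) -> dict:
    """Extract throughput and latency figures from the perf-test summary line."""
    result = {
        "records_per_sec": float(re.search(r"([\d.]+) records/sec", line).group(1)),
        "mb_per_sec": float(re.search(r"\(([\d.]+) MB/sec\)", line).group(1)),
        "avg_ms": float(re.search(r"([\d.]+) ms avg latency", line).group(1)),
        "max_ms": float(re.search(r"([\d.]+) ms max latency", line).group(1)),
    }
    # Percentiles appear as "<value> ms <N>th".
    for value, pct in re.findall(r"(\d+) ms ([\d.]+)th", line):
        result[f"p{pct}"] = int(value)
    return result

stats = parse_summary(SUMMARY)
tail_ratio = stats["p99"] / stats["p50"]
print(f"p99 is {tail_ratio:.1f}x p50; p99.9 is {stats['p99.9'] / stats['p50']:.1f}x p50")
```

Storing these dictionaries per run makes it easy to compare a later run against the baseline on the tail rather than on the average.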

kafka-consumer-perf-test.sh

Measures how fast a consumer can pull data from the brokers.

kafka-consumer-perf-test.sh \
  --bootstrap-server broker1:9092,broker2:9092,broker3:9092 \
  --topic perf-test \
  --messages 5000000 \
  --fetch-size 1048576 \
  --consumer.config consumer-perf.properties \
  --show-detailed-stats \
  --reporting-interval 1000

Put consumer fetch parameters such as fetch.min.bytes, fetch.max.wait.ms, and max.partition.fetch.bytes in consumer-perf.properties (covered in detail in Section 5). The output is CSV showing per-interval MB/sec, records/sec, and cumulative figures.

start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2026-07-21 10:00:00:000, 2026-07-21 10:00:11:842, 4882.81, 412.34, 5000000, 422301.7
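
For run-to-run comparisons it helps to load this output into typed values. The sketch below assumes the abbreviated header and row shown above; `parse_consumer_output` is a hypothetical helper, and real output may carry more columns depending on the Kafka version and flags.

```python
import csv
import io
from datetime import datetime

# Header and data row as shown above.
OUTPUT = """\
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2026-07-21 10:00:00:000, 2026-07-21 10:00:11:842, 4882.81, 412.34, 5000000, 422301.7
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S:%f"  # note the colon before the milliseconds

def parse_consumer_output(text: str) -> list[dict]:
    """Turn the CSV output into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for raw in reader:
        start = datetime.strptime(raw["start.time"], TIME_FORMAT)
        end = datetime.strptime(raw["end.time"], TIME_FORMAT)
        rows.append({
            "elapsed_sec": (end - start).total_seconds(),
            "mb": float(raw["data.consumed.in.MB"]),
            "mb_per_sec": float(raw["MB.sec"]),
            "records": int(raw["data.consumed.in.nMsg"]),
            "records_per_sec": float(raw["nMsg.sec"]),
        })
    return rows

run = parse_consumer_output(OUTPUT)[0]
avg_record_bytes = run["mb"] * 1024 * 1024 / run["records"]
print(f"{run['elapsed_sec']:.3f} s elapsed, ~{avg_record_bytes:.0f} bytes per record")
```

Deriving the average record size from the output is a quick sanity check that the consumer read the same workload the producer test wrote.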

Heavier Benchmarks

The CLI tools are great for baselines, but they fall short for scenarios where producers and consumers run simultaneously and you need to measure end-to-end latency. For that, consider:

  • OpenMessaging Benchmark (OMB): a standard workload framework that drives producers and consumers together and reports publish latency and end-to-end latency as percentiles. Often used for cross-system messaging comparisons too.
  • Trogdor: Kafka's own load and fault-injection test harness. Well suited for long-running load tests and reproducing failure scenarios.

3. Measurement Methodology

More important than the tooling is the discipline. With the same tool but no methodology, you end up measuring noise.

Warm-up and Steady State

The JVM is slow for the first tens of seconds due to JIT compilation and class loading. The page cache also starts empty. So discard the warm-up window (the first 1–2 minutes) and analyze only the window after the system reaches steady state. Watching the per-interval stats via --reporting-interval lets you visually confirm when steady state begins.
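
Discarding the warm-up window is easy to automate once the per-interval readings are collected. A minimal sketch, using made-up readings and a hypothetical `steady_state` helper; the 4-second warm-up is only sized to this toy series, not a recommendation.

```python
from statistics import mean

def steady_state(samples: list, interval_sec: int, warmup_sec: int) -> list:
    """Drop the samples that fall inside the warm-up window.

    samples: per-interval throughput readings in arrival order.
    interval_sec: seconds covered by each reading.
    warmup_sec: how much of the start of the run to discard.
    """
    skip = -(-warmup_sec // interval_sec)  # ceiling division on integers
    return samples[skip:]

# Hypothetical per-second MB/sec readings: a slow ramp, then a plateau.
readings = [120.0, 210.0, 305.0, 380.0, 401.0, 399.0, 403.0, 400.0, 402.0, 398.0]

steady = steady_state(readings, interval_sec=1, warmup_sec=4)
print(f"all samples:  {mean(readings):.1f} MB/sec")
print(f"steady state: {mean(steady):.1f} MB/sec")
```

Including the ramp drags the reported figure well below what the system sustains, which is exactly the kind of noise that can mask or mimic the effect of a parameter change.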

Percentiles, Not Averages

Average latency hides the tail. Even if p999 is 1 second, the average can look fine at 15 ms. GC pauses, leader elections, and accumulated fetch waits all show up in the tail, so always look at p95, p99, and p999 together. When defining SLOs, write them as percentiles — "p99 < 100 ms" rather than "20 ms average."
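
A small synthetic example shows how far the average and the tail can diverge. The latencies below are invented for illustration, and `percentile` is a simple nearest-rank implementation, not the method any particular Kafka tool uses.

```python
import math
from statistics import mean

def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least a fraction q at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(round(q * len(ordered), 9))  # round() guards against float noise
    return ordered[max(rank, 1) - 1]

# Synthetic latencies in ms: 990 fast requests and 10 that hit a 1-second stall.
latencies = [5] * 990 + [1000] * 10

avg = mean(latencies)
p50, p99, p999 = (percentile(latencies, q) for q in (0.50, 0.99, 0.999))
print(f"avg={avg:.2f} ms  p50={p50} ms  p99={p99} ms  p99.9={p999} ms")

# An SLO written as a percentile catches what the average hides.
print("avg < 20 ms:", avg < 20)
print("p99.9 < 100 ms:", p999 < 100)
```

In this series even p99 looks healthy; only p99.9 exposes the stall, which is why the three percentiles are worth reading together.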

One Variable at a Time

This is the most important rule. If you raise batch.size and linger.ms together and throughput improves, you can't tell which one mattered. Change one thing at a time, and compare against the baseline every time.

A Production-Like Workload

  • Message size: measure with production's average/max sizes. Compression effectiveness depends heavily on payload content (repetitiveness), so synthetic random data distorts the compression ratio.
  • Key distribution: if keys skew toward one partition, you get a hot partition. Mimic production's key cardinality so the partitioning effect shows up.
  • Partition count and consumer count: parallelism is a first-order variable for throughput, so match the production topology.
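
The point about payload content can be checked in a few lines. This sketch uses Python's `zlib` as a stand-in codec (Kafka's lz4 and zstd are not in the standard library) and an invented JSON-like record, so the absolute ratios are illustrative only.

```python
import os
import zlib

def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower means it compresses better)."""
    return len(zlib.compress(payload)) / len(payload)

# Two roughly 1 KB payloads: random bytes vs. a repetitive JSON-like record.
random_payload = os.urandom(1024)
record = b'{"user_id": 12345, "event": "page_view", "status": "ok"}'
repetitive_payload = record * 18  # 1008 bytes

print(f"random:     {compression_ratio(random_payload):.2f}")
print(f"repetitive: {compression_ratio(repetitive_payload):.2f}")
```

Random bytes barely compress at all, while the repetitive payload shrinks dramatically. Real production payloads usually sit somewhere in between, which is why benchmarking with sampled production data gives more trustworthy numbers than either extreme.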

Run the loop (measure the baseline, change one variable, re-measure, compare) for each parameter in turn. When you're done, you can summarize each parameter's impact on throughput and p99 in a table. That table is the real output of tuning.
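
One way to keep yourself honest about "one variable at a time" is to generate the run list from the baseline instead of editing commands by hand. The sketch below is an assumption about how you might organize it: `variants` and `command` are hypothetical helpers, and the sweep values are examples. The CLI flags are the ones used earlier in this post.

```python
# Baseline producer properties, matching the earlier perf-test example.
BASELINE = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",
    "acks": "all",
    "batch.size": "16384",
    "linger.ms": "5",
    "compression.type": "lz4",
}

def variants(baseline: dict, sweeps: dict) -> list[tuple[str, dict]]:
    """One config per (parameter, value) pair; everything else stays at baseline."""
    runs = [("baseline", dict(baseline))]
    for param, values in sweeps.items():
        for value in values:
            if value != baseline[param]:
                runs.append((f"{param}={value}", {**baseline, param: value}))
    return runs

def command(props: dict, topic="perf-test", num_records=5_000_000, record_size=1024) -> str:
    """Render the perf-test command line for one run."""
    prop_args = " ".join(f"{k}={v}" for k, v in props.items())
    return (
        f"kafka-producer-perf-test.sh --topic {topic} --num-records {num_records} "
        f"--record-size {record_size} --throughput -1 --producer-props {prop_args}"
    )

plan = variants(BASELINE, {"linger.ms": ["5", "10", "20"], "batch.size": ["16384", "65536"]})
for name, props in plan:
    print(f"# {name}\n{command(props)}")
```

Every generated run differs from the baseline in exactly one property, so any change in throughput or p99 can be attributed to that property.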


4. Producer Parameters That Move Performance

A deep dive on producer tuning lives in a separate post, [Kafka Ops 8] Producer Throughput Tuning, so here we summarize from the perspective of what to treat as a variable in a benchmark.

| Parameter | Role | Effect when raised or enabled |
|---|---|---|
| batch.size | Per-partition batch buffer size (bytes) | Throughput↑ (fewer round trips), slight latency↑ until full |
| linger.ms | Time to wait to fill a batch | Throughput↑, latency↑ (by the wait time) |
| compression.type | lz4/zstd/snappy/gzip | Network/disk↓, CPU↑. lz4 for speed, zstd for ratio |
| buffer.memory | Total producer send buffer | If short, send() blocks and throughput collapses |
| acks | Acknowledgment strength 0/1/all | all → durability↑, latency↑; 1 → latency↓, durability↓ |
| max.in.flight.requests.per.connection | Unacked in-flight requests | Throughput↑, but interacts with ordering/retries |
| enable.idempotence | Idempotent producer (no dup/reorder) | Recommend true. Retries cannot introduce duplicates or reordering within a partition |
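
The batch.size and linger.ms rows interact: whichever limit is hit first closes the batch. The sketch below is a deliberately simplified model (constant arrival rate on a single partition, no compression or record overhead, sender never busy), and `batch_wait_ms` is a hypothetical helper, but it shows which of the two settings is actually in control for a given load.

```python
def batch_wait_ms(batch_size, record_size, records_per_sec, linger_ms):
    """Estimate how long a batch stays open before it is sent.

    Simplified: a batch is sent when it fills or when linger.ms expires,
    whichever comes first. records_per_sec is the arrival rate for one partition.
    """
    records_per_batch = max(batch_size // record_size, 1)
    fill_ms = records_per_batch / records_per_sec * 1000
    wait_ms = min(fill_ms, linger_ms)
    bound = "batch.size" if fill_ms <= linger_ms else "linger.ms"
    return wait_ms, bound

# 1 KB records arriving at 1,000 records/sec on one partition, batch.size=16384.
for linger in (5, 20):
    wait, bound = batch_wait_ms(16384, 1024, 1000, linger)
    print(f"linger.ms={linger}: batch closes after ~{wait:.0f} ms (bound by {bound})")
```

At this rate a 16 KB batch takes about 16 ms to fill, so raising linger.ms beyond that changes nothing, while raising batch.size would. Checking which limit binds before a sweep saves benchmark runs that cannot show an effect.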

Throughput vs Latency at a Glance

| Change | Throughput | Latency (p99) | Durability |
|---|---|---|---|
| batch.size ↑ + linger.ms ↑ | ▲ large | ▼ worse | |
| compression.type=lz4 | ▲ (less network) | ≈ (with CPU headroom) | |
| acks=all → acks=1 | | ▲ better | ▼ risky |
| enable.idempotence=true | | | |

Note: For the deep dive on acks from a durability standpoint, see [Kafka Ops 4] acks · min.insync.replicas. When you change acks in a benchmark, always record the durability change alongside it. You may not have gained speed — you may have sold safety.

# Throughput-oriented producer baseline (then tune one variable at a time)
acks=all
enable.idempotence=true
compression.type=lz4
batch.size=32768
linger.ms=10
max.in.flight.requests.per.connection=5
buffer.memory=67108864

Start from these values, then change only linger.ms (for example, from 10 to 20) and measure how throughput and p99 move. Repeat the same way for each parameter.


5. Consumer Parameters — The Commonly Overlooked Spot

Producer tuning gets plenty of airtime, but consumer fetch parameters get far less attention, even though they heavily shape consumer-side throughput and latency. The core idea is fetch economics: how much data does the broker accumulate before handing it over in one go?

How Fetch Parameters Work

The consumer calls poll(), and internally it sends fetch requests to the broker. The broker decides on a response using these rules.

| Parameter | Meaning | Trade-off |
|---|---|---|
| fetch.min.bytes | Minimum bytes the broker gathers before responding | Larger → fewer fetches, throughput↑, latency↑ |
| fetch.max.wait.ms | Upper bound on waiting even if fetch.min.bytes isn't met | The latency cap on the min.bytes wait. Lower it → latency↓, throughput↓ |
| max.partition.fetch.bytes | Max bytes fetched per partition at once | Larger → per-partition throughput↑, memory↑ |
| fetch.max.bytes | Max bytes for an entire fetch request | Response size cap. Interacts with memory/latency |
| max.poll.records | Max records returned by one poll() | Processing batch size. Too large risks exceeding max.poll.interval.ms |
| receive.buffer.bytes | Socket receive buffer (TCP) size | Throughput↑ on high-latency, high-bandwidth links |

The Tug-of-War Between fetch.min.bytes and fetch.max.wait.ms

These two are a pair. With fetch.min.bytes=1 (the default), the broker responds as soon as any data is available, even a single byte. Latency is minimal, but under a thin trickle of traffic nearly every record ends up with its own fetch round trip, and that request volume wastes broker CPU and network.

Raise fetch.min.bytes to, say, 64KB, and the broker waits until that much accumulates. Fewer fetches means higher throughput and efficiency, but data arrives at the consumer that much later. You can't let that wait run forever, so fetch.max.wait.ms (default 500ms) acts as the cap. In other words: "respond once 64KB gathers, or unconditionally after 500ms even if it doesn't."
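
The "whichever comes first" rule can be written down as a small model. This is a sketch under simplifying assumptions (data arrives at a constant rate and nothing is buffered when the fetch arrives); `fetch_response_delay_ms` is a hypothetical helper, not a Kafka API.

```python
def fetch_response_delay_ms(fetch_min_bytes, fetch_max_wait_ms, bytes_per_sec):
    """Estimate how long the broker holds a fetch before responding.

    The broker answers once fetch.min.bytes has accumulated or
    fetch.max.wait.ms has elapsed, whichever comes first.
    """
    if bytes_per_sec <= 0:
        return float(fetch_max_wait_ms)
    fill_ms = fetch_min_bytes / bytes_per_sec * 1000
    return min(fill_ms, float(fetch_max_wait_ms))

# fetch.min.bytes=65536, fetch.max.wait.ms=500 under two traffic levels.
for label, rate in (("busy topic, 1 MiB/sec", 1_048_576), ("quiet topic, 10 KiB/sec", 10_240)):
    delay = fetch_response_delay_ms(65_536, 500, rate)
    print(f"{label}: responds after ~{delay:.1f} ms")
```

On the busy topic the size threshold is reached in about 62.5 ms, so fetch.max.wait.ms never matters. On the quiet topic the timer always fires first, which means the added latency there is governed by fetch.max.wait.ms, not by fetch.min.bytes.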

| Setting | Throughput | Latency | Good for |
|---|---|---|---|
| fetch.min.bytes=1, fetch.max.wait.ms=100 | Moderate | Very low | Real-time alerts, latency-first |
| fetch.min.bytes=65536, fetch.max.wait.ms=500 | High | Medium | General streaming/ETL, throughput-first |
| fetch.min.bytes=1048576, fetch.max.wait.ms=1000 | Very high | High | Batch-style bulk loads, latency-tolerant |

The max.poll.records ↔ max.poll.interval.ms Interaction

max.poll.records caps the application's processing batch size; it is not a network setting and does not change how data is fetched. If one poll() returns 500 records, you must process all 500 before the next poll(). If processing time exceeds max.poll.interval.ms (default 5 minutes), the consumer is considered dead, a rebalance triggers, and you can't commit the work done in the meantime, which causes duplicate processing.

So if per-record processing is expensive (e.g., an external API call, a DB upsert), lower max.poll.records so one batch finishes within the interval. Conversely, for lightweight processing, raise it to reduce loop overhead. This interaction is directly tied to rebalance storms, so for the deep dive from a lag/fetch angle, also see [Kafka Ops 2] 7 Reasons Consumer Lag Won't Go Down.
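
The sizing rule is simple arithmetic, sketched below. `max_safe_poll_records` and `batch_time_ms` are hypothetical helpers, the per-record costs are invented, and the 0.5 safety factor is an arbitrary starting point rather than a Kafka recommendation.

```python
def max_safe_poll_records(max_poll_interval_ms, per_record_ms, safety_factor=0.5):
    """Largest max.poll.records whose worst-case batch fits in the interval.

    safety_factor leaves headroom for slow records, GC pauses, and retries.
    """
    budget_ms = max_poll_interval_ms * safety_factor
    return max(int(budget_ms // per_record_ms), 1)

def batch_time_ms(max_poll_records, per_record_ms):
    """Worst-case time between two poll() calls for a full batch."""
    return max_poll_records * per_record_ms

# 500 records per poll and a 5-minute interval, at two per-record costs.
for per_record_ms in (200, 1000):
    worst = batch_time_ms(500, per_record_ms)
    verdict = "ok" if worst < 300_000 else "exceeds max.poll.interval.ms"
    safe = max_safe_poll_records(300_000, per_record_ms)
    print(f"{per_record_ms} ms/record: full batch takes {worst / 1000:.0f} s ({verdict}); "
          f"suggested cap {safe}")
```

Use a worst-case (or high-percentile) per-record time here, not the average, for the same reason latency SLOs are written as percentiles.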

# Throughput-oriented consumer baseline
fetch.min.bytes=65536
fetch.max.wait.ms=500
max.partition.fetch.bytes=1048576
fetch.max.bytes=52428800
max.poll.records=500
receive.buffer.bytes=1048576
max.poll.interval.ms=300000

The rule here is the same. Change only fetch.min.bytes from 1 to 65536, measure the throughput and p99 change, then adjust only max.poll.records next — one variable at a time.


Wrapping up

The first step in performance tuning isn't finding better parameters — it's building an environment where you can isolate the effect of a change through measurement. A reproducible baseline, a production-like workload, percentile-based measurement, and one variable at a time — once these four are in place, parameter tuning becomes an experiment rather than a guess.

  • Measure first: tuning without a baseline is guessing. Throughput, latency, and durability are a trade space, not a single dial.
  • Read the percentiles: averages hide the tail. p99 and p999 determine user experience and SLOs.
  • One variable at a time: change two at once and you lose causality. Compare against the baseline every time.
  • Producers: batch.size, linger.ms, compression.type, and acks arbitrate throughput/latency/durability.
  • Consumers: the fetch.min.bytes ↔ fetch.max.wait.ms tug-of-war and the max.poll.records ↔ max.poll.interval.ms interaction are the crux.

A teaser for what's next. [Kafka Performance 2] covers broker- and partition-level parameters (partition count, num.io.threads, num.network.threads, num.replica.fetchers, segments and page cache). [Kafka Performance 3] wraps up with OS-level tuning (file descriptors, vm.dirty_ratio, the network stack) and combined profiles per workload.

— The Data Dynamics Engineering Team