
[Kafka Performance 2] Brokers and Partitions — Threads, Replication, and Partition Count

How to tune the Kafka broker thread model (num.network.threads/num.io.threads), socket buffers, replication throughput (num.replica.fetchers), and the biggest lever of all — partition count — driven by the metrics that tell you what to change.

Data Dynamics · June 21, 2026 · 13 min read

Part 1 pushed producer and consumer throughput up. Now it's time to look at the side that actually absorbs that traffic — the broker. On the same hardware, broker throughput can vary by several times depending on how it divides its threads and how many partitions you split a topic into. And the most common mistake in this area is "bumping numbers by gut feeling." The real key to broker tuning is knowing which metric points at which parameter. This post is organized around exactly that correspondence.

What you'll learn in this post

The broker's two thread pools (num.network.threads/num.io.threads) and the path a request travels

Which JMX metric tells you precisely which thread pool to grow

What socket/buffer settings mean on high-bandwidth, high-latency links

How replication throughput (num.replica.fetchers) relates to under-replicated partitions

Why you should leave log.flush.* alone — durability comes from replication

Partition count, the biggest lever: the throughput gains and what they cost

1. The Broker Thread Model — How a Request Flows

To understand broker tuning, you first have to picture the path a single request takes inside the broker. The Kafka broker does not handle an incoming request on a single thread. Two thread pools with different roles cooperate across a queue in the middle.

Two thread pools

Setting	Role	Default	One-line summary
`num.network.threads`	Network processors	3	Read requests off the socket, write responses back
`num.io.threads`	Request handlers	8	Do the actual work — disk I/O, replication, etc.
`queued.max.requests`	Request queue	500	The buffer (queue) between network and I/O threads

The point is the separation of roles. Network threads never touch the disk. Their job is to read bytes off the socket, parse them into a request object, and place that into the shared request queue. Conversely, I/O (request handler) threads never deal with the socket directly. They pull requests off the queue and perform the heavy work — writing to the log, reading from disk, updating replication state.

The full request path

Client
   │  (TCP)
   ▼
[Acceptor thread]               ─ accepts connections, hands off to processors
   ▼
[Network Thread]                 ─ num.network.threads
   │  socket read → parse request
   ▼
[Request Queue]                  ─ queued.max.requests (shared queue)
   ▼
[I/O / Request Handler Thread]   ─ num.io.threads
   │  log append / disk read / replication
   ▼
[Purgatory]                      ─ requests that can't be answered immediately
   │  (ISR ack wait for acks=all, min.bytes wait for fetch, etc.)
   ▼
[Response Queue]                 ─ per-network-thread response queue
   ▼
[Network Thread]                 ─ socket write → response to client

Purgatory is easy to misunderstand, so let's pin it down. An acks=all produce request cannot be answered the instant it's written to the leader. It has to wait until the ISR (in-sync replica) followers replicate up to that offset. A consumer's fetch request may likewise wait until fetch.min.bytes of data accumulates. Purgatory is where such requests are held until their condition is met (or a timeout expires). In other words, requests piling up in Purgatory is normal, and crucially, they wait there without occupying an I/O thread.

Once you understand this structure, the tuning rule in the next section follows naturally. The starting point is to use metrics, not guesswork, to determine which pool is the bottleneck.
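The separation of roles can be sketched as a toy model. This is not Kafka code: the function name, the byte-string "requests", and the upper-casing that stands in for a log append are all assumptions made for illustration, and Purgatory and the per-thread response queues are left out. What it keeps is the shape: one pool only parses, one pool only works, and a bounded queue sits between them.

```python
import queue
import threading

NUM_NETWORK_THREADS = 3     # stand-in for num.network.threads
NUM_IO_THREADS = 8          # stand-in for num.io.threads
QUEUED_MAX_REQUESTS = 500   # stand-in for queued.max.requests


def run_pipeline(raw_requests):
    """Push raw requests through network threads, a bounded queue, and I/O threads."""
    incoming = queue.Queue()
    for raw in raw_requests:
        incoming.put(raw)
    request_queue = queue.Queue(maxsize=QUEUED_MAX_REQUESTS)  # shared and bounded
    responses = queue.Queue()

    def network_thread():
        # Reads bytes and parses them; never does the heavy work itself.
        while True:
            try:
                raw = incoming.get_nowait()
            except queue.Empty:
                return
            # put() blocks when the request queue is full (backpressure).
            request_queue.put({"payload": raw.decode("utf-8")})

    def io_thread():
        # Pulls parsed requests and does the heavy work; never reads raw bytes.
        while True:
            request = request_queue.get()
            if request is None:  # shutdown sentinel
                return
            responses.put(request["payload"].upper())  # placeholder for a log append

    net = [threading.Thread(target=network_thread) for _ in range(NUM_NETWORK_THREADS)]
    io = [threading.Thread(target=io_thread) for _ in range(NUM_IO_THREADS)]
    for t in net + io:
        t.start()
    for t in net:
        t.join()
    for _ in io:
        request_queue.put(None)  # one sentinel per I/O thread, queued after all real work
    for t in io:
        t.join()

    results = []
    while not responses.empty():
        results.append(responses.get())
    return sorted(results)


print(run_pipeline([b"produce", b"fetch", b"metadata"]))
```

If the I/O threads in this model are slow, the bounded queue fills and the network threads block on `put()`, which is the same backlog the idle-percent metrics are meant to reveal.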

2. The Metric-Driven Tuning Rule — Don't Guess, Measure

Blindly raising thread counts is not the answer. More threads mean more context switching and lock contention, which can make things slower. Fortunately, Kafka exposes the idle percent of each thread pool precisely via JMX. These two metrics tell you essentially everything.

The two key idle-percent metrics

Metric (JMX)	Meaning	Parameter it points at
`kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`	Average idle ratio of I/O (request handler) threads	`num.io.threads`
`kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`	Average idle ratio of network processors	`num.network.threads`

Both are values in the range 0.0 – 1.0 (1.0 = 100% idle, i.e., doing nothing).

How to read them

RequestHandlerAvgIdlePercent
  ≈ 0.0  →  I/O threads saturated. Disk/replication work is backing up.
            → raise num.io.threads (but not beyond what the disks can serve)
  ≈ 1.0  →  I/O threads have slack. Safe to reduce.
 
NetworkProcessorAvgIdlePercent
  ≈ 0.0  →  Network threads saturated. Socket read/write is backing up.
            → raise num.network.threads
  ≈ 1.0  →  Network threads have slack.

A field rule of thumb: when idle percent stays persistently below 0.3 (30%), treat that pool as a candidate to grow. If it has hit 0.0, requests are already backing up in the queue — that's on the late side. Adjust one thing at a time and re-measure after each change. If you raise both parameters at once, you won't know which one helped.

Caution: Raising num.io.threads is pointless if the disk itself is the bottleneck. If I/O threads are always busy (idle ≈ 0) AND disk utilization is at 100%, that's not a thread shortage. It's a disk bandwidth shortage, covered in Part 3 (OS/hardware).
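The reading rules and the caution above can be folded into one small decision helper. This is a sketch, not a Kafka API: the function name is hypothetical, disk utilization is assumed to come from an OS tool rather than from Kafka, the 0.95 cutoff is an assumed stand-in for "disk at 100%", and checking the I/O pool first is an arbitrary ordering chosen so that only one change is suggested at a time.

```python
def recommend(handler_idle, network_idle, disk_util, threshold=0.3):
    """Map idle-percent readings (0.0 to 1.0) to the single parameter to change next.

    handler_idle: RequestHandlerAvgIdlePercent
    network_idle: NetworkProcessorAvgIdlePercent
    disk_util:    disk utilization in 0.0 to 1.0 (assumed to come from the OS)
    """
    if handler_idle < threshold:
        if disk_util >= 0.95:  # assumed cutoff for "the disk is saturated"
            return "disk bandwidth is the limit; more I/O threads will not help"
        return "raise num.io.threads, then re-measure"
    if network_idle < threshold:
        return "raise num.network.threads, then re-measure"
    return "no thread pool is short; leave both settings alone"


print(recommend(handler_idle=0.10, network_idle=0.80, disk_util=0.40))
print(recommend(handler_idle=0.05, network_idle=0.80, disk_util=1.00))
print(recommend(handler_idle=0.70, network_idle=0.20, disk_util=0.30))
```

Feed it values that have stayed low over a sustained window, not a single sample, since the rule of thumb is about idle percent that is persistently below the threshold.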

The request pipeline diagram

[Diagram: the request pipeline, with a dashed arrow from each idle-percent metric to the thread pool it measures]

The two dashed arrows in the diagram are the summary of this whole post. A metric points at a thread pool, and that thread pool is the parameter.

3. Sockets and Buffers — The High-Bandwidth, High-Latency Trap

Even with enough threads, small socket buffers can choke throughput. This is especially true on links with large bandwidth but also large latency (large BDP) — cross-datacenter replication or a geographically distant MirrorMaker.

Setting	Meaning	Default (for reference)
`socket.send.buffer.bytes`	Socket send buffer (SO_SNDBUF)	102400 (100KB)
`socket.receive.buffer.bytes`	Socket receive buffer (SO_RCVBUF)	102400 (100KB)
`socket.request.max.bytes`	Maximum size of a single request	104857600 (100MB)

Size the buffer to fill the BDP

The BDP (Bandwidth-Delay Product = bandwidth × round-trip latency) is the amount of data that must be in flight to keep a link full. If the buffer is smaller than the BDP, the sender stalls waiting for ACKs and can't fill the pipe, so throughput tops out at roughly buffer size ÷ RTT.

Example: 10 Gbps link, RTT 50ms (cross-DC)
  BDP = 10e9 bit/s × 0.05 s / 8 = 62.5 MB
 
  With the default 100KB send buffer, you use only a tiny fraction of the link.
  Raise socket.send.buffer.bytes / socket.receive.buffer.bytes to
  the BDP level (or set -1 to delegate to OS autotuning) to fill the link.

Setting -1 delegates to the OS's TCP autotuning. On Linux, if net.ipv4.tcp_rmem/tcp_wmem are set generously, this is often the cleaner choice. socket.request.max.bytes must be large enough not to reject big batches (a large replica.fetch.max.bytes or a large produce batch) — but since it's a memory-protection ceiling, don't grow it without bound.
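The BDP arithmetic is easy to reproduce for your own links. The sketch below uses hypothetical helper names and a deliberately simple model (throughput is at most buffer ÷ RTT, ignoring packet loss and congestion control), so treat the numbers as an upper bound rather than a prediction.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes: data that must be in flight to fill the link."""
    return bandwidth_bits_per_s * rtt_ms / 1000 / 8


def buffer_limited_throughput(buffer_bytes, rtt_ms):
    """Upper bound in bytes/s when at most one buffer of data is in flight per round trip."""
    return buffer_bytes * 1000 / rtt_ms


LINK_BITS_PER_S = 10_000_000_000   # 10 Gbps
RTT_MS = 50                        # cross-DC round trip
DEFAULT_BUFFER = 102400            # default socket.send.buffer.bytes

bdp = bdp_bytes(LINK_BITS_PER_S, RTT_MS)
ceiling = buffer_limited_throughput(DEFAULT_BUFFER, RTT_MS)
share = ceiling * 8 / LINK_BITS_PER_S

print(f"BDP: {bdp / 1e6:.1f} MB")                       # 62.5 MB
print(f"Ceiling with default buffer: {ceiling / 1e6:.2f} MB/s")
print(f"Share of the link used: {share:.2%}")
```

With the default 100KB buffer this model puts the ceiling at about 2 MB/s, a fraction of a percent of the link, which is why the buffer has to grow toward the BDP on such paths.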

4. Replication Throughput — The Lever That Reduces Under-Replicated Partitions

The reason brokers form a cluster is replication. And when replication can't keep up, it shows up immediately as URP (Under-Replicated Partitions). A non-zero URP signals that some follower isn't catching up to its leader — and if the leader dies in that state, the risk of data loss rises.

Follower fetch parallelism

Followers replicate by fetching data from leaders. num.replica.fetchers decides how many threads parallelize that fetch.

Setting	Meaning	Default
`num.replica.fetchers`	Threads a broker uses to fetch replication data from leaders	1
`replica.fetch.max.bytes`	Max bytes per partition in a fetch response	1048576 (1MB)
`replica.fetch.min.bytes`	Min accumulated bytes before a fetch response is returned	1

The default num.replica.fetchers=1 means a single thread per source broker. On a cluster with many partitions and heavy write load, that one thread can't carry all the replication, and URP spikes. Raising num.replica.fetchers to 2–4 parallelizes replication fetches so followers catch up to leaders faster.

Symptom: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions > 0 persists
         (worsens during load spikes)
 
Diagnosis order:
  1. Is the network/disk saturated?  → if so, hardware problem (Part 3)
  2. If not, suspect insufficient replication fetch parallelism
     → raise num.replica.fetchers (e.g., 1 → 4)
  3. If messages are large, raise replica.fetch.max.bytes too
     (keep it at least as large as max.message.bytes; on older brokers
      a batch larger than this limit can stall replication)

Raising replica.fetch.min.bytes lets a follower collect several messages and receive them in one shot, reducing fetch count (more efficient) but potentially increasing replication latency — so weigh throughput against latency. The key metrics to watch are UnderReplicatedPartitions (should be 0) and ReplicaFetcherManager's MaxLag (how far the follower trails the leader).
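The diagnosis order above can be written down as a checklist function. This is only a sketch of the reasoning in this section: the function name and its boolean inputs are hypothetical, and deciding whether hardware is "saturated" is left to whatever OS-level monitoring you already have.

```python
def diagnose_urp(urp_count, hardware_saturated, num_replica_fetchers, large_messages):
    """Return the next step(s) for persistent under-replicated partitions.

    urp_count:            current UnderReplicatedPartitions value
    hardware_saturated:   True if network or disk is already at its limit
    num_replica_fetchers: the broker's current num.replica.fetchers
    large_messages:       True if the topics carry unusually large messages
    """
    if urp_count == 0:
        return ["healthy: nothing to change"]
    if hardware_saturated:
        # Step 1: more fetcher threads cannot help a saturated network or disk.
        return ["network or disk is saturated: treat it as a hardware problem first"]
    # Step 2: otherwise suspect too little replication fetch parallelism.
    steps = [
        f"raise num.replica.fetchers (currently {num_replica_fetchers}) in small steps and re-measure"
    ]
    if large_messages:
        # Step 3: large messages also call for a bigger per-partition fetch size.
        steps.append("review replica.fetch.max.bytes against max.message.bytes")
    return steps


for step in diagnose_urp(urp_count=12, hardware_saturated=False,
                         num_replica_fetchers=1, large_messages=True):
    print("-", step)
```

The early return on saturated hardware is the point of the ordering: it keeps you from adding fetcher threads to a broker that has no spare network or disk capacity to give them.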

5. Don't Fight the OS — Leave `log.flush.*` Alone

The most tempting (and almost always wrong) tweak in broker tuning is forcing flushes. Thinking "wouldn't fsyncing every message be safer?", people often reach for these settings.

Setting	Meaning	Default behavior
`log.flush.interval.messages`	Force fsync every N messages	Effectively unlimited (unset)
`log.flush.interval.ms`	Force fsync every N ms	Unset (delegated to OS)

Bottom line first: don't touch these. Kafka's core philosophy is to leave flushing to the OS. Data the producer sends is written to the page cache first, and the OS flushes it to disk in batches at efficient moments. Forcing an fsync per message defeats this batching and can cut throughput severalfold or more.

Durability is the job of replication, not fsync

"Then what about data safety?" is the natural follow-up. In Kafka, the mechanism that guarantees durability is replication, not the fsync of an individual disk.

With producer acks=all + topic min.insync.replicas=2, a message must land in the logs (page cache included) of at least 2 brokers before an ack returns.
Even if one broker dies before fsync and loses its page cache contents, another ISR broker holds that data.
That is, "replicated to multiple machines" is a stronger guarantee than "force-written to disk on one machine." It's safer for the data to live on several machines than for one machine's disk to survive.

The meaning and combinations of acks and min.insync.replicas are covered in detail in Part 4 (producer reliability) of this series. Here, just remember two things: durability is replication's job, and flushing is the OS's job. The moment you force a flush, you give up throughput without gaining any safety beyond what replication already guarantees.
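To make the replication argument concrete, here is a minimal sketch of how many copies of a record exist at the moment an acks=all produce is acknowledged. The function names are hypothetical and the model is simplified: it counts copies in broker logs (page cache included), not copies that have reached disk, and it assumes the broker refuses the write when the ISR is smaller than min.insync.replicas.

```python
def copies_at_ack(isr_size, min_insync_replicas):
    """Copies of a record that exist when an acks=all produce is acknowledged.

    Returns 0 when the write is refused because the ISR is too small.
    """
    if isr_size < min_insync_replicas:
        return 0        # the broker rejects the write instead of acking it
    return isr_size     # the leader and every in-sync follower hold the record


def broker_losses_survivable(isr_size, min_insync_replicas):
    """How many of those brokers can be lost while one copy still remains."""
    return max(copies_at_ack(isr_size, min_insync_replicas) - 1, 0)


MIN_ISR = 2
for isr in (3, 2, 1):
    print(f"ISR={isr}: copies at ack={copies_at_ack(isr, MIN_ISR)}, "
          f"broker losses survivable={broker_losses_survivable(isr, MIN_ISR)}")
```

With min.insync.replicas=2 an acknowledged record always sits on at least two brokers, so losing one broker's unflushed page cache still leaves a copy elsewhere.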

6. Partition Count — The Biggest Lever, and the Biggest Trap

A single decision affects throughput more than all the settings above combined: partition count. Partitions are the basic unit of Kafka parallelism.

Why throughput rises

A partition is an independent, order-preserving log distributed across different brokers/disks → writes and reads are parallelized.
A consumer group's parallelism ceiling = partition count. With 10 partitions, at most 10 consumers can work concurrently (any more sit idle).
So aggregate throughput generally scales with partition count — up to a point.

But it's not free — the partition-count trade-offs

Benefit when increased	Cost paid when increased
Producer/consumer parallelism ↑	More open file handles per broker (many segment files per partition)
Aggregate throughput ↑	More client memory/requests (buffers and metadata per partition)
Room for consumer scale-out ↑	Longer leader election/recovery time (more leaders to move on broker failure)
Hot-partition spreading	Higher end-to-end latency (wider replication fan-out)
	Heavier rebalances (more partitions to move on membership change)

Watch out especially for leader election/recovery time. When one broker dies, a new leader must be elected for every partition it led; with tens of thousands of partitions, this process drags on and directly hurts availability. "Plenty of partitions, just in case" is expensive insurance.

A sizing heuristic

Size partition count by dividing target throughput by the throughput a single partition can sustain. But compute the producer side and consumer side separately, and take the larger value.

Target throughput = T (e.g., 1 GB/s)
 
Producer side:  per-partition write throughput = Tp  →  partitions = ceil(T / Tp)
Consumer side:  per-partition read throughput  = Tc  →  partitions = ceil(T / Tc)
 
partition count = max( ceil(T / Tp), ceil(T / Tc) )
 
Example: T = 1000 MB/s, Tp = 50 MB/s, Tc = 25 MB/s
  producer side = ceil(1000/50) = 20
  consumer side = ceil(1000/25) = 40
  → partition count = max(20, 40) = 40

Tp and Tc vary by environment, so always benchmark with a representative message size and use the measured values (see Part 1 for measuring producer/consumer throughput).
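The heuristic is a one-liner once Tp and Tc are measured. A minimal sketch, with a hypothetical function name and all throughputs in MB/s:

```python
import math


def partition_count(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition):
    """Partitions needed so that neither the producer nor the consumer side is the limit."""
    producer_side = math.ceil(target_mb_s / producer_mb_s_per_partition)
    consumer_side = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(producer_side, consumer_side)


# The worked example from the text: T = 1000, Tp = 50, Tc = 25.
print(partition_count(1000, 50, 25))  # 40
```

Taking the larger of the two sides matters because the slower side sets the requirement: here consumers, at 25 MB/s per partition, need twice as many partitions as producers do.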

Easy to increase, hard to decrease — and key→partition stability

This is the most important constraint when deciding partition count.

Partitions can be increased but not (easily) decreased. To reduce, you must create a new topic and re-load the data.
A more important trap: adding partitions breaks the key→partition mapping. The default partitioner picks a partition with hash(key) % partitionCount, so the moment partitionCount changes, the same key goes to a different partition than before.
As a result, per-key message ordering can break at the boundary of when partitions were added. For a system that depends on per-key ordering (e.g., per-user event order), adding partitions in production can be catastrophic.
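The remapping effect is easy to see with a few lines of code. This sketch uses `zlib.crc32` purely as a stable stand-in hash (Kafka's default partitioner uses a different hash function), and the key names and partition counts are made up for illustration. The modulo step is the part that matters.

```python
import zlib


def partition_for(key, partition_count):
    """hash(key) % partitionCount, with crc32 standing in for the real hash."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


def moved_keys(keys, old_count, new_count):
    """Keys whose partition changes when the partition count changes."""
    return [k for k in keys
            if partition_for(k, old_count) != partition_for(k, new_count)]


keys = [f"user-{i}" for i in range(1000)]
moved = moved_keys(keys, old_count=40, new_count=60)
print(f"{len(moved)} of {len(keys)} keys now map to a different partition")
```

Every key in `moved` has its older messages in one partition and its newer messages in another, and since ordering is only preserved within a partition, a consumer can no longer rely on per-key order across that boundary.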

The principles and caveats of key-based ordering are covered in Part 10 (message ordering and key partitioning) of this series. The lesson here is clear — set partition count generously up front, but not excessively, and only change it in production after carefully reviewing ordering dependencies.

Wrapping up

One sentence runs through broker and partition tuning: don't guess — change the parameter the metric points at.

There are two thread pools: num.network.threads (socket read/write) and num.io.threads (disk/replication work), joined by the queued.max.requests queue.
When RequestHandlerAvgIdlePercent nears 0, raise num.io.threads; when NetworkProcessorAvgIdlePercent nears 0, raise num.network.threads — one at a time, measuring as you go.
On high-bandwidth, high-latency links, size socket buffers to the BDP or set -1 to delegate to OS autotuning.
If URP won't drop, raise num.replica.fetchers to parallelize replication fetches — but first rule out hardware saturation.
Leave log.flush.* alone. Durability is the job of acks=all + min.insync.replicas replication, and flush is left to the OS page cache (Part 4).
Partition count is the biggest lever, but it isn't free. Size it as target throughput ÷ per-partition throughput, while weighing file handles, recovery time, and rebalance cost. Easy to increase, hard to decrease, and adding partitions breaks key→partition stability (Part 10).

In the final Part 3, we'll cover the last layer — the OS and hardware (page cache, disk/filesystem, JVM/GC, network kernel parameters) — and tie the settings from Parts 1–3 into scenario profiles (high-throughput / low-latency / high-durability). For what to look at first in production, see Part 9 (incident runbook / key metrics) of the series; for throughput wobble from rebalances, see Part 6 (consumer rebalances); and for producer/consumer tuning itself, see Performance Part 1.

References

Apache Kafka Documentation — Broker Configs: https://kafka.apache.org/documentation/#brokerconfigs

Apache Kafka Documentation — Monitoring: https://kafka.apache.org/documentation/#monitoring

— The Data Dynamics Engineering Team