[Kafka Performance 3] OS, Hardware, and Combined Tuning Profiles — Throughput vs Latency vs Durability
From why Kafka is fast (sequential I/O, the page cache, zero-copy) through OS, disk, JVM/GC, and network tuning, to three combined tuning profiles — throughput, latency, durability — with concrete configuration keys. The finale of the Kafka Performance series.
The question people most often ask when they first operate Kafka is, "Why is it so fast?" A messaging system that writes to disk delivers throughput rivaling an in-memory cache — it almost looks like magic. But there is no magic. Kafka's speed comes from a design that borrows the operating system's capabilities directly rather than fighting them. That is why half of Kafka tuning lives not in broker settings but "outside Kafka" — in the OS, the disks, the JVM. If Part 1 covered the benchmarking methodology and Part 2 covered broker-internal threads and flush behavior, this finale places all of it on top of the OS and hardware and ties it together into combined tuning profiles around three goals: throughput, latency, and durability.
What you'll learn in this post
- The real reasons Kafka is fast: sequential disk I/O, the OS page cache, zero-copy (
sendfile)- Why you keep the heap small and yield the rest of RAM to the page cache
- OS-level settings:
vm.swappiness, filesystem (XFS),noatime, ulimit, TCP buffers- The trade-offs of JBOD vs RAID disk layout
- G1GC-based GC tuning and the cascading failures that long pauses cause
- Three combined tuning profiles (producer + consumer + broker): throughput, latency, durability
1. Why Kafka Is Fast — and What That Implies for Tuning
To understand Kafka's performance, you first have to take apart the cliché that "disks are slow." Disks are slow for random access (seeks); for sequential access, modern disks and the OS deliver remarkable throughput. Kafka builds on this fact at its very root.
Sequential disk I/O
A Kafka log is a set of append-only segment files. Messages are appended in order to the end of the partition log, and consumers read forward in offset order. Unlike a database that scatters writes across the disk to update indexes, Kafka pushes the disk head in a single direction. The result: hundreds of MB/s of sequential writes even on HDDs, and more on SSD/NVMe.
The OS page cache
Kafka does not cache data itself. Instead it relies entirely on the OS page cache. When a broker writes a message, the OS stores it in the page cache and flushes it to disk asynchronously; when a consumer reads a recent message, it is served straight from the page cache, not the disk. Because most consumers follow the tail of the log, in practice a large share of reads are handled from RAM without touching the disk at all.
The implication of this design is decisive: the page cache is not a "free cache" — it competes with the JVM heap for RAM.
Zero-copy (sendfile)
Sending a file to the network the traditional way copies the data four times — disk → kernel buffer → application buffer → socket buffer → NIC — with a kernel/user mode switch each time. When sending data to consumers, Kafka uses sendfile (zero-copy) to stream page-cache data straight to the socket without passing through application memory.
[Traditional copy path]
disk → page cache → app buffer → socket buffer → NIC (4 copies, many context switches)
[zero-copy: sendfile]
disk → page cache ──────────────────→ NIC (bypasses app memory)Zero-copy is most effective when the data is already in the page cache. In other words, Kafka is fastest when the page cache is well warmed.
Implication: small heap, RAM for the page cache
Putting the three mechanisms together yields a single practical rule: keep the JVM heap modest (e.g., ~5–6 GB) and yield the rest of RAM to the OS page cache.
| Wrong intuition | What actually happens |
|---|---|
| "A bigger heap is faster" | A bigger heap shrinks the page cache, lowering the disk-hit rate, and lengthens GC pauses |
| "Kafka caches the data" | Kafka relies on the OS page cache. The heap is not a message-body cache |
| "Spare RAM should go to the heap" | The OS automatically uses spare RAM as page cache — and that is faster |
On a broker with 32 GB of memory, a 5–6 GB heap is plenty, and the remaining ~26 GB should be left for the page cache. The broker flush behavior we examined in Part 2 (the argument that you should not force log.flush.interval.messages and friends, but leave write-back to the OS) is rooted in the same philosophy: trust the page cache.
2. OS-Level Settings
No matter how much you tune broker settings, it all collapses if the underlying OS swaps out the page cache or runs out of file descriptors. On production brokers, do not leave the following on their defaults — set them explicitly.
vm.swappiness — turn swap nearly off
vm.swappiness controls how aggressively the kernel pushes memory out to swap. At the default (usually 60), the kernel pushes process memory to swap to free up page cache — and when JVM heap pages get swapped, a single GC has to swap them back in from disk, causing second-scale pauses. Conversely, swapping out the page cache removes Kafka's core advantage.
# Don't disable swap entirely, but use it only as a last resort
sudo sysctl -w vm.swappiness=1
echo 'vm.swappiness=1' | sudo tee -a /etc/sysctl.conf1 means "do not swap unless we are truly on the brink of OOM." 0 makes some kernels invoke the OOM killer more aggressively, so 1 is the safe choice.
Filesystem choice and mount options
| Item | Recommended | Why |
|---|---|---|
| Filesystem | XFS | Strong with large sequential I/O and parallel writes; abundant Kafka operational track record |
| Mount option | noatime | Don't update access-time metadata on every read, eliminating needless writes |
| Mount option | nodiratime | Skip directory access-time updates too |
# /etc/fstab example — mounting the Kafka log directory
/dev/nvme1n1 /data/kafka-logs xfs defaults,noatime,nodiratime 0 0ext4 also works, but many operational guides, including Confluent's, recommend XFS at scale. Options that force fsync at the filesystem level (data=ordered, etc.) trade durability against performance; prefer to secure durability through Kafka replication (RF) instead.
File-descriptor ulimit
A broker keeps multiple segment files and index files open per partition at the same time. As topics, partitions, and segments grow, the number of open files can reach tens to hundreds of thousands. The default ulimit -n (often 1024) quickly hits its limit, and the broker dies with Too many open files.
# /etc/security/limits.conf
kafka soft nofile 100000
kafka hard nofile 100000# Check the current process's open-file count
ls /proc/$(pgrep -f kafka.Kafka)/fd | wc -lSocket connections also consume file descriptors, so the more clients a cluster has, the more generously you must size this.
Basic TCP/network tuning
On high-bandwidth (10 GbE and up) links or paths with latency (cross-AZ/region replication), the TCP socket buffer becomes the bottleneck. To fill the bandwidth-delay product (BDP), you must enlarge the buffers.
# /etc/sysctl.conf — raise TCP buffers for high-bandwidth links
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.core.somaxconn=4096On the broker side you can leave socket.send.buffer.bytes / socket.receive.buffer.bytes at -1 (use the OS default) or set them explicitly to match a long-distance replication path.
3. Disk Layout — JBOD vs RAID
Kafka lets you specify multiple directories (multiple disks) in log.dirs, and partitions are spread across them. How you group the disks governs both throughput and durability.
JBOD (Just a Bunch Of Disks)
List each disk independently in log.dirs.
# server.properties — JBOD: four independent disks
log.dirs=/data/disk1/kafka,/data/disk2/kafka,/data/disk3/kafka,/data/disk4/kafka- Pros: You can sum the I/O bandwidth of all four disks, giving the highest total throughput. There is no RAID parity overhead either.
- Cons: If one disk fails, the partitions on that disk go offline. Fortunately, with RF≥2 a replica on another broker takes over leadership and availability is preserved — but that broker needs a disk replacement and re-replication.
RAID
| Mode | Characteristics | Kafka suitability |
|---|---|---|
| RAID 0 | Striping, no fault tolerance | Not recommended — one disk failure = total loss |
| RAID 10 | Striping + mirroring | Recommended — broker keeps running through a disk failure, at the cost of 50% capacity |
| RAID 5/6 | Parity | Write-parity computation overhead hurts write-intensive workloads |
RAID 10 keeps a single disk failure from stopping the whole broker, but spends half the capacity on mirroring.
Which to choose
JBOD → Max throughput, cost efficiency. Some partitions go offline on disk failure.
Secure availability at the replication layer with RF≥3 + min.insync.replicas=2.
RAID10 → Broker-level fault tolerance. No interruption on a single disk failure.
Accept 50% capacity loss. Replication + RAID = double safety.The crux is the choice of where you secure durability — at the disk or in Kafka replication (RF). As covered in Part ⑦ "Disk-Full Incident Response," when a disk fills up, every partition in that log directory blocks and the broker is at risk. Because JBOD requires per-disk capacity management, per-disk usage monitoring and log.retention.bytes / retention policy matter even more than with RAID.
4. JVM / GC Tuning
A broker runs on the JVM. When GC pauses for a long time, that broker simultaneously falls behind on its heartbeat to the ZooKeeper/KRaft controller, follower fetch responses, and consumer group membership all at once. A GC pause is not a mere delay — it is the trigger for a cascade of failures.
Use the G1GC defaults, but aim for "short, predictable pauses"
The recommended collector for current Kafka/JDK combinations is G1GC. Rather than one giant pause, it takes frequent short pauses in small units, which suits latency-sensitive workloads like Kafka.
# KAFKA_HEAP_OPTS — small heap (yield RAM to the page cache)
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
# KAFKA_JVM_PERFORMANCE_OPTS — G1GC, target pause 20ms
export KAFKA_JVM_PERFORMANCE_OPTS="\
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=20 \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:+ExplicitGCInvokesConcurrent \
-XX:G1HeapRegionSize=16M"Set -Xms and -Xmx equal to eliminate pauses from heap resizing, and let G1 auto-tune toward the pause goal via MaxGCPauseMillis.
The cascade a long GC pause triggers
Long Full GC pause (e.g., several seconds)
│
├─ Missed consumer heartbeats → session timeout → rebalance (Series ⑥ Rebalance Storm)
├─ Follower fetch delay → exceeds replica.lag.time.max.ms → ISR shrink (Series ⑨ ISR shrink)
└─ acks=all producers hit NotEnoughReplicas as ISR falls shortIn other words, oversizing the heap is, counterintuitively, dangerous. A large heap means more objects for a single GC to sweep, so pauses grow longer and the cascade above becomes more likely. Resist the temptation of "we have lots of RAM, so let's make the heap big too": it is safer to keep the heap small and monitor the pause distribution with GC logs.
# Enable GC logging (observe pause-time distribution)
export KAFKA_GC_LOG_OPTS="-Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime:filecount=10,filesize=100M"5. Network and Placement
NIC bandwidth as a real ceiling
If throughput stops climbing past some point in your benchmark, the culprit is often not the disk but NIC bandwidth. Once you factor in replication, the network load is larger than it seems.
e.g., RF=3, producer ingress 300 MB/s, network on one broker
- Ingress (producer → leader): ~100 MB/s
- Replication out (leader → 2 followers): ~200 MB/s
- Consumer egress (follower/leader → consumers): variable (scales with number of consumer groups)
→ 1 GbE (~125 MB/s) is immediately the limit. 10 GbE and up recommended.Replication consumes (RF-1)× more network than producer ingress, and consumer fan-out piles on top. Do not leave out of your capacity plan the fact that more consumer groups multiply the egress bandwidth.
Cross-AZ/region placement cost
If you spread brokers across multiple availability zones (AZs) for durability, replication traffic crosses AZ boundaries, creating data-transfer cost and extra latency. Across regions, the cost and latency are larger still.
- Cross-AZ: latency is usually a few ms, but cloud providers charge for inter-AZ transfer. With
acks=all, the follower-ACK round trip is directly reflected in latency. - Cross-region synchronous replication: not recommended. RTTs of tens of ms or more are added to every write, collapsing throughput. The standard for cross-region is asynchronous replication such as MirrorMaker 2, covered in the separate DR series (① DR Design RPO/RTO, ⑮ MirrorMaker 2 Setup).
Using broker.rack and rack-aware replica placement, you can ensure replicas of the same partition land in different AZs, preserving data even through an AZ failure.
6. The Payoff — Three Combined Tuning Profiles
It is time to tie everything so far (page cache, OS, disk, JVM, network, and the producer/broker settings from Parts 1 and 2) into one. In practice, tuning ultimately converges on a single question: "Of throughput, latency, and durability, which does our workload want most?" Here are three representative profiles, expressed in concrete configuration keys.
The values in the tables below are starting points — not absolute answers, but baselines you must validate in your own environment with the benchmarking method from Part 1 (
kafka-producer-perf-test/kafka-consumer-perf-test, changing one variable at a time).
6.1 Throughput-Optimized Profile
For cases like bulk log ingestion or analytics pipelines, where the goal is "as many messages per second as possible" and tens to hundreds of ms of latency is acceptable.
| Layer | Setting | Value | Intent |
|---|---|---|---|
| Producer | batch.size | 256000–512000 | Maximize messages per request with large batches |
| Producer | linger.ms | 50–100 | Wait for batches to fill more |
| Producer | compression.type | zstd or lz4 | Cut network/disk transfer volume |
| Producer | buffer.memory | 134217728 (128MB) | Enough buffer to hold large batches |
| Producer | acks | 1 | Favor throughput (accept durability cost — see note) |
| Consumer | fetch.min.bytes | 1048576 (1MB) | Fetch large amounts at once, reducing request count |
| Consumer | fetch.max.wait.ms | 500 | Allow waiting until min.bytes fills |
| Consumer | max.partition.fetch.bytes | 5242880 (5MB) | Allow large per-partition fetches |
| Broker | num.io.threads | disk count × 2–3 | Widen disk I/O parallelism |
| Broker | num.network.threads | proportional to cores | Scale network handling |
| Broker | num.replica.fetchers | 4–8 | Improve follower replication parallelism |
Note (durability cost):
acks=1treats success once the leader receives the message, so if the leader dies before replication, the message can be lost (see Series ④ acks/min.insync.replicas, ③ Message-Loss Scenarios). If you cannot tolerate loss at all, keepacks=alland make up throughput with batching and compression.
6.2 Latency-Optimized Profile
For cases like payment notifications or real-time trading signals, where the goal is "even one record, fast." You concede some throughput.
| Layer | Setting | Value | Intent |
|---|---|---|---|
| Producer | linger.ms | 0 | Send immediately without waiting to accumulate |
| Producer | batch.size | 16384 (default) or smaller | Minimize batch waiting |
| Producer | compression.type | lz4 or none | Minimize compression CPU delay (lz4 is fast) |
| Producer | max.in.flight.requests.per.connection | 5 (+ enable.idempotence=true) | Keep pipelining while preserving order via idempotence |
| Producer | acks | 1 (latency-first) / all (durability too) | Choose the trade-off |
| Consumer | fetch.min.bytes | 1 | Return as soon as any data exists |
| Consumer | fetch.max.wait.ms | 10–50 | Minimize fetch waiting |
| Consumer | max.poll.records | 100–500 | Fewer records per poll shortens processing latency |
| Broker | replica.fetch.wait.max.ms | small | Shorten replication delay |
| Topic | partition count | spread sufficiently | Ease queuing latency through consumer parallelism |
The key is to remove all "waiting." With linger.ms=0, fetch.min.bytes=1, and a small fetch.max.wait.ms, nothing waits anywhere for a batch to fill. Just be sure to enable enable.idempotence=true so order is preserved on retries even at max.in.flight=5 (Series ⑩ Ordering Guarantees).
6.3 Durability-Optimized Profile
For cases like financial ledgers or audit logs, where "absolutely no loss" is the top priority. You choose data safety even at some cost to throughput and latency.
| Layer | Setting | Value | Intent |
|---|---|---|---|
| Producer | acks | all | Success only once all ISRs receive it |
| Producer | enable.idempotence | true | Prevent duplicates and order breaks on retry |
| Producer | retries | Integer.MAX_VALUE | Retry persistently through transient failures |
| Producer | max.in.flight.requests.per.connection | 5 (with idempotence) | Keep ordering guarantees |
| Producer | delivery.timeout.ms | sufficiently large | Let retries run to completion |
| Topic | replication.factor (RF) | 3 | Preserve data through two simultaneous broker failures |
| Topic | min.insync.replicas | 2 | Reject writes when fewer than 2 ISRs (avoid half-written data) |
| Broker/Topic | unclean.leader.election.enable | false | Block a lagging replica from becoming leader and losing data |
This combination gathers in one place the safeguards covered in Series ③ (Message-Loss Scenarios), ④ (acks / min.insync.replicas), and ⑤ (Unclean Leader Election). In particular, min.insync.replicas=2 and acks=all must work as a pair to mean anything. With RF=3 and MISR=2, writes continue even if one broker dies; if two die, writes are rejected — choosing explicit failure over unguarded loss.
Profiles at a glance
| Goal | Throughput | Latency | Durability | Representative workload |
|---|---|---|---|---|
| Throughput-optimized | ◎ | △ | ○ (◎ if acks=all) | Log ingestion, batch analytics |
| Latency-optimized | ○ | ◎ | ○ | Real-time notifications, trading |
| Durability-optimized | △ | △ | ◎ | Financial ledgers, audit logs |
7. Mapping the Three Goals to Their Dominant Knobs
Here is a single diagram of which "knobs" each profile turns, and in which direction.
The key point: all three branches sit on the same shared foundation (OS and hardware). Whichever profile you pick, if the floor — page cache, GC, disk, NIC — is weak, the upper-layer settings will not behave as intended.
8. Series Wrap-Up — Performance Parts 1–3
The Kafka Performance mini-series boils down to one line: "measure → change one variable at a time → respect the OS." Here are the three parts at a glance.
| Part | Topic | Core message |
|---|---|---|
| ① Benchmarking methodology | Measure throughput/latency with kafka-*-perf-test | Tuning without measurement is guessing. Change one variable at a time |
| ② Broker internals, threads, flush | I/O and network threads, page-cache flush | Don't force flushes — trust OS write-back |
| ③ OS, hardware, combined profiles | Page cache, zero-copy, GC, disk, three profiles | Half of tuning lives outside Kafka (OS and hardware) |
The principles running through all three parts do not change.
- Measure: the benchmark, not intuition, is the truth. Validate with a load resembling production.
- Change one variable: change
batch.sizeandacksat once and you won't know which one mattered. - Respect the OS: small heap, ample page cache, short GC, and recognize disk and NIC as ceilings.
Performance tuning is part of broader operations. Chasing throughput by dropping to acks=1 and inviting loss (Series ③/④); long GC pauses triggering a rebalance storm (⑥) and ISR shrink (⑨); a disk-full event (⑦) stalling partitions — all of these are evidence that "performance" and "stability" are one body. Durability and replication design also connect directly to the DR series (① DR Design RPO/RTO, ⑮ MirrorMaker 2 Setup, ⑯ Offset Translation, ⑰ Failover/Failback Runbook). Don't view performance, operations, and DR separately — start from a single set of workload requirements and tune them together.
Wrapping up
- Kafka's speed comes from borrowing OS capabilities directly — sequential I/O + OS page cache + zero-copy (
sendfile). That is why the first principle is to keep the heap small and yield RAM to the page cache. - At the OS level,
vm.swappiness=1, XFS +noatime, generous file-descriptor ulimits, and high-bandwidth TCP buffers are items you should not leave at their defaults. - For disks, choose between JBOD (max throughput) and RAID10 (broker-level fault tolerance) based on your workload and replication strategy.
- A GC pause is the trigger for a cascade of failures. Aim for short, predictable pauses with G1GC, and don't oversize the heap.
- The three profiles — throughput, latency, durability — are starting points. Always validate them in your environment with the Part 1 benchmarking method.
- The one line running through the series: measure, change one variable at a time, and respect the OS.
References
- Apache Kafka Documentation — Hardware and OS — https://kafka.apache.org/documentation/#hwandos
- Apache Kafka Documentation — Java/JVM — https://kafka.apache.org/documentation/#java
- Confluent. "Kafka Performance: Capacity Planning and Sizing" — https://docs.confluent.io/
- Linux
vm.swappiness/sysctltuning guide — https://www.kernel.org/doc/Documentation/sysctl/vm.txt
— The Data Dynamics Engineering Team