Blog
kafkaperformanceoperating-systemjvmtuningcapacity-planning

[Kafka Performance 3] OS, Hardware, and Combined Tuning Profiles — Throughput vs Latency vs Durability

From why Kafka is fast (sequential I/O, the page cache, zero-copy) through OS, disk, JVM/GC, and network tuning, to three combined tuning profiles — throughput, latency, durability — with concrete configuration keys. The finale of the Kafka Performance series.

Data DynamicsJune 22, 202619 min read

The question people most often ask when they first operate Kafka is, "Why is it so fast?" A messaging system that writes to disk delivers throughput rivaling an in-memory cache — it almost looks like magic. But there is no magic. Kafka's speed comes from a design that borrows the operating system's capabilities directly rather than fighting them. That is why half of Kafka tuning lives not in broker settings but "outside Kafka" — in the OS, the disks, the JVM. If Part 1 covered the benchmarking methodology and Part 2 covered broker-internal threads and flush behavior, this finale places all of it on top of the OS and hardware and ties it together into combined tuning profiles around three goals: throughput, latency, and durability.

What you'll learn in this post

  • The real reasons Kafka is fast: sequential disk I/O, the OS page cache, zero-copy (sendfile)
  • Why you keep the heap small and yield the rest of RAM to the page cache
  • OS-level settings: vm.swappiness, filesystem (XFS), noatime, ulimit, TCP buffers
  • The trade-offs of JBOD vs RAID disk layout
  • G1GC-based GC tuning and the cascading failures that long pauses cause
  • Three combined tuning profiles (producer + consumer + broker): throughput, latency, durability

1. Why Kafka Is Fast — and What That Implies for Tuning

To understand Kafka's performance, you first have to take apart the cliché that "disks are slow." Disks are slow for random access (seeks); for sequential access, modern disks and the OS deliver remarkable throughput. Kafka builds on this fact at its very root.

Sequential disk I/O

A Kafka log is a set of append-only segment files. Messages are appended in order to the end of the partition log, and consumers read forward in offset order. Unlike a database that scatters writes across the disk to update indexes, Kafka pushes the disk head in a single direction. The result: hundreds of MB/s of sequential writes even on HDDs, and more on SSD/NVMe.

The OS page cache

Kafka does not cache data itself. Instead it relies entirely on the OS page cache. When a broker writes a message, the OS stores it in the page cache and flushes it to disk asynchronously; when a consumer reads a recent message, it is served straight from the page cache, not the disk. Because most consumers follow the tail of the log, in practice a large share of reads are handled from RAM without touching the disk at all.

The implication of this design is decisive: the page cache is not a "free cache" — it competes with the JVM heap for RAM.

Zero-copy (sendfile)

Sending a file to the network the traditional way copies the data four times — disk → kernel buffer → application buffer → socket buffer → NIC — with a kernel/user mode switch each time. When sending data to consumers, Kafka uses sendfile (zero-copy) to stream page-cache data straight to the socket without passing through application memory.

[Traditional copy path]
disk → page cache → app buffer → socket buffer → NIC   (4 copies, many context switches)
 
[zero-copy: sendfile]
disk → page cache ──────────────────→ NIC              (bypasses app memory)

Zero-copy is most effective when the data is already in the page cache. In other words, Kafka is fastest when the page cache is well warmed.

Implication: small heap, RAM for the page cache

Putting the three mechanisms together yields a single practical rule: keep the JVM heap modest (e.g., ~5–6 GB) and yield the rest of RAM to the OS page cache.

Wrong intuitionWhat actually happens
"A bigger heap is faster"A bigger heap shrinks the page cache, lowering the disk-hit rate, and lengthens GC pauses
"Kafka caches the data"Kafka relies on the OS page cache. The heap is not a message-body cache
"Spare RAM should go to the heap"The OS automatically uses spare RAM as page cache — and that is faster

On a broker with 32 GB of memory, a 5–6 GB heap is plenty, and the remaining ~26 GB should be left for the page cache. The broker flush behavior we examined in Part 2 (the argument that you should not force log.flush.interval.messages and friends, but leave write-back to the OS) is rooted in the same philosophy: trust the page cache.


2. OS-Level Settings

No matter how much you tune broker settings, it all collapses if the underlying OS swaps out the page cache or runs out of file descriptors. On production brokers, do not leave the following on their defaults — set them explicitly.

vm.swappiness — turn swap nearly off

vm.swappiness controls how aggressively the kernel pushes memory out to swap. At the default (usually 60), the kernel pushes process memory to swap to free up page cache — and when JVM heap pages get swapped, a single GC has to swap them back in from disk, causing second-scale pauses. Conversely, swapping out the page cache removes Kafka's core advantage.

# Don't disable swap entirely, but use it only as a last resort
sudo sysctl -w vm.swappiness=1
echo 'vm.swappiness=1' | sudo tee -a /etc/sysctl.conf

1 means "do not swap unless we are truly on the brink of OOM." 0 makes some kernels invoke the OOM killer more aggressively, so 1 is the safe choice.

Filesystem choice and mount options

ItemRecommendedWhy
FilesystemXFSStrong with large sequential I/O and parallel writes; abundant Kafka operational track record
Mount optionnoatimeDon't update access-time metadata on every read, eliminating needless writes
Mount optionnodiratimeSkip directory access-time updates too
# /etc/fstab example — mounting the Kafka log directory
/dev/nvme1n1  /data/kafka-logs  xfs  defaults,noatime,nodiratime  0  0

ext4 also works, but many operational guides, including Confluent's, recommend XFS at scale. Options that force fsync at the filesystem level (data=ordered, etc.) trade durability against performance; prefer to secure durability through Kafka replication (RF) instead.

File-descriptor ulimit

A broker keeps multiple segment files and index files open per partition at the same time. As topics, partitions, and segments grow, the number of open files can reach tens to hundreds of thousands. The default ulimit -n (often 1024) quickly hits its limit, and the broker dies with Too many open files.

# /etc/security/limits.conf
kafka  soft  nofile  100000
kafka  hard  nofile  100000
# Check the current process's open-file count
ls /proc/$(pgrep -f kafka.Kafka)/fd | wc -l

Socket connections also consume file descriptors, so the more clients a cluster has, the more generously you must size this.

Basic TCP/network tuning

On high-bandwidth (10 GbE and up) links or paths with latency (cross-AZ/region replication), the TCP socket buffer becomes the bottleneck. To fill the bandwidth-delay product (BDP), you must enlarge the buffers.

# /etc/sysctl.conf — raise TCP buffers for high-bandwidth links
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.core.somaxconn=4096

On the broker side you can leave socket.send.buffer.bytes / socket.receive.buffer.bytes at -1 (use the OS default) or set them explicitly to match a long-distance replication path.


3. Disk Layout — JBOD vs RAID

Kafka lets you specify multiple directories (multiple disks) in log.dirs, and partitions are spread across them. How you group the disks governs both throughput and durability.

JBOD (Just a Bunch Of Disks)

List each disk independently in log.dirs.

# server.properties — JBOD: four independent disks
log.dirs=/data/disk1/kafka,/data/disk2/kafka,/data/disk3/kafka,/data/disk4/kafka
  • Pros: You can sum the I/O bandwidth of all four disks, giving the highest total throughput. There is no RAID parity overhead either.
  • Cons: If one disk fails, the partitions on that disk go offline. Fortunately, with RF≥2 a replica on another broker takes over leadership and availability is preserved — but that broker needs a disk replacement and re-replication.

RAID

ModeCharacteristicsKafka suitability
RAID 0Striping, no fault toleranceNot recommended — one disk failure = total loss
RAID 10Striping + mirroringRecommended — broker keeps running through a disk failure, at the cost of 50% capacity
RAID 5/6ParityWrite-parity computation overhead hurts write-intensive workloads

RAID 10 keeps a single disk failure from stopping the whole broker, but spends half the capacity on mirroring.

Which to choose

JBOD   →  Max throughput, cost efficiency. Some partitions go offline on disk failure.
          Secure availability at the replication layer with RF≥3 + min.insync.replicas=2.
 
RAID10 →  Broker-level fault tolerance. No interruption on a single disk failure.
          Accept 50% capacity loss. Replication + RAID = double safety.

The crux is the choice of where you secure durability — at the disk or in Kafka replication (RF). As covered in Part ⑦ "Disk-Full Incident Response," when a disk fills up, every partition in that log directory blocks and the broker is at risk. Because JBOD requires per-disk capacity management, per-disk usage monitoring and log.retention.bytes / retention policy matter even more than with RAID.


4. JVM / GC Tuning

A broker runs on the JVM. When GC pauses for a long time, that broker simultaneously falls behind on its heartbeat to the ZooKeeper/KRaft controller, follower fetch responses, and consumer group membership all at once. A GC pause is not a mere delay — it is the trigger for a cascade of failures.

Use the G1GC defaults, but aim for "short, predictable pauses"

The recommended collector for current Kafka/JDK combinations is G1GC. Rather than one giant pause, it takes frequent short pauses in small units, which suits latency-sensitive workloads like Kafka.

# KAFKA_HEAP_OPTS — small heap (yield RAM to the page cache)
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
 
# KAFKA_JVM_PERFORMANCE_OPTS — G1GC, target pause 20ms
export KAFKA_JVM_PERFORMANCE_OPTS="\
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=20 \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:+ExplicitGCInvokesConcurrent \
-XX:G1HeapRegionSize=16M"

Set -Xms and -Xmx equal to eliminate pauses from heap resizing, and let G1 auto-tune toward the pause goal via MaxGCPauseMillis.

The cascade a long GC pause triggers

Long Full GC pause (e.g., several seconds)

   ├─ Missed consumer heartbeats → session timeout → rebalance (Series ⑥ Rebalance Storm)
   ├─ Follower fetch delay → exceeds replica.lag.time.max.ms → ISR shrink (Series ⑨ ISR shrink)
   └─ acks=all producers hit NotEnoughReplicas as ISR falls short

In other words, oversizing the heap is, counterintuitively, dangerous. A large heap means more objects for a single GC to sweep, so pauses grow longer and the cascade above becomes more likely. Resist the temptation of "we have lots of RAM, so let's make the heap big too": it is safer to keep the heap small and monitor the pause distribution with GC logs.

# Enable GC logging (observe pause-time distribution)
export KAFKA_GC_LOG_OPTS="-Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime:filecount=10,filesize=100M"

5. Network and Placement

NIC bandwidth as a real ceiling

If throughput stops climbing past some point in your benchmark, the culprit is often not the disk but NIC bandwidth. Once you factor in replication, the network load is larger than it seems.

e.g., RF=3, producer ingress 300 MB/s, network on one broker
  - Ingress (producer → leader):              ~100 MB/s
  - Replication out (leader → 2 followers):    ~200 MB/s
  - Consumer egress (follower/leader → consumers): variable (scales with number of consumer groups)
  → 1 GbE (~125 MB/s) is immediately the limit. 10 GbE and up recommended.

Replication consumes (RF-1)× more network than producer ingress, and consumer fan-out piles on top. Do not leave out of your capacity plan the fact that more consumer groups multiply the egress bandwidth.

Cross-AZ/region placement cost

If you spread brokers across multiple availability zones (AZs) for durability, replication traffic crosses AZ boundaries, creating data-transfer cost and extra latency. Across regions, the cost and latency are larger still.

  • Cross-AZ: latency is usually a few ms, but cloud providers charge for inter-AZ transfer. With acks=all, the follower-ACK round trip is directly reflected in latency.
  • Cross-region synchronous replication: not recommended. RTTs of tens of ms or more are added to every write, collapsing throughput. The standard for cross-region is asynchronous replication such as MirrorMaker 2, covered in the separate DR series (① DR Design RPO/RTO, ⑮ MirrorMaker 2 Setup).

Using broker.rack and rack-aware replica placement, you can ensure replicas of the same partition land in different AZs, preserving data even through an AZ failure.


6. The Payoff — Three Combined Tuning Profiles

It is time to tie everything so far (page cache, OS, disk, JVM, network, and the producer/broker settings from Parts 1 and 2) into one. In practice, tuning ultimately converges on a single question: "Of throughput, latency, and durability, which does our workload want most?" Here are three representative profiles, expressed in concrete configuration keys.

The values in the tables below are starting points — not absolute answers, but baselines you must validate in your own environment with the benchmarking method from Part 1 (kafka-producer-perf-test / kafka-consumer-perf-test, changing one variable at a time).

6.1 Throughput-Optimized Profile

For cases like bulk log ingestion or analytics pipelines, where the goal is "as many messages per second as possible" and tens to hundreds of ms of latency is acceptable.

LayerSettingValueIntent
Producerbatch.size256000512000Maximize messages per request with large batches
Producerlinger.ms50100Wait for batches to fill more
Producercompression.typezstd or lz4Cut network/disk transfer volume
Producerbuffer.memory134217728 (128MB)Enough buffer to hold large batches
Produceracks1Favor throughput (accept durability cost — see note)
Consumerfetch.min.bytes1048576 (1MB)Fetch large amounts at once, reducing request count
Consumerfetch.max.wait.ms500Allow waiting until min.bytes fills
Consumermax.partition.fetch.bytes5242880 (5MB)Allow large per-partition fetches
Brokernum.io.threadsdisk count × 2–3Widen disk I/O parallelism
Brokernum.network.threadsproportional to coresScale network handling
Brokernum.replica.fetchers48Improve follower replication parallelism

Note (durability cost): acks=1 treats success once the leader receives the message, so if the leader dies before replication, the message can be lost (see Series ④ acks/min.insync.replicas, ③ Message-Loss Scenarios). If you cannot tolerate loss at all, keep acks=all and make up throughput with batching and compression.

6.2 Latency-Optimized Profile

For cases like payment notifications or real-time trading signals, where the goal is "even one record, fast." You concede some throughput.

LayerSettingValueIntent
Producerlinger.ms0Send immediately without waiting to accumulate
Producerbatch.size16384 (default) or smallerMinimize batch waiting
Producercompression.typelz4 or noneMinimize compression CPU delay (lz4 is fast)
Producermax.in.flight.requests.per.connection5 (+ enable.idempotence=true)Keep pipelining while preserving order via idempotence
Produceracks1 (latency-first) / all (durability too)Choose the trade-off
Consumerfetch.min.bytes1Return as soon as any data exists
Consumerfetch.max.wait.ms1050Minimize fetch waiting
Consumermax.poll.records100500Fewer records per poll shortens processing latency
Brokerreplica.fetch.wait.max.mssmallShorten replication delay
Topicpartition countspread sufficientlyEase queuing latency through consumer parallelism

The key is to remove all "waiting." With linger.ms=0, fetch.min.bytes=1, and a small fetch.max.wait.ms, nothing waits anywhere for a batch to fill. Just be sure to enable enable.idempotence=true so order is preserved on retries even at max.in.flight=5 (Series ⑩ Ordering Guarantees).

6.3 Durability-Optimized Profile

For cases like financial ledgers or audit logs, where "absolutely no loss" is the top priority. You choose data safety even at some cost to throughput and latency.

LayerSettingValueIntent
ProduceracksallSuccess only once all ISRs receive it
Producerenable.idempotencetruePrevent duplicates and order breaks on retry
ProducerretriesInteger.MAX_VALUERetry persistently through transient failures
Producermax.in.flight.requests.per.connection5 (with idempotence)Keep ordering guarantees
Producerdelivery.timeout.mssufficiently largeLet retries run to completion
Topicreplication.factor (RF)3Preserve data through two simultaneous broker failures
Topicmin.insync.replicas2Reject writes when fewer than 2 ISRs (avoid half-written data)
Broker/Topicunclean.leader.election.enablefalseBlock a lagging replica from becoming leader and losing data

This combination gathers in one place the safeguards covered in Series ③ (Message-Loss Scenarios), ④ (acks / min.insync.replicas), and ⑤ (Unclean Leader Election). In particular, min.insync.replicas=2 and acks=all must work as a pair to mean anything. With RF=3 and MISR=2, writes continue even if one broker dies; if two die, writes are rejected — choosing explicit failure over unguarded loss.

Profiles at a glance

GoalThroughputLatencyDurabilityRepresentative workload
Throughput-optimized○ (◎ if acks=all)Log ingestion, batch analytics
Latency-optimizedReal-time notifications, trading
Durability-optimizedFinancial ledgers, audit logs

7. Mapping the Three Goals to Their Dominant Knobs

Here is a single diagram of which "knobs" each profile turns, and in which direction.

Loading diagram…

The key point: all three branches sit on the same shared foundation (OS and hardware). Whichever profile you pick, if the floor — page cache, GC, disk, NIC — is weak, the upper-layer settings will not behave as intended.


8. Series Wrap-Up — Performance Parts 1–3

The Kafka Performance mini-series boils down to one line: "measure → change one variable at a time → respect the OS." Here are the three parts at a glance.

PartTopicCore message
① Benchmarking methodologyMeasure throughput/latency with kafka-*-perf-testTuning without measurement is guessing. Change one variable at a time
② Broker internals, threads, flushI/O and network threads, page-cache flushDon't force flushes — trust OS write-back
③ OS, hardware, combined profilesPage cache, zero-copy, GC, disk, three profilesHalf of tuning lives outside Kafka (OS and hardware)

The principles running through all three parts do not change.

  • Measure: the benchmark, not intuition, is the truth. Validate with a load resembling production.
  • Change one variable: change batch.size and acks at once and you won't know which one mattered.
  • Respect the OS: small heap, ample page cache, short GC, and recognize disk and NIC as ceilings.

Performance tuning is part of broader operations. Chasing throughput by dropping to acks=1 and inviting loss (Series ③/④); long GC pauses triggering a rebalance storm (⑥) and ISR shrink (⑨); a disk-full event (⑦) stalling partitions — all of these are evidence that "performance" and "stability" are one body. Durability and replication design also connect directly to the DR series (① DR Design RPO/RTO, ⑮ MirrorMaker 2 Setup, ⑯ Offset Translation, ⑰ Failover/Failback Runbook). Don't view performance, operations, and DR separately — start from a single set of workload requirements and tune them together.


Wrapping up

  • Kafka's speed comes from borrowing OS capabilities directly — sequential I/O + OS page cache + zero-copy (sendfile). That is why the first principle is to keep the heap small and yield RAM to the page cache.
  • At the OS level, vm.swappiness=1, XFS + noatime, generous file-descriptor ulimits, and high-bandwidth TCP buffers are items you should not leave at their defaults.
  • For disks, choose between JBOD (max throughput) and RAID10 (broker-level fault tolerance) based on your workload and replication strategy.
  • A GC pause is the trigger for a cascade of failures. Aim for short, predictable pauses with G1GC, and don't oversize the heap.
  • The three profiles — throughput, latency, durability — are starting points. Always validate them in your environment with the Part 1 benchmarking method.
  • The one line running through the series: measure, change one variable at a time, and respect the OS.

References


— The Data Dynamics Engineering Team