Complete Guide to Linux Kernel Parameter Tuning for Hadoop, Impala, Spark, and NiFi
A comprehensive guide to essential Linux sysctl.conf, GRUB, limits.conf, and disk I/O parameters for big data workload systems (Hadoop, Impala, Spark, NiFi), including parameter meanings, recommended values, and issues caused by misconfiguration.
When operating big data processing systems such as Hadoop, Impala, Spark, and NiFi, failing to properly tune Linux OS-level kernel parameters leads to various problems including performance degradation, OOM (Out of Memory), network timeouts, and file descriptor exhaustion. This guide covers the meaning of each parameter, recommended values, and the actual issues that arise from misconfiguration.
1. Overview
Big data workload systems have fundamentally different resource usage patterns compared to typical web or application servers.
| Characteristic | General Server | Big Data Workload |
|---|---|---|
| Memory | Several GB, uniform usage | Tens to hundreds of GB, massive buffers/caches |
| Network | Small request/response | Large-scale data shuffle, block transfers |
| File Descriptors | Hundreds to thousands | Tens of thousands to hundreds of thousands |
| Processes/Threads | Tens to hundreds | Thousands to tens of thousands |
| Disk I/O | Random I/O dominant | Large sequential I/O |
Linux default kernel parameters are configured for general-purpose servers, so they must be adjusted for big data workloads. Configuration falls into four main areas:
- sysctl.conf — Kernel runtime parameters (memory, network, filesystem)
- GRUB Boot Parameters — Kernel options applied at boot time
- limits.conf — Per-user/process resource limits
- Disk/Filesystem — I/O scheduler, mount options
2. sysctl.conf Parameters
Configured in /etc/sysctl.conf and applied immediately with sysctl -p.
2.1 Memory Related
vm.swappiness
| Item | Details |
|---|---|
| Meaning | Controls how aggressively the kernel uses swap space instead of physical memory (0-100) |
| Default | 60 |
| Recommended | 1 (setting to 0 may trigger OOM Killer immediately) |
vm.swappiness = 1
Issues from Misconfiguration:
- High value (60+): The system uses swap even when memory is available, causing dramatic performance degradation in Hadoop DataNode, Impala Daemon, and Spark Executor. Impala, being a memory-based query engine, can see query response times increase 10-100x when swap is used.
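- Value of 0: On kernel 3.5 and later (including RHEL/CentOS 6.4+), anonymous memory is swapped only as a last resort, so under memory pressure the OOM Killer may terminate processes instead of swapping. DataNode or NameNode may be killed unexpectedly.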
vm.dirty_ratio
| Item | Details |
|---|---|
| Meaning | Maximum percentage of total memory that can hold dirty pages (data not yet written to disk). When reached, processes start writing directly to disk |
| Default | 20 (%) |
| Recommended | 10 to 15 |
vm.dirty_ratio = 10
Issues from Misconfiguration:
- Too high (30+): Dirty pages accumulate massively, then flush all at once causing I/O spikes. HDFS DataNode may fail to send block reports on time, and NiFi FlowFile repository writes may be delayed, triggering back pressure.
- Too low (5 or below): Overly frequent disk writes reduce throughput. Spark shuffle write performance drops significantly.
vm.dirty_background_ratio
| Item | Details |
|---|---|
| Meaning | Threshold at which background flush processes (pdflush/writeback) begin writing dirty pages to disk |
| Default | 10 (%) |
| Recommended | 5 |
vm.dirty_background_ratio = 5
Issues from Misconfiguration:
- Too high: Background flush starts late; when dirty pages reach vm.dirty_ratio, synchronous writes occur and applications stall on I/O wait.
- Set equal to or higher than dirty_ratio: No window is left for background flushing ahead of the synchronous limit. Current kernels silently fall back to half of dirty_ratio in this case, so the configured value does not take effect as written.
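To make the two percentages concrete, the arithmetic below converts them into approximate thresholds. This is only a rough sketch: the kernel derives the thresholds from dirtyable memory (roughly free plus reclaimable pages) rather than total RAM, so real values are somewhat lower. The function name `dirty_threshold_mb` and the 256 GB node are illustrative assumptions.

```bash
# Approximate dirty-page threshold in MB for a given RAM size and ratio.
# Assumption: total RAM is used as an upper bound; the kernel itself
# works from dirtyable memory, which is smaller.
dirty_threshold_mb() {
    # $1 = RAM in MB, $2 = ratio in percent
    echo $(( $1 * $2 / 100 ))
}

ram_mb=262144   # hypothetical node with 256 GB RAM
echo "background flush starts near $(dirty_threshold_mb "$ram_mb" 5) MB"
echo "writers are throttled near   $(dirty_threshold_mb "$ram_mb" 10) MB"
```

On large-memory nodes even 5% is many gigabytes of unwritten data, which is why the recommended ratios are lower than the defaults.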
vm.overcommit_memory
| Item | Details |
|---|---|
| Meaning | Controls the kernel's memory overcommit policy (0: heuristic, 1: always allow, 2: strict limit) |
| Default | 0 |
| Recommended | 1 (for Hadoop/Spark environments) |
vm.overcommit_memory = 1
Issues from Misconfiguration:
- Value 0 (default): The kernel's heuristic may reject memory requests, causing fork() failures. Hadoop MapReduce child tasks or Spark executors fail to create processes, leading to repeated task failures.
- Value 2: Strictly blocks memory allocation beyond swap plus a fixed share of physical memory (vm.overcommit_ratio, 50% by default). JVM-based big data systems reserve large amounts of virtual memory, so processes may fail to start even when actual usage is low.
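For mode 2, the kernel documentation gives the commit limit as swap plus `vm.overcommit_ratio` percent of physical RAM. The sketch below works through that formula. The function name `commit_limit_kb` and the node size are hypothetical, and huge pages (which the kernel subtracts from RAM) are ignored for simplicity.

```bash
# Commit limit in KB under vm.overcommit_memory = 2.
# Formula: swap + RAM * overcommit_ratio / 100
# (huge pages are ignored here for simplicity).
commit_limit_kb() {
    # $1 = RAM in KB, $2 = swap in KB, $3 = vm.overcommit_ratio in percent
    echo $(( $2 + $1 * $3 / 100 ))
}

# Hypothetical node: 128 GB RAM, 8 GB swap, default ratio of 50
commit_limit_kb 134217728 8388608 50
```

On a live system the same figure appears as CommitLimit in /proc/meminfo, next to Committed_AS, which shows how much has already been promised to processes.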
vm.zone_reclaim_mode
| Item | Details |
|---|---|
| Meaning | Determines whether to reclaim memory locally when a NUMA node runs low, or allocate from remote nodes |
| Default | 0 or 1 (varies by distribution) |
| Recommended | 0 |
vm.zone_reclaim_mode = 0
Issues from Misconfiguration:
- Value 1: Aggressively reclaims page cache from the local NUMA node, forcing data to be re-read from disk. HDFS read performance degrades significantly, and Impala queries experience unpredictable performance jitter. Both Cloudera and Hortonworks recommend setting this to 0.
2.2 Network Related
net.core.somaxconn
| Item | Details |
|---|---|
| Meaning | Maximum length of the listen socket backlog queue — how many concurrent connection requests can wait |
| Default | 128 (4096 on kernel 5.4 and later) |
| Recommended | 4096 or higher |
net.core.somaxconn = 4096
Issues from Misconfiguration:
- Keeping default (128): In high-concurrency environments, connection refused errors occur. NiFi's ListenHTTP processor fails to receive data, or connections to Spark Shuffle Service are refused, causing shuffle fetch failures.
net.core.rmem_max / net.core.wmem_max
| Item | Details |
|---|---|
| Meaning | Maximum receive (rmem) / send (wmem) buffer size for sockets (bytes) |
| Default | 212992 (~208KB) |
| Recommended | 16777216 (16MB) or higher |
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
Issues from Misconfiguration:
- Keeping default: Network bottlenecks occur during HDFS block transfers (128MB blocks), Spark shuffle, and Impala broadcast joins. TCP window scaling fails to work properly, and actual throughput on 10Gbps networks may stall at 1-2Gbps.
net.ipv4.tcp_rmem / net.ipv4.tcp_wmem
| Item | Details |
|---|---|
| Meaning | TCP socket receive/send buffer sizes (min, default, max) |
| Default | 4096 87380 6291456 |
| Recommended | 4096 65536 16777216 |
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
Issues from Misconfiguration:
- Low max value: TCP windows cannot expand sufficiently during large data transfers, failing to utilize network bandwidth. Inter-node data replication (HDFS replication) slows down, increasing data ingestion time.
- Default value too high: All TCP connections allocate large buffers by default, wasting memory. DataNodes with thousands of connections may experience memory shortages.
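The buffer maximums follow from the bandwidth-delay product (BDP): a single TCP connection needs roughly bandwidth times round-trip time of buffer space to keep the link full. The sketch below does that arithmetic. The function name `bdp_bytes` and the round-trip times are illustrative assumptions, not measurements.

```bash
# Bandwidth-delay product in bytes.
# $1 = link speed in Gbit/s, $2 = round-trip time in microseconds
bdp_bytes() {
    # Gbit/s * 1e9 / 8 gives bytes per second; multiplying by the RTT
    # in seconds (us / 1e6) simplifies to gbps * rtt_us * 125.
    echo $(( $1 * $2 * 125 ))
}

bdp_bytes 10 1000    # 10 Gbit/s at 1 ms RTT
bdp_bytes 10 10000   # 10 Gbit/s at 10 ms RTT
```

Both results fit under the 16777216-byte maximum, while the second already exceeds the 6291456-byte default maximum, which is where the throughput ceiling on fast links comes from.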
net.ipv4.tcp_max_syn_backlog
| Item | Details |
|---|---|
| Meaning | Maximum size of the SYN_RECEIVED connection wait queue |
| Default | 1024 |
| Recommended | 4096 or higher |
net.ipv4.tcp_max_syn_backlog = 4096
Issues from Misconfiguration:
- Keeping default: When all nodes simultaneously attempt connections after a cluster restart, the SYN queue overflows and connections are dropped. YARN NodeManager re-registration and Impala StateStore subscription renewal fail, extending cluster recovery time.
net.ipv4.ip_local_port_range
| Item | Details |
|---|---|
| Meaning | Range of ephemeral ports that can be assigned to client sockets |
| Default | 32768 60999 |
| Recommended | 10000 65535 |
net.ipv4.ip_local_port_range = 10000 65535
Issues from Misconfiguration:
- Narrow port range: "Cannot assign requested address" errors occur during Spark shuffle and NiFi Site-to-Site transfers that create massive outbound connections. Ports in TIME_WAIT state become exhausted when many connections are rapidly opened and closed.
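How much headroom the wider range buys is simple arithmetic, sketched below. The function name `port_count` is a hypothetical helper for illustration.

```bash
# Number of ephemeral ports in an inclusive range.
# $1 = lowest port, $2 = highest port
port_count() {
    echo $(( $2 - $1 + 1 ))
}

port_count 32768 60999   # default range
port_count 10000 65535   # recommended range
```

Because Linux holds a closed connection in TIME_WAIT for 60 seconds, the default range sustains only roughly 470 new connections per second to a single destination address and port before ports run out; the recommended range about doubles that.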
2.3 Filesystem/IPC Related
fs.file-max
| Item | Details |
|---|---|
| Meaning | Maximum number of file descriptors that can be opened system-wide |
| Default | ~100000 (varies by system) |
| Recommended | 6553600 or higher |
fs.file-max = 6553600
Issues from Misconfiguration:
- Low value: "Too many open files" errors occur. HDFS DataNode uses a file descriptor per block, so nodes storing tens of thousands of blocks experience complete read/write failure when file descriptors are exhausted. NiFi also uses many file descriptors for FlowFile and Content Repository access.
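To see how close a node is to the system-wide limit, the three fields of /proc/sys/fs/file-nr (allocated handles, allocated-but-unused handles, and the maximum) can be turned into a percentage. A minimal sketch, with the function name `fd_usage_pct` and the sample numbers as assumptions:

```bash
# Percentage of system-wide file handles in use.
# Arguments: the three fields of /proc/sys/fs/file-nr
#            (allocated, allocated-but-unused, maximum).
fd_usage_pct() {
    echo $(( ($1 - $2) * 100 / $3 ))
}

# Sample values for illustration; on a live host use:
#   fd_usage_pct $(cat /proc/sys/fs/file-nr)
fd_usage_pct 81920 0 6553600
```

Integer division rounds down, so small percentages print as 0 or 1; that is enough for an alert threshold but not for fine-grained tracking.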
fs.nr_open
| Item | Details |
|---|---|
| Meaning | Hard upper limit on file descriptors a single process can open |
| Default | 1048576 |
| Recommended | 1048576 or higher (must be greater than limits.conf nofile) |
fs.nr_open = 1048576
Issues from Misconfiguration:
- Lower than limits.conf nofile: No matter how high you set nofile in limits.conf, fs.nr_open caps the actual limit, preventing processes from opening the intended number of files.
kernel.pid_max
| Item | Details |
|---|---|
| Meaning | Maximum number of PIDs that can be allocated system-wide |
| Default | 32768 |
| Recommended | 4194304 |
kernel.pid_max = 4194304
Issues from Misconfiguration:
- Keeping default: When YARN runs thousands of containers, PIDs are exhausted and no new processes can be created. "fork: retry: Resource temporarily unavailable" errors occur, halting all jobs on the cluster.
3. GRUB Boot Parameters
Configured in GRUB_CMDLINE_LINUX in /etc/default/grub. After changes, run grub2-mkconfig -o /boot/grub2/grub.cfg and reboot.
3.1 Disable Transparent Huge Pages (THP)
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
| Item | Details |
|---|---|
| Meaning | Controls the kernel's automatic allocation/merging of 2MB huge pages |
| Default | always or madvise |
| Recommended | never |
Issues from Misconfiguration:
- THP enabled (always): The kernel merges memory pages (compaction) in the background, causing unpredictable latency spikes. This is one of the most common and critical misconfigurations in big data systems.
  - Hadoop: DataNode GC times become abnormally long, causing NameNode to mark DataNodes as dead.
  - Impala: Sudden pauses during query execution lead to query timeouts.
  - Spark: Extended executor GC pauses cause task failures and excessive speculation.
  - NiFi: Stalls during FlowFile processing cause back pressure to cascade.
Cloudera, Hortonworks (now Cloudera), and MapR all explicitly state in their documentation that THP must be disabled.
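The THP sysfs files list every supported mode on one line and mark the active one with square brackets, for example `always madvise [never]`. The sketch below pulls out the bracketed value so it can be compared in a script. The function name `thp_active` is a hypothetical helper.

```bash
# Extract the active value (the one in square brackets) from a THP sysfs
# line such as "always madvise [never]".
thp_active() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z+]*\)].*/\1/p'
}

thp_active "always madvise [never]"
thp_active "[always] madvise never"
```

On a live host the input would come from the file itself, e.g. `thp_active "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"`.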
You can also verify and disable at runtime:
# Check current status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Disable at runtime
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
3.2 NUMA Settings
GRUB_CMDLINE_LINUX="... numa=off"
| Item | Details |
|---|---|
| Meaning | Disables the kernel's NUMA (Non-Uniform Memory Access) awareness, treating all memory as a single node |
| Default | NUMA enabled |
| Recommended | Depends on the environment (see below) |
NUMA Configuration Strategy:
| Environment | Recommended | Reason |
|---|---|---|
| JVM-based (Hadoop, Spark) | numa=off or numactl --interleave=all | JVM is not NUMA-aware by default, causing memory skew |
| Impala (C++ based) | Keep NUMA enabled + vm.zone_reclaim_mode=0 | Impala supports NUMA-aware memory allocation |
Issues from Misconfiguration:
- NUMA enabled with JVM workloads: JVM heap memory concentrates on one NUMA node, causing frequent remote memory access with 2-3x increased memory access latency. GC times increase and throughput drops.
- NUMA disabled with Impala: Impala cannot perform NUMA topology-based optimizations, reducing memory access performance.
3.3 CPU Frequency Scaling / Power Management
GRUB_CMDLINE_LINUX="... intel_pstate=disable processor.max_cstate=1"
Or configure in BIOS:
| Item | Details |
|---|---|
| Meaning | Controls CPU frequency scaling and power state transitions based on load |
| Default | Power saving enabled (powersave or ondemand governor) |
| Recommended | performance governor, limit C-States |
# Set CPU governor to performance
cpupower frequency-set -g performance
# Or apply to all CPUs
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$cpu"
done
Issues from Misconfiguration:
- Power saving enabled: CPU takes tens to hundreds of microseconds to wake from deep C-states. For Impala, which processes many short requests, this latency accumulates and increases query latency.
- ondemand governor: CPU frequency ramp-up takes time, causing reduced performance at the start of burst workloads. Early stages of Spark jobs may experience degraded performance.
4. Limits Configuration (/etc/security/limits.conf)
Configured in /etc/security/limits.conf or /etc/security/limits.d/ directory.
4.1 nofile (Open File Descriptors)
* soft nofile 65536
* hard nofile 65536
| Item | Details |
|---|---|
| Meaning | Maximum number of file descriptors a process can open |
| Default | 1024 (soft), 4096 (hard) |
| Recommended | 65536 or higher (131072 recommended for Impala and NiFi) |
Issues from Misconfiguration:
- Keeping default (1024):
- HDFS DataNode: "Too many open files" errors occur as block count increases, causing block read/write failures.
- Impala: Queries scanning many partitions fail due to insufficient file descriptors.
- NiFi: Content Repository access fails when processing large numbers of concurrent FlowFiles.
- Spark: Reading shuffle files with many partitions fails.
4.2 nproc (Max User Processes)
* soft nproc 65536
* hard nproc 65536
| Item | Details |
|---|---|
| Meaning | Maximum number of processes (including threads) per user |
| Default | 4096 |
| Recommended | 65536 or higher |
Issues from Misconfiguration:
- Keeping default (4096): JVM creates an OS thread for each Java thread. In environments where Spark Executors use hundreds of threads, YARN Containers fail to create processes with "unable to create native thread" errors. This error is easily confused with OOM, but the actual cause is the nproc limit.
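Since every thread counts against nproc, it helps to see how many threads a daemon actually holds. The kernel reports this in the `Threads:` field of /proc/<pid>/status. The sketch below reads that field. The function name `threads_from_status` and the sample file are assumptions for illustration.

```bash
# Print the thread count recorded in a /proc/<pid>/status file.
# $1 = path to the status file
threads_from_status() {
    awk '/^Threads:/ {print $2}' "$1"
}

# Demonstration on a sample file; on a live host pass /proc/<pid>/status
sample=$(mktemp)
printf 'Name:\tjava\nThreads:\t412\n' > "$sample"
threads_from_status "$sample"
rm -f "$sample"
```

The nproc limit applies to the user as a whole, so the per-process figures add up; a command along the lines of `ps -L -u yarn --no-headers | wc -l` gives a rough total for one account (yarn here is just an example user).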
4.3 memlock (Memory Lock)
* soft memlock unlimited
* hard memlock unlimited
| Item | Details |
|---|---|
| Meaning | Maximum amount of memory (KB) a process can lock to prevent swapping |
| Default | 64 |
| Recommended | unlimited |
Issues from Misconfiguration:
- Keeping default:
- Impala: Impala locks memory to prevent swapping for performance. With low memlock limits, the Impala daemon may fail to start or be unable to lock memory, causing swap usage.
- HDFS DataNode: Memory-mapped file locks for short-circuit read functionality may fail.
5. Disk/Filesystem Optimization
5.1 I/O Scheduler
| Item | Details |
|---|---|
| Meaning | Algorithm that determines the order of disk I/O requests |
| HDD Recommended | deadline (named mq-deadline on multi-queue kernels) |
| SSD/NVMe Recommended | noop or none (multi-queue) |
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# Runtime change (HDD)
echo deadline > /sys/block/sda/queue/scheduler
# Runtime change (SSD)
echo noop > /sys/block/sda/queue/scheduler
Permanent configuration (GRUB; kernel 5.4 and later ignore the elevator= option):
GRUB_CMDLINE_LINUX="... elevator=deadline"
Or via udev rules:
# /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop"
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
Issues from Misconfiguration:
- Using cfq on SSD: CFQ (Completely Fair Queuing) is designed to minimize seek time on HDDs. On SSDs, it adds unnecessary request sorting and wait times, degrading IOPS performance by 30-50%.
- Using noop on HDD: Without request order optimization, disk heads move inefficiently, degrading sequential read performance.
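Scheduler names differ between kernels: multi-queue (blk-mq) kernels offer mq-deadline and none, while older ones offer deadline and noop. The sketch below picks whichever matching name the device actually lists, based on the rotational flag. The function name `pick_scheduler` and the preference order are assumptions, not a vendor recommendation.

```bash
# Choose a scheduler name for a device.
# $1 = contents of queue/rotational (1 = HDD, 0 = SSD/NVMe)
# $2 = contents of queue/scheduler, e.g. "noop deadline [cfq]"
pick_scheduler() {
    # Strip the brackets that mark the currently active scheduler
    avail=$(printf '%s' "$2" | tr -d '[]')
    if [ "$1" = "1" ]; then
        wanted="mq-deadline deadline"
    else
        wanted="none noop"
    fi
    for name in $wanted; do
        for have in $avail; do
            if [ "$name" = "$have" ]; then
                echo "$name"
                return 0
            fi
        done
    done
    return 1   # none of the preferred schedulers is offered
}

pick_scheduler 1 "noop deadline [cfq]"
pick_scheduler 0 "[mq-deadline] kyber bfq none"
```

On a live host the two arguments would be read from /sys/block/<device>/queue/rotational and /sys/block/<device>/queue/scheduler, and the result written back to the scheduler file.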
5.2 Mount Options
# /etc/fstab example
/dev/sdb1 /data1 ext4 defaults,noatime,nodiratime 0 2
/dev/sdc1 /data2 xfs defaults,noatime,nodiratime 0 2
| Option | Meaning |
|---|---|
| noatime | Do not update file access time on read |
| nodiratime | Do not update directory access time on read |
Issues from Misconfiguration:
- Without noatime: File reads trigger access-time metadata writes (on every read under strictatime, less often under the default relatime). When HDFS DataNode reads blocks, additional write I/O occurs on disk, increasing unnecessary disk load by 20-30% in read-heavy workloads.
5.3 Read-ahead Settings
| Item | Details |
|---|---|
| Meaning | Amount of data the kernel pre-reads when it detects sequential reads (in 512-byte sectors) |
| Default | 256 (128KB) |
| Recommended | 2048 to 4096 (1MB to 2MB) |
# Check current value
blockdev --getra /dev/sda
# Change setting
blockdev --setra 2048 /dev/sda
Issues from Misconfiguration:
- Too low: Frequent disk seeks when sequentially reading HDFS 128MB blocks, reducing throughput. MapReduce and Spark full table scan performance suffers.
- Too high (8192+): Excessive pre-reading of unnecessary data in random I/O workloads wastes memory and delays other I/O requests.
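Because blockdev works in 512-byte sectors, it is easy to misread the values. The conversion used in the table can be sketched as follows; the function name `ra_sectors_to_kb` is a hypothetical helper.

```bash
# Convert a read-ahead value in 512-byte sectors (as used by blockdev) to KB.
ra_sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

for sectors in 256 2048 4096; do
    echo "$sectors sectors = $(ra_sectors_to_kb "$sectors") KB"
done
```

The same setting is exposed directly in KB as /sys/block/<device>/queue/read_ahead_kb, which avoids the conversion altogether.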
6. Workload-Specific Recommended Settings Summary
6.1 Hadoop (HDFS/YARN)
# /etc/sysctl.conf
vm.swappiness = 1
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.overcommit_memory = 1
vm.zone_reclaim_mode = 0
fs.file-max = 6553600
kernel.pid_max = 4194304
net.core.somaxconn = 4096
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.ip_local_port_range = 10000 65535
# /etc/security/limits.conf
hdfs soft nofile 65536
hdfs hard nofile 65536
hdfs soft nproc 65536
hdfs hard nproc 65536
yarn soft nofile 65536
yarn hard nofile 65536
yarn soft nproc 65536
yarn hard nproc 65536
6.2 Impala
Impala is C++-based and manages memory directly, requiring additional settings.
# Additional sysctl settings
net.ipv4.tcp_max_syn_backlog = 4096
# /etc/security/limits.conf
impala soft nofile 131072
impala hard nofile 131072
impala soft nproc 65536
impala hard nproc 65536
impala soft memlock unlimited
impala hard memlock unlimited
For Impala, THP disabling and memlock unlimited are particularly critical. Missing just these two settings can degrade query performance by 10x or more.
6.3 Spark
# Additional Spark considerations
# Prevent Executor fork failures
vm.overcommit_memory = 1
# Support large container counts
kernel.pid_max = 4194304
# /etc/security/limits.conf
spark soft nofile 65536
spark hard nofile 65536
spark soft nproc 65536
spark hard nproc 65536
With Spark Dynamic Allocation, executor counts fluctuate rapidly, so nproc and pid_max settings must be sufficient.
6.4 NiFi
NiFi handles massive concurrent connections and file I/O, making file descriptor and network settings critical.
# /etc/security/limits.conf
nifi soft nofile 131072
nifi hard nofile 131072
nifi soft nproc 65536
nifi hard nproc 65536
# Additional NiFi sysctl
net.core.somaxconn = 4096
net.ipv4.ip_local_port_range = 10000 65535
For NiFi, it is also recommended to use -XX:+UseG1GC in bootstrap.conf JVM options and adjust thread pool sizes in nifi.properties.
7. Applying and Verifying Settings
7.1 Application Order
1. Edit /etc/sysctl.conf → Apply with sysctl -p
2. Edit /etc/security/limits.conf → Requires re-login
3. Edit /etc/default/grub → Requires grub2-mkconfig + reboot
4. Disk I/O settings → Runtime immediate (permanent via udev rules)
5. Mount options → Remount or reboot
7.2 Verification Script
Use the following script to check all key settings at once:
#!/bin/bash
echo "=== Memory ==="
echo "vm.swappiness = $(sysctl -n vm.swappiness)"
echo "vm.dirty_ratio = $(sysctl -n vm.dirty_ratio)"
echo "vm.dirty_background = $(sysctl -n vm.dirty_background_ratio)"
echo "vm.overcommit_memory = $(sysctl -n vm.overcommit_memory)"
echo "vm.zone_reclaim_mode = $(sysctl -n vm.zone_reclaim_mode)"
echo ""
echo "=== Network ==="
echo "somaxconn = $(sysctl -n net.core.somaxconn)"
echo "rmem_max = $(sysctl -n net.core.rmem_max)"
echo "wmem_max = $(sysctl -n net.core.wmem_max)"
echo "tcp_rmem = $(sysctl -n net.ipv4.tcp_rmem)"
echo "tcp_wmem = $(sysctl -n net.ipv4.tcp_wmem)"
echo "ip_local_port_range = $(sysctl -n net.ipv4.ip_local_port_range)"
echo ""
echo "=== File System ==="
echo "fs.file-max = $(sysctl -n fs.file-max)"
echo "kernel.pid_max = $(sysctl -n kernel.pid_max)"
echo ""
echo "=== THP ==="
echo "THP enabled = $(cat /sys/kernel/mm/transparent_hugepage/enabled)"
echo "THP defrag = $(cat /sys/kernel/mm/transparent_hugepage/defrag)"
echo ""
echo "=== Limits (current user) ==="
echo "nofile soft = $(ulimit -Sn)"
echo "nofile hard = $(ulimit -Hn)"
echo "nproc soft = $(ulimit -Su)"
echo "nproc hard = $(ulimit -Hu)"
echo ""
echo "=== Disk I/O ==="
for disk in /sys/block/sd*/queue/scheduler; do
    dev=$(basename "$(dirname "$(dirname "$disk")")")
    echo "$dev scheduler = $(cat "$disk")"
done
for disk in /sys/block/sd*; do
    dev=$(basename "$disk")
    echo "$dev readahead = $(blockdev --getra "/dev/$dev" 2>/dev/null)"
done
echo ""
echo "=== Mount Options ==="
mount | grep -E "^/dev/" | awk '{print $1, $3, $6}'
7.3 Important Notes
- On production clusters, change only one parameter at a time and monitor its effect.
- Settings written to /etc/sysctl.conf persist across reboots, but values changed directly with echo commands reset on reboot.
- limits.conf changes only apply to new sessions. Existing processes are not affected.
- GRUB changes require running grub2-mkconfig followed by a reboot.
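Because limits only take effect for new sessions, the value that matters is the one the running daemon actually has, which the kernel exposes in /proc/<pid>/limits. The sketch below extracts the soft and hard open-file limits from a file in that layout. The function name `nofile_from_limits` and the sample content are assumptions for illustration.

```bash
# Print "soft hard" open-file limits from a /proc/<pid>/limits file.
# $1 = path to the limits file
nofile_from_limits() {
    awk '/^Max open files/ {print $4, $5}' "$1"
}

# Demonstration on sample lines in the /proc/<pid>/limits layout;
# on a live host pass /proc/<pid>/limits of the daemon itself.
sample=$(mktemp)
printf 'Limit                     Soft Limit           Hard Limit           Units\n' > "$sample"
printf 'Max open files            65536                65536                files\n' >> "$sample"
nofile_from_limits "$sample"
rm -f "$sample"
```

If a daemon still shows the old values after a restart, it may be worth checking how it is launched: services started by systemd generally take their limits from the unit file (for example LimitNOFILE=) rather than from limits.conf.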