
Complete Guide to Linux Kernel Parameter Tuning for Hadoop, Impala, Spark, and NiFi

A comprehensive guide to essential Linux sysctl.conf, GRUB, limits.conf, and disk I/O parameters for big data workload systems (Hadoop, Impala, Spark, NiFi), including parameter meanings, recommended values, and issues caused by misconfiguration.

Data Dynamics | April 14, 2026 | 17 min read

When operating big data processing systems such as Hadoop, Impala, Spark, and NiFi, failing to properly tune Linux OS-level kernel parameters leads to various problems including performance degradation, OOM (Out of Memory), network timeouts, and file descriptor exhaustion. This guide covers the meaning of each parameter, recommended values, and the actual issues that arise from misconfiguration.


1. Overview

Big data workload systems have fundamentally different resource usage patterns compared to typical web or application servers.

Characteristic | General Server | Big Data Workload
Memory | Several GB, uniform usage | Tens to hundreds of GB, massive buffers/caches
Network | Small request/response | Large-scale data shuffle, block transfers
File Descriptors | Hundreds to thousands | Tens of thousands to hundreds of thousands
Processes/Threads | Tens to hundreds | Thousands to tens of thousands
Disk I/O | Random I/O dominant | Large sequential I/O

Linux default kernel parameters are configured for general-purpose servers, so they must be adjusted for big data workloads. Configuration falls into four main areas:

  1. sysctl.conf — Kernel runtime parameters (memory, network, filesystem)
  2. GRUB Boot Parameters — Kernel options applied at boot time
  3. limits.conf — Per-user/process resource limits
  4. Disk/Filesystem — I/O scheduler, mount options

2. sysctl.conf Parameters

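Set these in /etc/sysctl.conf (or in a file under /etc/sysctl.d/) and apply them without a reboot by running sysctl -p. With no argument, sysctl -p reloads only /etc/sysctl.conf; use sysctl --system to load every configuration file.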

vm.swappiness

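Item | Details
Meaning | Controls how strongly the kernel prefers swapping out anonymous memory over reclaiming page cache when memory is tight (0-100; kernel 5.8 and later accept up to 200)
Default | 60
Recommended | 1 (setting it to 0 can lead to OOM Killer activity under memory pressure)

vm.swappiness = 1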

Issues from Misconfiguration:

  • High value (60+): The system uses swap even when memory is available, causing dramatic performance degradation in Hadoop DataNode, Impala Daemon, and Spark Executor. Impala, being a memory-based query engine, can see query response times increase 10-100x when swap is used.
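  • Value of 0: On kernel 3.5 and later (and on RHEL/CentOS 6.4 and later), the kernel avoids swapping anonymous pages until free memory is nearly exhausted, so the OOM Killer may terminate processes under memory pressure even though swap space is still free. DataNode or NameNode may be killed unexpectedly.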

vm.dirty_ratio

Item | Details
Meaning | Maximum percentage of available memory (free plus reclaimable pages, not total RAM) that can hold dirty pages (data not yet written to disk). Once it is reached, writing processes are throttled and must wait for writeback to disk
Default | 20 (%)
Recommended | 10 to 15

vm.dirty_ratio = 10

Issues from Misconfiguration:

  • Too high (30+): Dirty pages accumulate massively, then flush all at once causing I/O spikes. HDFS DataNode may fail to send block reports on time, and NiFi FlowFile repository writes may be delayed, triggering back pressure.
  • Too low (5 or below): Overly frequent disk writes reduce throughput. Spark shuffle write performance drops significantly.

vm.dirty_background_ratio

Item | Details
Meaning | Threshold at which the kernel's background flusher threads (pdflush on very old kernels) begin writing dirty pages to disk
Default | 10 (%)
Recommended | 5

vm.dirty_background_ratio = 5

Issues from Misconfiguration:

  • Too high: Background flush starts late; when dirty pages reach vm.dirty_ratio, synchronous writes occur and applications stall on I/O wait.
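  • Set equal to or higher than dirty_ratio: The kernel silently clamps the background threshold to half of the dirty threshold, so the configured value never takes effect and the actual flush behavior no longer matches what the configuration file says.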
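
To make the two thresholds concrete, here is a minimal sketch that converts the ratios into approximate sizes. It assumes the kernel's notion of available memory can be approximated by MemAvailable from /proc/meminfo (the kernel's internal calculation differs slightly), and the function name dirty_thresholds_mib is a hypothetical helper, not a standard tool.

```shell
#!/bin/bash
# Sketch: estimate the writeback thresholds implied by the two ratios.
# Usage: dirty_thresholds_mib <available_kib> <dirty_ratio> <background_ratio>
# Prints "<background_MiB> <dirty_MiB>" and returns 1 when the background
# ratio is not lower than the dirty ratio (a configuration to avoid).
dirty_thresholds_mib() {
    avail_kib=$1; dirty=$2; background=$3
    echo "$(( avail_kib * background / 100 / 1024 )) $(( avail_kib * dirty / 100 / 1024 ))"
    [ "$background" -lt "$dirty" ]
}

# Example with live values; MemAvailable is only an approximation (assumption).
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/dirty_ratio ]; then
    avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    dirty_thresholds_mib "${avail:-0}" \
        "$(cat /proc/sys/vm/dirty_ratio)" \
        "$(cat /proc/sys/vm/dirty_background_ratio)" \
        || echo "warning: dirty_background_ratio should be lower than dirty_ratio"
fi
```

For a node with 100 GiB available and ratios of 10 and 5, the helper prints "5120 10240": background writeback starts at roughly 5 GiB of dirty data, and writers are throttled at roughly 10 GiB.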

vm.overcommit_memory

Item | Details
Meaning | Controls the kernel's memory overcommit policy (0: heuristic, 1: always allow, 2: strict limit)
Default | 0
Recommended | 1 (for Hadoop/Spark environments)

vm.overcommit_memory = 1

Issues from Misconfiguration:

  • Value 0 (default): The kernel may reject memory requests, causing fork() failures. Hadoop MapReduce child tasks or Spark executors fail to create processes, leading to repeated task failures.
  • Value 2: Strictly refuses allocations beyond the commit limit, which is swap plus vm.overcommit_ratio percent of physical memory (50% by default). JVM-based big data systems reserve large amounts of virtual memory, so processes may fail to start even when actual usage is low.
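
As a rough illustration of why mode 2 is so restrictive, the sketch below computes the commit limit from RAM, swap, and vm.overcommit_ratio. The helper name commit_limit_kib is hypothetical, and the calculation ignores huge page reservations, which the kernel subtracts from RAM first.

```shell
#!/bin/bash
# Sketch: approximate CommitLimit under vm.overcommit_memory = 2.
# Usage: commit_limit_kib <ram_kib> <swap_kib> <overcommit_ratio_percent>
# Assumption: no huge pages are reserved (the kernel excludes them from RAM).
commit_limit_kib() {
    echo $(( $2 + $1 * $3 / 100 ))
}

# 64 GiB RAM, 8 GiB swap, default ratio of 50
commit_limit_kib 67108864 8388608 50

# Compare with what the running kernel reports (values in kB).
if [ -r /proc/meminfo ]; then
    grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo || true
fi
```

With 64 GiB of RAM and 8 GiB of swap, the limit is 41943040 KiB (40 GiB), which a handful of JVMs with large reserved heaps can exceed long before physical memory is actually in use.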

vm.zone_reclaim_mode

Item | Details
Meaning | Determines whether to reclaim memory locally when a NUMA node runs low, or allocate from remote nodes
Default | 0 on current kernels (older kernels could enable it automatically on NUMA hardware, so verify the value)
Recommended | 0

vm.zone_reclaim_mode = 0

Issues from Misconfiguration:

  • Value 1: Aggressively reclaims page cache from the local NUMA node, forcing data to be re-read from disk. HDFS read performance degrades significantly, and Impala queries experience unpredictable performance jitter. Both Cloudera and Hortonworks recommend setting this to 0.

net.core.somaxconn

Item | Details
Meaning | Maximum length of the listen socket backlog queue, that is, how many fully established connections can wait to be accepted
Default | 128 (4096 on kernel 5.4 and later)
Recommended | 4096 or higher

net.core.somaxconn = 4096

Issues from Misconfiguration:

  • Keeping default (128): In high-concurrency environments, connection refused errors occur. NiFi's ListenHTTP processor fails to receive data, or connections to Spark Shuffle Service are refused, causing shuffle fetch failures.

net.core.rmem_max / net.core.wmem_max

Item | Details
Meaning | Maximum receive (rmem) / send (wmem) buffer size for sockets (bytes)
Default | 212992 (~208KB)
Recommended | 16777216 (16MB) or higher

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

Issues from Misconfiguration:

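  • Keeping default: Applications that request larger socket buffers with SO_RCVBUF/SO_SNDBUF are silently capped at this value, so network bottlenecks occur during HDFS block transfers (128MB blocks), Spark shuffle, and Impala broadcast joins. With undersized buffers the TCP window cannot grow enough to fill the link, and actual throughput on 10Gbps networks may stall at 1-2Gbps.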

net.ipv4.tcp_rmem / net.ipv4.tcp_wmem

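Item | Details
Meaning | TCP socket receive/send buffer sizes in bytes (min, default, max)
Default | tcp_rmem: 4096 87380 6291456; tcp_wmem: 4096 16384 4194304 (typical values; the maximum scales with installed memory, and newer kernels use 131072 as the tcp_rmem default)
Recommended | 4096 65536 16777216

net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216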

Issues from Misconfiguration:

  • Low max value: TCP windows cannot expand sufficiently during large data transfers, failing to utilize network bandwidth. Inter-node data replication (HDFS replication) slows down, increasing data ingestion time.
  • Middle (default) value too high: Every TCP connection starts with a large buffer, wasting memory. DataNodes with thousands of connections may experience memory shortages.
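
A quick way to judge whether the max value is large enough is the bandwidth-delay product (BDP): the amount of data that must be in flight to keep a link full. The sketch below is a back-of-the-envelope calculation; the function name bdp_bytes is hypothetical, and the round-trip times are example figures rather than measurements.

```shell
#!/bin/bash
# Sketch: bandwidth-delay product in bytes.
# Usage: bdp_bytes <link_speed_mbps> <rtt_ms>
# Mbit/s * 1,000,000 / 8 bits per byte * (ms / 1000) simplifies to mbps * 125 * ms.
bdp_bytes() {
    echo $(( $1 * 125 * $2 ))
}

bdp_bytes 10000 1     # 10 Gbps, 1 ms RTT  -> 1250000
bdp_bytes 10000 10    # 10 Gbps, 10 ms RTT -> 12500000
```

A single connection on a 10Gbps link with a 1 ms round trip needs about 1.25 MB in flight, and about 12.5 MB at 10 ms, so a 16777216-byte maximum covers both cases while a maximum of a few hundred kilobytes does not.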

net.ipv4.tcp_max_syn_backlog

Item | Details
Meaning | Maximum number of half-open connections (SYN received, handshake not yet completed) queued per listening socket
Default | 1024
Recommended | 4096 or higher

net.ipv4.tcp_max_syn_backlog = 4096

Issues from Misconfiguration:

  • Keeping default: When all nodes simultaneously attempt connections after a cluster restart, the SYN queue overflows and connections are dropped. YARN NodeManager re-registration and Impala StateStore subscription renewal fail, extending cluster recovery time.

net.ipv4.ip_local_port_range

Item | Details
Meaning | Range of ephemeral ports that can be assigned to client sockets
Default | 32768 60999
Recommended | 10000 65535

net.ipv4.ip_local_port_range = 10000 65535

Issues from Misconfiguration:

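  • Narrow port range: "Cannot assign requested address" errors occur during Spark shuffle and NiFi Site-to-Site transfers that create massive outbound connections. Ports in TIME_WAIT state become exhausted when many connections are rapidly opened and closed.
  • Range overlapping service ports: An outbound connection can be assigned a port that a daemon later needs to listen on, and that daemon then fails to bind when it restarts. List such ports in net.ipv4.ip_local_reserved_ports to exclude them from ephemeral allocation.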
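
The effect of widening the range is easy to quantify. The sketch below counts the ports in a range and estimates the sustained connection rate to a single destination address and port, assuming each closed connection holds its port in TIME_WAIT for 60 seconds (the fixed Linux value). Both function names are hypothetical helpers.

```shell
#!/bin/bash
# Sketch: size of an ephemeral port range and the connection rate it sustains.
# Usage: ephemeral_port_count "<low> <high>"
ephemeral_port_count() {
    set -- $1               # split "low high" into $1 and $2
    echo $(( $2 - $1 + 1 ))
}

# Usage: max_conn_rate "<low> <high>"
# Connections per second to ONE destination ip:port, assuming every port
# sits in TIME_WAIT for 60 seconds after the connection closes.
max_conn_rate() {
    echo $(( $(ephemeral_port_count "$1") / 60 ))
}

ephemeral_port_count "32768 60999"   # default range     -> 28232
ephemeral_port_count "10000 65535"   # recommended range -> 55536

# Live value, if available on this host.
if [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
    ephemeral_port_count "$(cat /proc/sys/net/ipv4/ip_local_port_range)"
fi
```

Under that assumption the default range sustains roughly 470 new connections per second to one destination, and the recommended range roughly 925.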

fs.file-max

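Item | Details
Meaning | Maximum number of file descriptors that can be opened system-wide
Default | Computed at boot from installed memory (about 100000 per GB of RAM on older kernels; check with sysctl fs.file-max)
Recommended | 6553600 or higher

fs.file-max = 6553600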

Issues from Misconfiguration:

  • Low value: "Too many open files in system" errors occur. HDFS DataNode holds open file descriptors for every block being read or written, so nodes serving tens of thousands of blocks experience complete read/write failure when file descriptors are exhausted. NiFi also uses many file descriptors for FlowFile and Content Repository access.

fs.nr_open

Item | Details
Meaning | Hard upper limit on file descriptors a single process can open
Default | 1048576
Recommended | 1048576 or higher (must be greater than or equal to the limits.conf nofile value)

fs.nr_open = 1048576

Issues from Misconfiguration:

  • Lower than limits.conf nofile: No matter how high you set nofile in limits.conf, fs.nr_open caps the actual limit, preventing processes from opening the intended number of files.

kernel.pid_max

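Item | Details
Meaning | Maximum number of PIDs that can be allocated system-wide (every thread consumes one, not only every process)
Default | 32768 (4194304 on many recent 64-bit, systemd-based distributions)
Recommended | 4194304

kernel.pid_max = 4194304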

Issues from Misconfiguration:

  • Keeping default: When YARN runs thousands of containers, PIDs are exhausted and no new processes can be created. "fork: retry: Resource temporarily unavailable" errors occur, halting all jobs on the cluster.
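
To see how close a node is to the limit, the number of existing tasks can be compared with kernel.pid_max. The sketch below reads the task count from the fourth field of /proc/loadavg ("runnable/total" scheduling entities, threads included). The helper names are hypothetical, and the task count is only an approximation of PID consumption.

```shell
#!/bin/bash
# Sketch: compare the number of existing tasks with kernel.pid_max.
# Usage: tasks_from_loadavg "<contents of /proc/loadavg>"
tasks_from_loadavg() {
    set -- $1
    echo "${4#*/}"      # 4th field is "runnable/total"; keep the part after "/"
}

# Usage: pid_usage_percent <tasks> <pid_max>
pid_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

if [ -r /proc/loadavg ] && [ -r /proc/sys/kernel/pid_max ]; then
    tasks=$(tasks_from_loadavg "$(cat /proc/loadavg)")
    max=$(cat /proc/sys/kernel/pid_max)
    echo "tasks=$tasks pid_max=$max usage=$(pid_usage_percent "$tasks" "$max")%"
fi
```

Sampling this value during peak load gives an early warning well before fork failures begin.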

3. GRUB Boot Parameters

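Set these in GRUB_CMDLINE_LINUX in /etc/default/grub. After changes, regenerate the GRUB configuration and reboot. On RHEL/CentOS with BIOS boot the command is grub2-mkconfig -o /boot/grub2/grub.cfg; the output path differs on UEFI systems, and Debian/Ubuntu use update-grub instead.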

3.1 Disable Transparent Huge Pages (THP)

GRUB_CMDLINE_LINUX="... transparent_hugepage=never"

Item | Details
Meaning | Controls the kernel's automatic allocation/merging of 2MB huge pages
Default | always or madvise
Recommended | never

Issues from Misconfiguration:

  • THP enabled (always): The kernel compacts and merges memory pages in the background, causing unpredictable latency spikes. This is one of the most common and critical misconfigurations in big data systems.
    • Hadoop: DataNode GC times become abnormally long, causing NameNode to mark DataNodes as dead.
    • Impala: Sudden pauses during query execution lead to query timeouts.
    • Spark: Extended executor GC pauses cause task failures and excessive speculation.
    • NiFi: Stalls during FlowFile processing cause back pressure to cascade.

Cloudera, Hortonworks (now Cloudera), and MapR all explicitly state in their documentation that THP must be disabled.

You can also verify and disable at runtime:

# Check current status
cat /sys/kernel/mm/transparent_hugepage/enabled
 
# Disable at runtime
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
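
These sysfs files list every allowed value and mark the active one with square brackets, for example "always madvise [never]". The following sketch extracts just the active value, which is convenient in monitoring or compliance checks. The function name active_choice is hypothetical.

```shell
#!/bin/bash
# Sketch: extract the bracketed (active) value from a sysfs choice string.
# Usage: active_choice "<string such as: always madvise [never]>"
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

active_choice "always madvise [never]"    # -> never

# Report the live THP settings when the files exist on this host.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag; do
    if [ -r "$f" ]; then
        echo "$f: $(active_choice "$(cat "$f")")"
    fi
done
```

The same helper works for /sys/block/<device>/queue/scheduler, which uses the same bracket convention.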

3.2 NUMA Settings

GRUB_CMDLINE_LINUX="... numa=off"

Item | Details
Meaning | Disables the kernel's NUMA (Non-Uniform Memory Access) awareness, so all memory is treated as a single node
Default | NUMA enabled
Recommended | Depends on the environment (see below)

NUMA Configuration Strategy:

Environment | Recommended | Reason
JVM-based (Hadoop, Spark) | numa=off or numactl --interleave=all | The JVM is not NUMA-aware by default (-XX:+UseNUMA is opt-in), causing memory skew
Impala (C++ based) | Keep NUMA enabled + vm.zone_reclaim_mode=0 | Impala supports NUMA-aware memory allocation

Issues from Misconfiguration:

  • NUMA enabled with JVM workloads: JVM heap memory concentrates on one NUMA node, causing frequent remote memory access with 2-3x increased memory access latency. GC times increase and throughput drops.
  • NUMA disabled with Impala: Impala cannot perform NUMA topology-based optimizations, reducing memory access performance.

3.3 CPU Frequency Scaling / Power Management

GRUB_CMDLINE_LINUX="... intel_pstate=disable processor.max_cstate=1"

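On systems that use the intel_idle driver, limit C-states with intel_idle.max_cstate=1 instead, because processor.max_cstate applies only to the ACPI idle driver. The same limits can also be configured in the BIOS: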

Item | Details
Meaning | Controls CPU frequency scaling and power state transitions based on load
Default | Power saving enabled (powersave or ondemand governor)
Recommended | performance governor, limit C-States

# Set CPU governor to performance
cpupower frequency-set -g performance
 
# Or apply to all CPUs (run as root)
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$cpu"
done

Issues from Misconfiguration:

  • Power saving enabled: CPU takes tens to hundreds of microseconds to wake from deep C-states. For Impala, which processes many short requests, this latency accumulates and increases query latency.
  • ondemand governor: CPU frequency ramp-up takes time, causing reduced performance at the start of burst workloads. Early stages of Spark jobs may experience degraded performance.

4. Limits Configuration (/etc/security/limits.conf)

Configured in /etc/security/limits.conf or /etc/security/limits.d/ directory.

4.1 nofile (Open File Descriptors)

*    soft    nofile    65536
*    hard    nofile    65536

Item | Details
Meaning | Maximum number of file descriptors a process can open
Default | 1024 (soft), 4096 (hard)
Recommended | 65536 or higher (131072 recommended for Impala and NiFi)

Issues from Misconfiguration:

  • Keeping default (1024):
    • HDFS DataNode: "Too many open files" errors occur as block count increases, causing block read/write failures.
    • Impala: Queries scanning many partitions fail due to insufficient file descriptors.
    • NiFi: Content Repository access fails when processing large numbers of concurrent FlowFiles.
    • Spark: Reading shuffle files with many partitions fails.

4.2 nproc (Max User Processes)

*    soft    nproc    65536
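*    hard    nproc    65536

Item | Details
Meaning | Maximum number of processes (including threads) per user
Default | 4096 (on RHEL/CentOS 7 this soft limit comes from /etc/security/limits.d/20-nproc.conf, which overrides limits.conf)
Recommended | 65536 or higher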

Issues from Misconfiguration:

  • Keeping default (4096): The JVM creates an OS thread for each Java thread, and every thread counts against nproc. In environments where Spark Executors use hundreds of threads, YARN containers fail with "java.lang.OutOfMemoryError: unable to create new native thread". Although it is reported as an OutOfMemoryError, the actual cause is the nproc limit rather than a shortage of memory.
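
Because the limit counts threads rather than processes, it helps to measure how many threads a service account really owns. The sketch below sums the Threads field of every /proc/<pid>/status file whose real UID matches. The function name count_user_threads is hypothetical, and the "yarn" account in the usage line is only an example.

```shell
#!/bin/bash
# Sketch: count all threads owned by a numeric UID, the quantity nproc limits.
# Usage: count_user_threads <uid>
count_user_threads() {
    uid=$1
    total=0
    for status in /proc/[0-9]*/status; do
        [ -r "$status" ] || continue
        # A process may exit between the glob and the read; treat it as 0.
        n=$(awk -v uid="$uid" '
            $1 == "Uid:" && $2 == uid { owned = 1 }
            $1 == "Threads:"          { threads = $2 }
            END { if (owned) print threads + 0; else print 0 }' "$status" 2>/dev/null)
        total=$(( total + ${n:-0} ))
    done
    echo "$total"
}

# Threads owned by the current user, next to the soft nproc limit.
echo "threads=$(count_user_threads "$(id -u)") nproc_soft=$(ulimit -Su)"
```

To check a service account, pass its UID, for example count_user_threads "$(id -u yarn)".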

4.3 memlock (Memory Lock)

*    soft    memlock    unlimited
*    hard    memlock    unlimited

Item | Details
Meaning | Maximum amount of memory (KB) a process can lock to prevent swapping
Default | 64
Recommended | unlimited

Issues from Misconfiguration:

  • Keeping default:
    • Impala: Impala locks memory to prevent swapping for performance. With a low memlock limit, the Impala daemon may fail to start or be unable to lock memory, which allows its memory to be swapped out.
    • HDFS DataNode: Memory locking used by HDFS caching and short-circuit reads may fail. If dfs.datanode.max.locked.memory is set higher than the memlock limit, the DataNode refuses to start.

5. Disk/Filesystem Optimization

5.1 I/O Scheduler

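Item | Details
Meaning | Algorithm that determines the order of disk I/O requests
HDD Recommended | deadline (named mq-deadline on kernels that use the multi-queue block layer)
SSD/NVMe Recommended | noop or none (multi-queue)

# Check current scheduler (the active one is shown in square brackets)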
cat /sys/block/sda/queue/scheduler
 
# Runtime change (HDD)
echo deadline > /sys/block/sda/queue/scheduler
 
# Runtime change (SSD)
echo noop > /sys/block/sda/queue/scheduler

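Permanent configuration (GRUB; legacy single-queue kernels only, because the elevator= option has no effect on kernel 5.4 and later):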

GRUB_CMDLINE_LINUX="... elevator=deadline"

Or via udev rules:

# /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop"
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"

Issues from Misconfiguration:

  • Using cfq on SSD: CFQ (Completely Fair Queuing) was designed around the seek behavior of rotating disks. On SSDs its request sorting and idling add latency without any benefit, and can reduce IOPS by 30-50%.
  • Using noop on HDD: Without request order optimization, disk heads move inefficiently, degrading sequential read performance.

5.2 Mount Options

# /etc/fstab example
/dev/sdb1  /data1  ext4  defaults,noatime,nodiratime  0  2
/dev/sdc1  /data2  xfs   defaults,noatime,nodiratime  0  2

Option | Meaning
noatime | Do not update file access time on read
nodiratime | Do not update directory access time on read (already implied by noatime; listing both is harmless)

Issues from Misconfiguration:

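  • Without noatime: Under strictatime, every file read triggers a metadata write, which can increase unnecessary disk load by 20-30% in read-heavy workloads. The kernel's default relatime mode already limits this to roughly one update per file per day, but on DataNodes holding a very large number of block files those updates still add avoidable write I/O. noatime removes them entirely.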

5.3 Read-ahead Settings

Item | Details
Meaning | Amount of data the kernel pre-reads when it detects sequential reads (in 512-byte sectors)
Default | 256 (128KB)
Recommended | 2048 to 4096 (1MB to 2MB)

# Check current value
blockdev --getra /dev/sda
 
# Change setting
blockdev --setra 2048 /dev/sda

Issues from Misconfiguration:

  • Too low: The kernel issues many small requests while sequentially reading HDFS 128MB blocks. On HDDs shared by several readers this means extra seeks and lower throughput, and MapReduce and Spark full table scan performance suffers.
  • Too high (8192+): Excessive pre-reading of unnecessary data in random I/O workloads wastes memory and delays other I/O requests.
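
Since blockdev works in 512-byte sectors while most people think in KB or MB, a small conversion helper avoids off-by-a-factor mistakes. This is a minimal sketch with hypothetical function names.

```shell
#!/bin/bash
# Sketch: convert between blockdev read-ahead units (512-byte sectors) and KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_kib_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

ra_sectors_to_kib 256     # default           -> 128
ra_sectors_to_kib 2048    # recommended lower -> 1024
ra_sectors_to_kib 4096    # recommended upper -> 2048
ra_kib_to_sectors 1024    # 1 MiB expressed in sectors -> 2048
```

The same setting is also exposed in KiB as /sys/block/<device>/queue/read_ahead_kb, so a blockdev value of 2048 appears there as 1024.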

6. Recommended Settings by System

6.1 Hadoop (HDFS/YARN)

# /etc/sysctl.conf
vm.swappiness = 1
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.overcommit_memory = 1
vm.zone_reclaim_mode = 0
fs.file-max = 6553600
kernel.pid_max = 4194304
net.core.somaxconn = 4096
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.ip_local_port_range = 10000 65535

# /etc/security/limits.conf
hdfs    soft    nofile    65536
hdfs    hard    nofile    65536
hdfs    soft    nproc     65536
hdfs    hard    nproc     65536
yarn    soft    nofile    65536
yarn    hard    nofile    65536
yarn    soft    nproc     65536
yarn    hard    nproc     65536

6.2 Impala

Impala is C++-based and manages memory directly, requiring additional settings.

# Additional sysctl settings
net.ipv4.tcp_max_syn_backlog = 4096
 
# /etc/security/limits.conf
impala  soft    nofile    131072
impala  hard    nofile    131072
impala  soft    nproc     65536
impala  hard    nproc     65536
impala  soft    memlock   unlimited
impala  hard    memlock   unlimited

For Impala, THP disabling and memlock unlimited are particularly critical. Missing just these two settings can degrade query performance by 10x or more.

6.3 Spark

# Additional Spark considerations
# (comments go on their own lines; sysctl.conf does not strip trailing comments)
# Prevent Executor fork failures
vm.overcommit_memory = 1
# Support large container counts
kernel.pid_max = 4194304
 
# /etc/security/limits.conf
spark   soft    nofile    65536
spark   hard    nofile    65536
spark   soft    nproc     65536
spark   hard    nproc     65536

With Spark Dynamic Allocation, executor counts fluctuate rapidly, so nproc and pid_max settings must be sufficient.

6.4 NiFi

NiFi handles massive concurrent connections and file I/O, making file descriptor and network settings critical.

# /etc/security/limits.conf
nifi    soft    nofile    131072
nifi    hard    nofile    131072
nifi    soft    nproc     65536
nifi    hard    nproc     65536
 
# Additional NiFi sysctl
net.core.somaxconn = 4096
net.ipv4.ip_local_port_range = 10000 65535

For NiFi, it is also recommended to use -XX:+UseG1GC in bootstrap.conf JVM options and adjust thread pool sizes in nifi.properties.


7. Applying and Verifying Settings

7.1 Application Order

1. Edit /etc/sysctl.conf          → Apply with sysctl -p
2. Edit /etc/security/limits.conf → Requires re-login
3. Edit /etc/default/grub          → Requires grub2-mkconfig + reboot
4. Disk I/O settings              → Runtime immediate (permanent via udev rules)
5. Mount options                   → Remount or reboot

7.2 Verification Script

Use the following script to check all key settings at once:

#!/bin/bash
echo "=== Memory ==="
echo "vm.swappiness          = $(sysctl -n vm.swappiness)"
echo "vm.dirty_ratio         = $(sysctl -n vm.dirty_ratio)"
echo "vm.dirty_background    = $(sysctl -n vm.dirty_background_ratio)"
echo "vm.overcommit_memory   = $(sysctl -n vm.overcommit_memory)"
echo "vm.zone_reclaim_mode   = $(sysctl -n vm.zone_reclaim_mode)"
 
echo ""
echo "=== Network ==="
echo "somaxconn              = $(sysctl -n net.core.somaxconn)"
echo "rmem_max               = $(sysctl -n net.core.rmem_max)"
echo "wmem_max               = $(sysctl -n net.core.wmem_max)"
echo "tcp_rmem               = $(sysctl -n net.ipv4.tcp_rmem)"
echo "tcp_wmem               = $(sysctl -n net.ipv4.tcp_wmem)"
echo "ip_local_port_range    = $(sysctl -n net.ipv4.ip_local_port_range)"
 
echo ""
echo "=== File System ==="
echo "fs.file-max            = $(sysctl -n fs.file-max)"
echo "kernel.pid_max         = $(sysctl -n kernel.pid_max)"
 
echo ""
echo "=== THP ==="
echo "THP enabled            = $(cat /sys/kernel/mm/transparent_hugepage/enabled)"
echo "THP defrag             = $(cat /sys/kernel/mm/transparent_hugepage/defrag)"
 
echo ""
echo "=== Limits (current user) ==="
echo "nofile soft             = $(ulimit -Sn)"
echo "nofile hard             = $(ulimit -Hn)"
echo "nproc soft              = $(ulimit -Su)"
echo "nproc hard              = $(ulimit -Hu)"
 
echo ""
echo "=== Disk I/O ==="
for sched in /sys/block/sd*/queue/scheduler; do
    [ -e "$sched" ] || continue   # skip when no sd* devices exist
    disk=$(basename "$(dirname "$(dirname "$sched")")")
    echo "$disk scheduler = $(cat "$sched")"
done
for dev in /sys/block/sd*; do
    [ -e "$dev" ] || continue
    disk=$(basename "$dev")
    echo "$disk readahead = $(blockdev --getra "/dev/$disk" 2>/dev/null)"
done
 
echo ""
echo "=== Mount Options ==="
mount | grep -E "^/dev/" | awk '{print $1, $3, $6}'

7.3 Important Notes

  • On production clusters, change only one parameter at a time and monitor its effect.
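  • Settings written to /etc/sysctl.conf (or /etc/sysctl.d/) persist across reboots, but values changed with sysctl -w or directly with echo commands reset on reboot.
  • limits.conf changes only apply to new sessions. Existing processes are not affected. Services started by systemd do not go through PAM, so limits.conf does not apply to them; set LimitNOFILE, LimitNPROC, and LimitMEMLOCK in the unit file instead.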
  • GRUB changes require running grub2-mkconfig followed by a reboot.
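
Because limits are inherited at process start, the reliable way to confirm what a running daemon actually has is to read /proc/<pid>/limits rather than run ulimit in a fresh shell. The sketch below extracts the soft limit for a named resource. The function name proc_soft_limit is hypothetical, and the resource names are the ones the kernel prints in that file.

```shell
#!/bin/bash
# Sketch: read the soft limit a running process actually has.
# Usage: proc_soft_limit <pid> "<limit name as shown in /proc/<pid>/limits>"
proc_soft_limit() {
    awk -v name="$2" '
        index($0, name) == 1 {
            sub(name, "")       # drop the label; the next field is the soft limit
            print $1
        }' "/proc/$1/limits"
}

# Limits of the current shell ($$); substitute a daemon PID in practice.
echo "nofile  = $(proc_soft_limit "$$" "Max open files")"
echo "nproc   = $(proc_soft_limit "$$" "Max processes")"
echo "memlock = $(proc_soft_limit "$$" "Max locked memory")"
```

Running it against the PID of a DataNode or Impala daemon shows immediately whether a restart is still needed for new limits to take effect.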
