
Complete Guide to Linux Kernel Parameter Tuning for Hadoop, Impala, Spark, and NiFi

A comprehensive guide to essential Linux sysctl.conf, GRUB, limits.conf, and disk I/O parameters for big data workload systems (Hadoop, Impala, Spark, NiFi), including parameter meanings, recommended values, and issues caused by misconfiguration.

Data Dynamics | April 14, 2026 | 17 min read

When operating big data processing systems such as Hadoop, Impala, Spark, and NiFi, failing to properly tune Linux OS-level kernel parameters leads to various problems including performance degradation, OOM (Out of Memory), network timeouts, and file descriptor exhaustion. This guide covers the meaning of each parameter, recommended values, and the actual issues that arise from misconfiguration.

1. Overview

Big data workload systems have fundamentally different resource usage patterns compared to typical web or application servers.

Characteristic	General Server	Big Data Workload
Memory	Several GB, uniform usage	Tens to hundreds of GB, massive buffers/caches
Network	Small request/response	Large-scale data shuffle, block transfers
File Descriptors	Hundreds to thousands	Tens of thousands to hundreds of thousands
Processes/Threads	Tens to hundreds	Thousands to tens of thousands
Disk I/O	Random I/O dominant	Large sequential I/O

Linux default kernel parameters are configured for general-purpose servers, so they must be adjusted for big data workloads. Configuration falls into four main areas:

sysctl.conf — Kernel runtime parameters (memory, network, filesystem)
GRUB Boot Parameters — Kernel options applied at boot time
limits.conf — Per-user/process resource limits
Disk/Filesystem — I/O scheduler, mount options

2. sysctl.conf Parameters

Configured in /etc/sysctl.conf and applied immediately with sysctl -p.

2.1 Memory Related

vm.swappiness

Item	Details
Meaning	Controls how aggressively the kernel uses swap space instead of physical memory (0-100)
Default	60
Recommended	`1` (setting to 0 may trigger OOM Killer immediately)

vm.swappiness = 1

Issues from Misconfiguration:

High value (60+): The system uses swap even when memory is available, causing dramatic performance degradation in Hadoop DataNode, Impala Daemon, and Spark Executor. Impala, being a memory-based query engine, can see query response times increase 10-100x when swap is used.
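Value of 0: On kernel 3.5 and later (including RHEL/CentOS 6.4 and later), the kernel avoids swapping anonymous memory almost entirely, so the OOM Killer can terminate processes under memory pressure even though swap space is free. DataNode or NameNode may be killed unexpectedly.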

vm.dirty_ratio

Item	Details
Meaning	Maximum percentage of available memory (free plus reclaimable pages) that can hold dirty pages (data not yet written to disk). When reached, processes that write are blocked until enough dirty pages have been flushed to disk
Default	20 (%)
Recommended	`10` to `15`

vm.dirty_ratio = 10

Issues from Misconfiguration:

Too high (30+): Dirty pages accumulate massively, then flush all at once causing I/O spikes. HDFS DataNode may fail to send block reports on time, and NiFi FlowFile repository writes may be delayed, triggering back pressure.
Too low (5 or below): Overly frequent disk writes reduce throughput. Spark shuffle write performance drops significantly.

vm.dirty_background_ratio

Item	Details
Meaning	Threshold at which the kernel's background flusher threads (pdflush on kernels before 2.6.32) begin writing dirty pages to disk
Default	10 (%)
Recommended	`5`

vm.dirty_background_ratio = 5

Issues from Misconfiguration:

Too high: Background flush starts late; when dirty pages reach vm.dirty_ratio, synchronous writes occur and applications stall on I/O wait.
Set at or above dirty_ratio: The kernel does not use the configured value; it silently substitutes half of the dirty_ratio threshold. The setting then no longer reflects actual behavior, so always keep it below dirty_ratio.
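
To make the two ratios concrete, the sketch below converts them into absolute sizes. It is a simplification: the function name dirty_limit_mb is made up for this illustration, and it applies the percentage to whatever memory figure is passed in, whereas the kernel works from available memory, so real thresholds are somewhat lower.

```shell
# Approximate dirty-page threshold in MB for a memory size (KB) and a ratio (%).
# Assumption: the supplied memory figure is used as-is; the kernel bases the
# calculation on available memory, so actual thresholds are lower.
dirty_limit_mb() {
    local mem_kb="$1" ratio="$2"
    echo $(( mem_kb * ratio / 100 / 1024 ))
}

# Example: a node with 256 GB of RAM (268435456 KB)
dirty_limit_mb 268435456 10   # limit for vm.dirty_ratio = 10
dirty_limit_mb 268435456 5    # limit for vm.dirty_background_ratio = 5
```

On a live node the first argument could be taken from the MemTotal line of /proc/meminfo and the second from sysctl -n vm.dirty_ratio. With the default ratios (20 and 10), the same node would let roughly twice as much dirty data accumulate before each threshold is reached.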

vm.overcommit_memory

Item	Details
Meaning	Controls the kernel's memory overcommit policy (0: heuristic, 1: always allow, 2: strict limit)
Default	0
Recommended	`1` (for Hadoop/Spark environments)

vm.overcommit_memory = 1

Issues from Misconfiguration:

Value 0 (default): The kernel may reject memory requests, causing fork() failures. Hadoop MapReduce child tasks or Spark executors fail to create processes, leading to repeated task failures.
Value 2: Caps total committed memory at swap plus vm.overcommit_ratio percent of physical memory (50% by default). JVM-based big data systems reserve large amounts of virtual memory, so processes may fail to start even when actual usage is low.
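
Under mode 2 the kernel enforces a commit limit, which it reports as CommitLimit in /proc/meminfo. The following is a rough sketch of how that limit is derived, assuming no huge pages are reserved; commit_limit_kb is a hypothetical helper name, not a system tool.

```shell
# Commit limit (KB) under vm.overcommit_memory = 2:
#   swap + RAM * vm.overcommit_ratio / 100
# Assumption: no huge pages are reserved (the kernel subtracts them from RAM).
commit_limit_kb() {
    local ram_kb="$1" swap_kb="$2" ratio="${3:-50}"
    echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# Example: 128 GB RAM, 8 GB swap, default vm.overcommit_ratio of 50
commit_limit_kb 134217728 8388608
```

For this example the limit works out to 72 GB, well below the 128 GB of physical memory, which is why JVMs that reserve large heaps can fail to start under mode 2.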

vm.zone_reclaim_mode

Item	Details
Meaning	Determines whether to reclaim memory locally when a NUMA node runs low, or allocate from remote nodes
Default	0 on current kernels; older kernels could set 1 automatically on NUMA hardware
Recommended	`0`

vm.zone_reclaim_mode = 0

Issues from Misconfiguration:

Value 1: Aggressively reclaims page cache from the local NUMA node, forcing data to be re-read from disk. HDFS read performance degrades significantly, and Impala queries experience unpredictable performance jitter. Both Cloudera and Hortonworks recommend setting this to 0.

2.2 Network Related

net.core.somaxconn

Item	Details
Meaning	Upper limit on the backlog a listening socket can request with listen(), that is, how many completed connections can wait to be accepted
Default	128 (4096 since kernel 5.4)
Recommended	`4096` or higher

net.core.somaxconn = 4096

Issues from Misconfiguration:

Keeping default (128): In high-concurrency environments, connection refused errors occur. NiFi's ListenHTTP processor fails to receive data, or connections to Spark Shuffle Service are refused, causing shuffle fetch failures.

net.core.rmem_max / net.core.wmem_max

Item	Details
Meaning	Maximum receive (rmem) / send (wmem) buffer size for sockets (bytes)
Default	212992 (~208KB)
Recommended	`16777216` (16MB) or higher

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

Issues from Misconfiguration:

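Keeping default: These values cap the buffer size an application can request with SO_RCVBUF/SO_SNDBUF. With a ceiling of about 208KB, the TCP window cannot grow large enough to keep a fast link full, so bottlenecks occur during HDFS block transfers (128MB blocks), Spark shuffle, and Impala broadcast joins. Actual throughput on 10Gbps networks may stall at 1-2Gbps.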

net.ipv4.tcp_rmem / net.ipv4.tcp_wmem

Item	Details
Meaning	TCP socket receive/send buffer sizes (min, default, max)
Default	tcp_rmem: 4096 87380 6291456; tcp_wmem: 4096 16384 4194304 (typical; the max values scale with system memory)
Recommended	`4096 65536 16777216`

net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

Issues from Misconfiguration:

Low max value: TCP windows cannot expand sufficiently during large data transfers, failing to utilize network bandwidth. Inter-node data replication (HDFS replication) slows down, increasing data ingestion time.
Default value too high: All TCP connections allocate large buffers by default, wasting memory. DataNodes with thousands of connections may experience memory shortages.
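
How large the maximum buffer needs to be follows from the bandwidth-delay product: the amount of data that must be in flight to keep a link busy. The sketch below computes it with integer arithmetic; bdp_bytes is a hypothetical helper, and the link speeds and round-trip times are illustrative assumptions rather than measurements.

```shell
# Bandwidth-delay product in bytes.
# $1 = link speed in Mbit/s, $2 = round-trip time in microseconds.
# Mbit/s multiplied by microseconds gives bits in flight; divide by 8 for bytes.
bdp_bytes() {
    local mbit_per_s="$1" rtt_us="$2"
    echo $(( mbit_per_s * rtt_us / 8 ))
}

bdp_bytes 10000 1000    # 10 Gbit/s link, 1 ms RTT
bdp_bytes 10000 10000   # 10 Gbit/s link, 10 ms RTT
```

At a 10 ms round trip the product is about 12.5 MB, which exceeds a 6291456-byte maximum but fits within 16777216. At 1 ms it is about 1.25 MB, so the larger maximum matters most as round-trip time grows.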

net.ipv4.tcp_max_syn_backlog

Item	Details
Meaning	Maximum size of the SYN_RECEIVED connection wait queue
Default	1024
Recommended	`4096` or higher

net.ipv4.tcp_max_syn_backlog = 4096

Issues from Misconfiguration:

Keeping default: When all nodes simultaneously attempt connections after a cluster restart, the SYN queue overflows and connections are dropped. YARN NodeManager re-registration and Impala StateStore subscription renewal fail, extending cluster recovery time.

net.ipv4.ip_local_port_range

Item	Details
Meaning	Range of ephemeral ports that can be assigned to client sockets
Default	32768 60999
Recommended	`10000 65535`

net.ipv4.ip_local_port_range = 10000 65535

Issues from Misconfiguration:

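Narrow port range: "Cannot assign requested address" errors occur during Spark shuffle and NiFi Site-to-Site transfers that create massive outbound connections. Ports in TIME_WAIT state become exhausted when many connections are rapidly opened and closed. Note that a range starting at 10000 overlaps the listen ports of some services (for example Impala's 21000 and 21050); such ports can be protected with net.ipv4.ip_local_reserved_ports.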
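
The effect of widening the range is simple arithmetic. As a small sketch (port_count is a hypothetical helper name), the number of usable ephemeral ports for the default and recommended ranges can be compared like this:

```shell
# Number of ephemeral ports in a range given as two arguments: low high
# (the same two numbers that net.ipv4.ip_local_port_range holds).
port_count() {
    local low="$1" high="$2"
    echo $(( high - low + 1 ))
}

port_count 32768 60999   # default range
port_count 10000 65535   # recommended range
```

The recommended range provides roughly twice as many ports as the default. On a live system the current range can be passed in with port_count $(sysctl -n net.ipv4.ip_local_port_range), left unquoted so that the two numbers become separate arguments.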

2.3 Filesystem/IPC Related

fs.file-max

Item	Details
Meaning	Maximum number of file descriptors that can be opened system-wide
Default	~100000 (varies by system)
Recommended	`6553600` or higher

fs.file-max = 6553600

Issues from Misconfiguration:

Low value: "Too many open files in system" errors occur. HDFS DataNode opens descriptors for the block and checksum files of every block being read or written, plus one per socket, so a busy node storing tens of thousands of blocks can experience complete read/write failure when file descriptors are exhausted. NiFi also uses many file descriptors for FlowFile and Content Repository access.

fs.nr_open

Item	Details
Meaning	Hard upper limit on file descriptors a single process can open
Default	1048576
Recommended	`1048576` or higher (must be at least as large as the limits.conf nofile value)

fs.nr_open = 1048576

Issues from Misconfiguration:

Lower than limits.conf nofile: The kernel rejects any nofile value above fs.nr_open, so the limit set in limits.conf is not applied, and sessions that go through pam_limits (including logins for the affected user) can fail.

kernel.pid_max

Item	Details
Meaning	Maximum number of PIDs that can be allocated system-wide
Default	32768 (often already 4194304 on recent 64-bit distributions)
Recommended	`4194304`

kernel.pid_max = 4194304

Issues from Misconfiguration:

Keeping default: Every thread consumes a PID, so when YARN runs thousands of containers, each with many JVM threads, PIDs are exhausted and no new processes or threads can be created. "fork: retry: Resource temporarily unavailable" errors occur, halting all jobs on the affected node.

3. GRUB Boot Parameters

Configured in GRUB_CMDLINE_LINUX in /etc/default/grub. After changes, run grub2-mkconfig -o /boot/grub2/grub.cfg (on Debian/Ubuntu, update-grub) and reboot.

3.1 Disable Transparent Huge Pages (THP)

GRUB_CMDLINE_LINUX="... transparent_hugepage=never"

Item	Details
Meaning	Controls the kernel's automatic allocation/merging of 2MB huge pages
Default	`always` or `madvise`
Recommended	`never`

Issues from Misconfiguration:

THP enabled (always): The kernel merges memory pages (compaction) in the background, causing unpredictable latency spikes. This is one of the most common and critical misconfigurations in big data systems.
- Hadoop: DataNode GC times become abnormally long, causing NameNode to mark DataNodes as dead.
- Impala: Sudden pauses during query execution lead to query timeouts.
- Spark: Extended executor GC pauses cause task failures and excessive speculation.
- NiFi: Stalls during FlowFile processing cause back pressure to cascade.

Cloudera, Hortonworks (now Cloudera), and MapR all explicitly state in their documentation that THP must be disabled.

You can also verify and disable at runtime:

# Check current status
cat /sys/kernel/mm/transparent_hugepage/enabled
 
# Disable at runtime
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
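
Both sysfs files list every possible value and mark the active one with square brackets, for example "always madvise [never]". A minimal sketch for extracting just the active value, so a script can compare it against "never", might look like this (active_choice is a hypothetical helper name):

```shell
# Print the active (bracketed) entry from a sysfs selector string,
# e.g. "always madvise [never]" yields "never".
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_choice "always madvise [never]"
```

In practice the argument would be the file content, for instance active_choice "$(cat /sys/kernel/mm/transparent_hugepage/enabled)".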

3.2 NUMA Settings

GRUB_CMDLINE_LINUX="... numa=off"

Item	Details
Meaning	Disables the kernel's NUMA (Non-Uniform Memory Access) awareness, so that all memory is treated as a single node
Default	NUMA enabled
Recommended	Depends on the environment (see below)

NUMA Configuration Strategy:

Environment	Recommended	Reason
JVM-based (Hadoop, Spark)	`numa=off` or `numactl --interleave=all`	JVM is not NUMA-aware by default, causing memory skew
Impala (C++ based)	Keep NUMA enabled + `vm.zone_reclaim_mode=0`	Impala supports NUMA-aware memory allocation

Issues from Misconfiguration:

NUMA enabled with JVM workloads: JVM heap memory concentrates on one NUMA node, causing frequent remote memory access with 2-3x increased memory access latency. GC times increase and throughput drops.
NUMA disabled with Impala: Impala cannot perform NUMA topology-based optimizations, reducing memory access performance.

3.3 CPU Frequency Scaling / Power Management

GRUB_CMDLINE_LINUX="... intel_pstate=disable processor.max_cstate=1"

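Or configure the equivalent settings in the BIOS. On Intel systems that use the intel_idle driver, C-states are limited with intel_idle.max_cstate rather than processor.max_cstate.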

Item	Details
Meaning	Controls CPU frequency scaling and power state transitions based on load
Default	Power saving enabled (`powersave` or `ondemand` governor)
Recommended	`performance` governor, limit C-States

# Set CPU governor to performance (cpupower applies this to all CPUs by default)
cpupower frequency-set -g performance

# Or write the governor directly through sysfs
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$cpu"
done

Issues from Misconfiguration:

Power saving enabled: CPU takes tens to hundreds of microseconds to wake from deep C-states. For Impala, which processes many short requests, this latency accumulates and increases query latency.
ondemand governor: CPU frequency ramp-up takes time, causing reduced performance at the start of burst workloads. Early stages of Spark jobs may experience degraded performance.

4. Limits Configuration (/etc/security/limits.conf)

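Configured in /etc/security/limits.conf or in files under the /etc/security/limits.d/ directory. These limits are applied by pam_limits when a session is opened, so daemons started directly by systemd take their limits from unit directives such as LimitNOFILE= and LimitNPROC= instead.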

4.1 nofile (Open File Descriptors)

*    soft    nofile    65536
*    hard    nofile    65536

Item	Details
Meaning	Maximum number of file descriptors a process can open
Default	1024 (soft), 4096 (hard)
Recommended	`65536` or higher (`131072` recommended for Impala and NiFi)

Issues from Misconfiguration:

Keeping default (1024):
- HDFS DataNode: "Too many open files" errors occur as block count increases, causing block read/write failures.
- Impala: Queries scanning many partitions fail due to insufficient file descriptors.
- NiFi: Content Repository access fails when processing large numbers of concurrent FlowFiles.
- Spark: Reading shuffle files with many partitions fails.

4.2 nproc (Max User Processes)

*    soft    nproc    65536
*    hard    nproc    65536

Item	Details
Meaning	Maximum number of processes (including threads) per user
Default	4096
Recommended	`65536` or higher

Issues from Misconfiguration:

Keeping default (4096): JVM creates an OS thread for each Java thread, and each thread counts against nproc. In environments where Spark Executors use hundreds of threads, YARN containers fail with "java.lang.OutOfMemoryError: unable to create new native thread". Because it is reported as an OutOfMemoryError, this is easily mistaken for a heap shortage, but the actual cause is the nproc limit.

4.3 memlock (Memory Lock)

*    soft    memlock    unlimited
*    hard    memlock    unlimited

Item	Details
Meaning	Maximum amount of memory (KB) a process can lock to prevent swapping
Default	64
Recommended	`unlimited`

Issues from Misconfiguration:

Keeping default:
- Impala: Impala locks memory to prevent swapping for performance. With low memlock limits, the Impala daemon may fail to start or be unable to lock memory, causing swap usage.
- HDFS DataNode: Locking cached blocks in memory (centralized cache management, sized by dfs.datanode.max.locked.memory) fails when the memlock limit is lower than the configured amount.

5. Disk/Filesystem Optimization

5.1 I/O Scheduler

Item	Details
Meaning	Algorithm that determines the order of disk I/O requests
HDD Recommended	`deadline` (`mq-deadline` on multi-queue kernels)
SSD/NVMe Recommended	`noop` or `none` (multi-queue)

# Check current scheduler
cat /sys/block/sda/queue/scheduler
 
# Runtime change (HDD)
echo deadline > /sys/block/sda/queue/scheduler
 
# Runtime change (SSD)
echo noop > /sys/block/sda/queue/scheduler

Permanent configuration (GRUB; the elevator= parameter has no effect on recent kernels that only provide multi-queue schedulers):

GRUB_CMDLINE_LINUX="... elevator=deadline"

Or via udev rules:

# /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop"
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"

Issues from Misconfiguration:

Using cfq on SSD: CFQ (Completely Fair Queuing) is designed to minimize seek time on HDDs. On SSDs, it adds unnecessary request sorting and wait times, degrading IOPS performance by 30-50%.
Using noop on HDD: Without request order optimization, disk heads move inefficiently, degrading sequential read performance.
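
Because scheduler names differ between legacy kernels (deadline, noop) and multi-queue kernels (mq-deadline, none), a script that sets the scheduler has to check what the running kernel offers. The following is a sketch under stated assumptions: pick_scheduler is a hypothetical helper, and its preference order simply mirrors the recommendations above.

```shell
# Choose a scheduler name for a device from what the running kernel offers.
# $1 = content of /sys/block/<dev>/queue/rotational (1 = HDD, 0 = SSD/NVMe)
# $2 = content of /sys/block/<dev>/queue/scheduler
pick_scheduler() {
    local rotational="$1" available candidates want have
    # Strip the brackets that mark the currently active entry.
    available=$(printf '%s\n' "$2" | tr -d '[]')
    if [ "$rotational" = "1" ]; then
        candidates="mq-deadline deadline"
    else
        candidates="none noop"
    fi
    for want in $candidates; do
        for have in $available; do
            if [ "$want" = "$have" ]; then
                echo "$want"
                return 0
            fi
        done
    done
    return 1   # none of the preferred schedulers is offered
}

pick_scheduler 1 "noop deadline [cfq]"
pick_scheduler 0 "[mq-deadline] kyber bfq none"
```

The result can then be written to /sys/block/<dev>/queue/scheduler. Matching whole words keeps "deadline" from being confused with "mq-deadline".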

5.2 Mount Options

# /etc/fstab example
/dev/sdb1  /data1  ext4  defaults,noatime,nodiratime  0  2
/dev/sdc1  /data2  xfs   defaults,noatime,nodiratime  0  2

Option	Meaning
`noatime`	Do not update file access time on read
`nodiratime`	Do not update directory access time on read (already implied by `noatime`)

Issues from Misconfiguration:

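Without noatime: Reads can trigger access-time metadata writes. Under strictatime this happens on every read; the default relatime mode limits updates to files whose access time is older than their last modification or more than a day old, but still issues some writes. When HDFS DataNode reads blocks, this additional write I/O adds unnecessary disk load in read-heavy workloads, by as much as 20-30% when every read updates the access time.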

5.3 Read-ahead Settings

Item	Details
Meaning	Amount of data the kernel pre-reads when it detects sequential reads (in 512-byte sectors)
Default	256 (128KB)
Recommended	`2048` to `4096` (1MB to 2MB)

# Check current value
blockdev --getra /dev/sda
 
# Change setting
blockdev --setra 2048 /dev/sda

Issues from Misconfiguration:

Too low: Sequential reads of 128MB HDFS blocks are broken into many small requests with frequent disk seeks, reducing throughput. MapReduce and Spark full table scan performance suffers.
Too high (8192+): Excessive pre-reading of unnecessary data in random I/O workloads wastes memory and delays other I/O requests.
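
Since blockdev counts in 512-byte sectors, it is easy to be off by a factor of two when choosing a value. A small sketch of the conversion in both directions (the helper names are made up for this example):

```shell
# Convert a read-ahead value in 512-byte sectors (as used by blockdev) to KB.
ra_sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

# Convert a desired read-ahead size in KB to sectors for blockdev --setra.
ra_kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

ra_sectors_to_kb 256     # default
ra_sectors_to_kb 2048    # lower end of the recommended range
ra_kb_to_sectors 2048    # sectors needed for 2MB of read-ahead
```

For example, blockdev --setra "$(ra_kb_to_sectors 1024)" /dev/sda would request 1MB of read-ahead.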

6. Workload-Specific Recommended Settings Summary

6.1 Hadoop (HDFS/YARN)

# /etc/sysctl.conf
vm.swappiness = 1
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.overcommit_memory = 1
vm.zone_reclaim_mode = 0
fs.file-max = 6553600
kernel.pid_max = 4194304
net.core.somaxconn = 4096
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.ip_local_port_range = 10000 65535

# /etc/security/limits.conf
hdfs    soft    nofile    65536
hdfs    hard    nofile    65536
hdfs    soft    nproc     65536
hdfs    hard    nproc     65536
yarn    soft    nofile    65536
yarn    hard    nofile    65536
yarn    soft    nproc     65536
yarn    hard    nproc     65536

6.2 Impala

Impala is C++-based and manages memory directly, requiring additional settings.

# Additional sysctl settings
net.ipv4.tcp_max_syn_backlog = 4096
 
# /etc/security/limits.conf
impala  soft    nofile    131072
impala  hard    nofile    131072
impala  soft    nproc     65536
impala  hard    nproc     65536
impala  soft    memlock   unlimited
impala  hard    memlock   unlimited

For Impala, THP disabling and memlock unlimited are particularly critical. Missing just these two settings can degrade query performance by 10x or more.

6.3 Spark

# Additional Spark considerations
vm.overcommit_memory = 1      # Prevent Executor fork failures
kernel.pid_max = 4194304       # Support large container counts
 
# /etc/security/limits.conf
spark   soft    nofile    65536
spark   hard    nofile    65536
spark   soft    nproc     65536
spark   hard    nproc     65536

With Spark Dynamic Allocation, executor counts fluctuate rapidly, so nproc and pid_max settings must be sufficient.

6.4 NiFi

NiFi handles massive concurrent connections and file I/O, making file descriptor and network settings critical.

# /etc/security/limits.conf
nifi    soft    nofile    131072
nifi    hard    nofile    131072
nifi    soft    nproc     65536
nifi    hard    nproc     65536
 
# Additional NiFi sysctl
net.core.somaxconn = 4096
net.ipv4.ip_local_port_range = 10000 65535

For NiFi, it is also recommended to use -XX:+UseG1GC in bootstrap.conf JVM options and adjust thread pool sizes in nifi.properties.

7. Applying and Verifying Settings

7.1 Application Order

1. Edit /etc/sysctl.conf          → Apply with sysctl -p
2. Edit /etc/security/limits.conf → Requires re-login
3. Edit /etc/default/grub          → Requires grub2-mkconfig + reboot
4. Disk I/O settings              → Runtime immediate (permanent via udev rules)
5. Mount options                   → Remount or reboot

7.2 Verification Script

Use the following script to check all key settings at once:

#!/bin/bash
echo "=== Memory ==="
echo "vm.swappiness          = $(sysctl -n vm.swappiness)"
echo "vm.dirty_ratio         = $(sysctl -n vm.dirty_ratio)"
echo "vm.dirty_background    = $(sysctl -n vm.dirty_background_ratio)"
echo "vm.overcommit_memory   = $(sysctl -n vm.overcommit_memory)"
echo "vm.zone_reclaim_mode   = $(sysctl -n vm.zone_reclaim_mode)"
 
echo ""
echo "=== Network ==="
echo "somaxconn              = $(sysctl -n net.core.somaxconn)"
echo "rmem_max               = $(sysctl -n net.core.rmem_max)"
echo "wmem_max               = $(sysctl -n net.core.wmem_max)"
echo "tcp_rmem               = $(sysctl -n net.ipv4.tcp_rmem)"
echo "tcp_wmem               = $(sysctl -n net.ipv4.tcp_wmem)"
echo "ip_local_port_range    = $(sysctl -n net.ipv4.ip_local_port_range)"
echo "tcp_max_syn_backlog    = $(sysctl -n net.ipv4.tcp_max_syn_backlog)"
 
echo ""
echo "=== File System ==="
echo "fs.file-max            = $(sysctl -n fs.file-max)"
echo "fs.nr_open             = $(sysctl -n fs.nr_open)"
echo "kernel.pid_max         = $(sysctl -n kernel.pid_max)"
 
echo ""
echo "=== THP ==="
echo "THP enabled            = $(cat /sys/kernel/mm/transparent_hugepage/enabled)"
echo "THP defrag             = $(cat /sys/kernel/mm/transparent_hugepage/defrag)"
 
echo ""
echo "=== Limits (current user) ==="
echo "nofile soft             = $(ulimit -Sn)"
echo "nofile hard             = $(ulimit -Hn)"
echo "nproc soft              = $(ulimit -Su)"
echo "nproc hard              = $(ulimit -Hu)"
 
echo ""
echo "=== Disk I/O ==="
for sched in /sys/block/sd*/queue/scheduler; do
    [ -e "$sched" ] || continue          # skip when no sd* devices exist
    dev=${sched#/sys/block/}; dev=${dev%%/*}
    echo "$dev scheduler = $(cat "$sched")"
done
for disk in /sys/block/sd*; do
    [ -e "$disk" ] || continue
    dev=${disk##*/}
    echo "$dev readahead = $(blockdev --getra "/dev/$dev" 2>/dev/null)"
done
 
echo ""
echo "=== Mount Options ==="
mount | grep -E "^/dev/" | awk '{print $1, $3, $6}'
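
The script above only prints values. To flag deviations automatically, each value can be compared with its target. The helper below is a sketch of that idea; check_setting is a hypothetical name, and the expected values are whatever targets you have chosen for the node.

```shell
# Compare an expected value with the actual one and report the result.
# Whitespace is normalized because multi-value parameters such as
# net.ipv4.tcp_rmem are printed with tabs between the fields.
check_setting() {
    local name="$1" expected actual
    expected=$(printf '%s' "$2" | tr -s '[:space:]' ' ')
    actual=$(printf '%s' "$3" | tr -s '[:space:]' ' ')
    if [ "$expected" = "$actual" ]; then
        echo "OK       $name = $actual"
    else
        echo "MISMATCH $name = $actual (expected $expected)"
        return 1
    fi
}
```

A call such as check_setting vm.swappiness 1 "$(sysctl -n vm.swappiness)" prints one line per parameter and returns a non-zero status on a mismatch, so the checks can be chained in a larger script.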

7.3 Important Notes

On production clusters, change only one parameter at a time and monitor its effect.
Settings written to /etc/sysctl.conf persist across reboots, but values changed with sysctl -w or by echoing into /proc/sys or /sys are lost on reboot.
limits.conf changes only apply to new sessions. Existing processes are not affected.
GRUB changes require running grub2-mkconfig followed by a reboot.
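
Because limits.conf changes do not reach processes that are already running, it helps to read the limits a daemon actually has from /proc/<pid>/limits rather than from ulimit in your own shell. A minimal sketch, assuming the usual column layout of that file (soft_limit is a hypothetical helper name):

```shell
# Print the soft limit for one resource from /proc/<pid>/limits content
# supplied on standard input. $1 = limit name exactly as it appears in the
# file, for example "Max open files" or "Max processes".
soft_limit() {
    awk -v name="$1" '
        index($0, name) == 1 {
            rest = substr($0, length(name) + 1)  # text after the name
            split(rest, field, " ")              # soft, hard, units
            print field[1]
        }'
}
```

Usage: soft_limit "Max open files" < /proc/<pid>/limits, where <pid> is the process ID of the DataNode, impalad, or NiFi process being checked.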
