Linux Kernel Tuning for High Performance


Linux defaults are conservative and work well for most workloads. But high-performance servers handling thousands of concurrent connections or sustained high throughput need tuning.

Understanding the Parameters

sysctl

Runtime kernel parameters:

# View current value
sysctl net.core.somaxconn

# Set temporarily
sysctl -w net.core.somaxconn=65535

# Set permanently in /etc/sysctl.conf
net.core.somaxconn=65535

# Apply
sysctl -p
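
On modern distributions, a drop-in file under /etc/sysctl.d/ is usually preferred over editing /etc/sysctl.conf directly; the file name below is just an illustration.

```
# /etc/sysctl.d/99-tuning.conf  (any NN-name.conf works; later files win)
net.core.somaxconn = 65535
```

Apply every configured file, including the drop-ins, with sysctl --system.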

ulimits

Per-process resource limits:

# View current limits
ulimit -a

# Set in shell
ulimit -n 65535  # Open files

# Set permanently in /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535

Network Tuning

Connection Queue

# Upper bound on the listen() accept queue of any socket
net.core.somaxconn = 65535

# SYN queue size (pending connections)
net.ipv4.tcp_max_syn_backlog = 65535

# Per-CPU queue for packets arriving faster than the kernel can process them
net.core.netdev_max_backlog = 65535
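
To see whether these queues are actually overflowing, compare each listener's live accept-queue depth against its limit, and check the kernel's cumulative overflow counters:

```shell
# Listening sockets: Recv-Q = current accept-queue depth, Send-Q = its limit
ss -lnt

# Cumulative counters; the first TcpExt line names the fields, the second
# holds the values (look for ListenOverflows and ListenDrops)
grep '^TcpExt' /proc/net/netstat
```

A steadily climbing ListenOverflows means the application is not accepting connections fast enough, or the backlog is still too small.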

Memory for Connections

# TCP read/write buffer sizes (min, default, max in bytes)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Total memory for TCP (low/pressure/max thresholds, in pages, not bytes)
net.ipv4.tcp_mem = 786432 1048576 1572864

# Core network buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144

Connection Reuse

# Reuse TIME_WAIT sockets for new outgoing connections
net.ipv4.tcp_tw_reuse = 1

# FIN-WAIT-2 timeout in seconds (often mistaken for a TIME_WAIT setting;
# the TIME_WAIT period itself is fixed at 60 seconds in the kernel)
net.ipv4.tcp_fin_timeout = 30

# Local port range
net.ipv4.ip_local_port_range = 1024 65535
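
With heavy connection churn, TIME_WAIT sockets and ephemeral port exhaustion are the usual bottlenecks; both are easy to check:

```shell
# Sockets currently in TIME_WAIT
ss -tan state time-wait | wc -l

# First and last ephemeral port usable for outgoing connections
cat /proc/sys/net/ipv4/ip_local_port_range
```

If the TIME_WAIT count approaches the size of the port range, outgoing connections to a given destination start failing with EADDRNOTAVAIL.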

Keep-Alive

# When to start keepalive probes (seconds)
net.ipv4.tcp_keepalive_time = 600

# Interval between probes
net.ipv4.tcp_keepalive_intvl = 60

# Number of probes before dropping
net.ipv4.tcp_keepalive_probes = 3
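
With the values above, a dead peer is detected after at worst tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes = 600 + 60 * 3 = 780 seconds. A quick sanity check against the running kernel:

```shell
# Worst-case dead-peer detection time from the live sysctl values
t=$(cat /proc/sys/net/ipv4/tcp_keepalive_time)
i=$(cat /proc/sys/net/ipv4/tcp_keepalive_intvl)
p=$(cat /proc/sys/net/ipv4/tcp_keepalive_probes)
echo "dead peers detected after at most $((t + i * p)) seconds"
```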

File Descriptors

Each connection consumes a file descriptor, and the default limits are far too low for servers handling tens of thousands of connections.

# System-wide maximum
fs.file-max = 2097152

# Per-process (in limits.conf)
* soft nofile 1048576
* hard nofile 1048576

# For systemd services
LimitNOFILE=1048576
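
Note that limits.conf applies only to PAM login sessions; systemd services ignore it. A drop-in unit sets the limit instead (the path below is illustrative, for an nginx service):

```
# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=1048576
```

After adding the drop-in, run systemctl daemon-reload and restart the service.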

Check usage:

# System-wide
cat /proc/sys/fs/file-nr
# Output: allocated  allocated-but-unused  maximum

# Per-process (pgrep -o picks the oldest match, e.g. the nginx master)
ls /proc/$(pgrep -o nginx)/fd | wc -l

Memory Tuning

Swappiness

# How aggressively to swap (0-100)
vm.swappiness = 10  # Prefer RAM, minimize swapping

For database servers, consider going as low as 1. On modern kernels a value of 0 avoids swapping almost entirely, which can invite the OOM killer under memory pressure.

Dirty Pages

# Start writing dirty pages at 10% of RAM
vm.dirty_ratio = 10

# Background writing at 5%
vm.dirty_background_ratio = 5

# Max age of dirty pages (centiseconds)
vm.dirty_expire_centisecs = 1500

Overcommit

# 0: heuristic overcommit
# 1: always allow
# 2: strict (check available)
vm.overcommit_memory = 0

# Ratio for mode 2
vm.overcommit_ratio = 80
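
In mode 2 the kernel refuses allocations once total commitments exceed CommitLimit (swap plus overcommit_ratio percent of RAM); both numbers are visible in /proc/meminfo:

```shell
# CommitLimit is only enforced in overcommit mode 2; Committed_AS is the
# total address space currently promised to all processes
grep -E '^(CommitLimit|Committed_AS)' /proc/meminfo
```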

Process Limits

Maximum User Processes

# /etc/security/limits.conf
* soft nproc 65535
* hard nproc 65535

Maximum Threads

# Kernel limit
kernel.pid_max = 4194304
kernel.threads-max = 4194304

Application-Specific

For Nginx

# nginx.conf
worker_rlimit_nofile 65535;
events {
    worker_connections 16384;
    use epoll;
    multi_accept on;
}

For Redis

# /etc/sysctl.conf
vm.overcommit_memory = 1
net.core.somaxconn = 65535

For PostgreSQL

# Shared memory
kernel.shmmax = 68719476736  # 64GB, in bytes
kernel.shmall = 16777216     # in pages (64GB with 4KB pages)

# Semaphores
kernel.sem = 1000 32000 32 1000

# HugePages (if using)
vm.nr_hugepages = 1000
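
Reserving huge pages only helps if PostgreSQL actually gets them; /proc/meminfo shows whether the pool was allocated and how much of it is in use:

```shell
# HugePages_Total vs. HugePages_Free shows pool usage;
# Hugepagesize is the page size the pool was built with
grep ^Huge /proc/meminfo
```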

Practical Example

High-traffic web server:

# /etc/sysctl.conf

# Network
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30

# TCP buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# File descriptors
fs.file-max = 2097152

# Memory
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

# /etc/security/limits.conf
* soft nofile 1048576
* hard nofile 1048576
* soft nproc 65535
* hard nproc 65535

Verifying Changes

# Check sysctl
sysctl -a | grep somaxconn

# Check limits for a running process (-o picks the oldest matching PID)
cat /proc/$(pgrep -o -f "your-app")/limits

# Check network buffers
ss -m

# Monitor file descriptors (lsof -p takes a single PID)
lsof -p $(pgrep -o nginx) | wc -l

Monitoring

Watch for symptoms:

# Connection tracking overflow
dmesg | grep "nf_conntrack: table full"

# SYN floods
netstat -s | grep -i syn

# Drop statistics
netstat -s | grep -i drop
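
The conntrack counters live under /proc as well, so table pressure can be watched directly (the files exist only when the nf_conntrack module is loaded):

```shell
# Current entries vs. configured maximum; when count nears max,
# new connections are dropped with the "table full" message above
for f in nf_conntrack_count nf_conntrack_max; do
    if [ -r "/proc/sys/net/netfilter/$f" ]; then
        echo "$f: $(cat /proc/sys/net/netfilter/$f)"
    fi
done
```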

Common Mistakes

  1. Forgetting to apply changes: run sysctl -p after editing, and remember limits.conf only affects new login sessions
  2. Setting without measuring: Tune based on metrics, not guesses
  3. Over-tuning: Default values work for most workloads
  4. Ignoring the application: Kernel tuning can’t fix app inefficiency

Final Thoughts

Start with conservative changes. Measure impact. Tune incrementally.

Most applications never need kernel tuning. But when you hit limits—connection errors, latency spikes, throughput caps—these parameters matter.


Measure twice, tune once.
