Diagnosing and Fixing High CPU Usage

By Raman Kumar

Updated on Sep 22, 2025

In this tutorial, we're learning how to diagnose and fix high CPU usage: a practical guide for busy web hosts.

When a server suddenly starts chewing CPU like it’s training for a contest, we need a fast, repeatable workflow that finds the culprit and calms things down. This guide walks our ops/dev team through a step-by-step troubleshooting routine: fast triage, isolation, evidence collection, short-term mitigation, and long-term fixes. No fluff. Just the commands, what to look for, and what to do next.

Check overall load and uptime

uptime
cat /proc/loadavg

If load average is much higher than the number of CPU cores, the system runqueue is long — that’s a red flag.
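
As a quick sanity check, put the two numbers side by side (a minimal sketch; thresholds depend on the workload):

nproc                            # number of CPU cores
cut -d' ' -f1-3 /proc/loadavg    # 1-, 5- and 15-minute load averages
# rule of thumb: a sustained 1-minute load well above the core count means the runqueue is backed up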

See live CPU usage and which processes dominate:

top -o %CPU        # or htop for interactive view
ps aux --sort=-%cpu | head -n 10

top shows %user, %system, %iowait and %st (steal). High %iowait means I/O, not CPU. High %st usually means VM hypervisor contention.

CPU usage per core (quick samples):

mpstat -P ALL 1 3   # from sysstat: per-core usage, three 1-second samples

If the server is forking processes uncontrollably or the load is dominated by kernel work and interrupts, proceed to the isolation steps below.

Step 1 — Identify the offending process

List top CPU consumers (persistent view):

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 15

Track per-process CPU over time (helps separate spikes from steady usage):

pidstat -u -p ALL 1 10   # per-process CPU, 1-second samples, 10 iterations

If the workload is containerized:

docker stats --no-stream
docker top <container>

Key signs:

  • One PID at 90% CPU → single runaway process (can be killed/limited).
  • Many PIDs at small CPU → possible cron storms, threaded app, or I/O/lock issues.
  • High kernel % (top shows %system) → kernel work: interrupts, network stack, or filesystem.

Step 2 — Is it CPU work or waiting on I/O / network / steal?

Use top/mpstat to cross-check:

  • High %user → userland code (app, interpreter, JIT).
  • High %system → kernel activity (network, interrupts).
  • High %iowait → disk I/O bottleneck — CPU appears idle waiting for I/O.
  • High %st (steal) → hypervisor is starving our VM of CPU time; move/resize VM.

Also inspect interrupts and softirqs:

cat /proc/interrupts
watch -n1 cat /proc/softirqs

Excessive interrupts on a NIC can point to noisy network traffic or DDoS.
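
To check whether a single NIC queue is dominating, something like this works (assumes the interface is named eth0; substitute the real device name):

watch -n1 "grep -E 'CPU|eth0' /proc/interrupts"   # per-core interrupt counts for the NIC queues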

Step 3 — Quick mitigation 

When traffic is crashing the host, act fast:

If one process is runaway, reduce its priority immediately:

sudo renice +19 -p <PID>

Temporarily limit CPU use (soft cap):

sudo cpulimit --pid <PID> --limit 40 --background

cpulimit pauses and resumes the process to keep its average CPU under the limit. For a more robust cap, use systemd/cgroups CPUQuota (see Step 6).

If traffic or a DDoS is suspected, inspect connection counts:

ss -Htn state established '( sport = :http or sport = :https )' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head
# counts established connections per client IP; adjust the awk column ($4 vs $5) if your ss version drops the State column when filtering by state

Apply immediate IP blocks / rate-limits (short term): use firewall rules (nftables/iptables) or a WAF/CDN like Cloudflare to absorb volumetric attacks. Cloud/CDN rate limiting and WAF rules are primary defenses for large DDoS.
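
As a stopgap on the host itself, a per-source connection cap can take the edge off; a sketch only, with an arbitrary threshold of 50 and assuming the connlimit match is available:

sudo iptables -I INPUT -p tcp --syn --dport 443 -m connlimit --connlimit-above 50 -j REJECT --reject-with tcp-reset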

If VM steal time is high

Contact the host provider or migrate the VM. High %st means the physical host is overcommitted; the fix is a larger instance or a move to a less contended physical host.
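
To confirm the steal is sustained rather than a blip, sample over a longer window with sysstat's sar (assuming the package is installed):

sar -u 1 30    # watch the %steal column; values persistently above a few percent point to host contention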

Step 4 — Collect forensic evidence

Snapshot CPU profile for the process (sampling profiler):

For native C/C++/Go/Python/Java processes: use perf to see where CPU cycles go:

sudo perf top -p <PID>          # live, quick look
sudo perf record -F 99 -p <PID> -g -- sleep 60
sudo perf script > out.perf

# convert to flamegraph (FlameGraph scripts from Brendan Gregg)

Flame graphs expose hot call paths and are the fastest way to find the hotspots that matter.
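
A rough sketch of the conversion, assuming Brendan Gregg's FlameGraph repository is cloned alongside the capture:

git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flame.svg   # open the SVG in a browser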

For interpreted languages:

PHP: enable the FPM slowlog and check backtraces of slow requests (request_slowlog_timeout + slowlog). The PHP manual covers the FPM pool directives. Use the slowlog before killing processes so we can fix the root cause.

Example FPM pool config:

request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/www-slow.log

Node.js: use 0x or clinic.js to capture flame graphs, or attach a profiler via node --inspect.

Java: jstack, jcmd, and profilers (async-profiler).
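
For example (illustrative only; server.js is a placeholder, and 0x is installed separately via npm):

0x -- node server.js                          # Node.js: writes a flame graph when the process exits
jstack <PID> > /tmp/jstack.$(date +%s).txt    # Java: thread dump; repeat a few times to spot spinning threads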

When a process is spinning on syscalls, trace it (note: strace adds overhead on a busy process, so keep the window short):

sudo strace -ttt -p <PID> -o /tmp/strace.pid.log

# watch for repeated syscalls (poll, futex, write, accept)

strace points to blocking syscalls and can quickly reveal tight loops or repeated network calls.

DB servers: check DB processlists and slow logs (MySQL SHOW PROCESSLIST;, Postgres pg_stat_activity) — sometimes CPU is from expensive queries.
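
A quick look from the shell, assuming local access and sufficient privileges (the Postgres query is a generic example):

mysql -e "SHOW FULL PROCESSLIST;"
psql -c "SELECT pid, state, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC;"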

Step 5 — Interpret the evidence, common causes and fixes

Match the symptom to likely causes and the pragmatic fix:

Single interpreted process high %user

Likely: inefficient code, infinite loop, expensive regex, unbounded CPU work.
Fix: profile (flamegraph), patch the hotspot, add caching, add timeouts, restart service if urgent.

Many worker processes each consuming CPU

Likely: legitimate traffic spike, misconfigured worker counts, cache miss storm.
Fix: tune web server/PHP-FPM worker counts to match CPU cores and memory, enable opcache (PHP), add object/page cache, put static assets behind CDN.
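
An illustrative PHP-FPM pool sizing sketch (the numbers are placeholders; derive pm.max_children from available RAM divided by the measured per-worker memory):

pm = dynamic
pm.max_children = 20
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10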

High %system or interrupts

Likely: network storm, driver issues, kernel processing (e.g., heavy packet processing).
Fix: offload to firewall/CDN, update NIC drivers, enable hardware interrupts balancing (irqbalance).
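
On most systemd-based distributions, spreading NIC interrupts across cores is a one-liner (assuming the irqbalance package is installed):

sudo systemctl enable --now irqbalance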

High %iowait

Likely: saturated disk I/O causing processes to wait (CPU looks idle but the system is sluggish).
Fix: inspect iostat -x, iotop. Move DB to faster storage, add RAID, increase IOPS, or tune queries/indices.
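
For example (iostat is part of sysstat; iotop usually needs root):

iostat -xz 1 5     # extended per-device stats; watch %util and await
sudo iotop -oPa    # only processes actually doing I/O, per process, accumulated totals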

High steal time (%st)

Likely: host oversubscription in virtualized environment; not our code.
Fix: migrate VM, increase instance size, or contact provider.

DDoS / abusive traffic

Likely: many connections, spike in SYNs, or HTTP floods.
Fix: enable CDN/WAF rate limits, block offending IPs temporarily, scale out, and implement long-term WAF rules.

Step 6 — Short-term containment commands (examples)

Graceful restart of web service (if stuck):

sudo systemctl restart php-fpm
sudo systemctl restart nginx

Pause or limit CPU while keeping service alive:

sudo systemd-run --scope -p CPUQuota=40% -p MemoryMax=500M <command>
# or set property for an existing service:
sudo systemctl set-property php-fpm.service CPUQuota=50%

Kill runaway immediately (only when necessary):

sudo kill -TERM <PID>
sleep 5; sudo kill -KILL <PID>   # escalate if it refuses to die

Block heavy attackers (one-liner; be careful):

sudo iptables -I INPUT -s 1.2.3.4 -j DROP

For web-scale attacks, prefer upstream mitigations (CDN/WAF) over local firewall rules.

Step 7 — Long-term fixes (prevent recurrence)

Profiling and code changes

Use perf/flamegraphs or language profilers to remove hotspots. Flamegraphs are the standard visual tool for CPU hotspots.

Right-size service concurrency

Tune PHP-FPM pm.* settings, Nginx worker_processes/worker_connections, database connection pools to match CPU and memory resources.
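
A quick way to see what Nginx is actually running with, and compare against the core count (nginx -T dumps the effective configuration):

sudo nginx -T 2>/dev/null | grep -E 'worker_processes|worker_connections'
nproc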

Add caching

Page cache (Varnish), object cache (Redis/Memcached), opcode cache (PHP OPcache). Caching reduces CPU by avoiding repeated work.
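
A quick check that OPcache is actually loaded (php -m reflects the CLI config; FPM may load a separate ini):

php -m | grep -i opcache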

Observability and alerts

Export CPU metrics (node_exporter), per-process metrics, and application traces to Prometheus/Grafana or a managed APM (New Relic, Datadog, etc.). Alert on anomalous CPU, load, and %st.

Safeguards

Use systemd resource controls or cgroups to cap known non-critical processes. Use autoscaling for stateless tiers and capacity planning for DB/stateful tiers.

DDoS readiness

Put public sites behind a CDN/WAF, implement rate limiting, and have an incident playbook. Cloud/CDN rate limiting is often the difference between a short incident and full outage.

Tools at our disposal (cheat sheet)

  • Monitoring/metrics: top, htop, mpstat, pidstat, iostat, netstat/ss, sar
  • Profiling: perf, Brendan Gregg’s FlameGraph scripts.
  • Tracing: strace, ltrace, bcc/eBPF tools for advanced tracing
  • Language profilers: PHP-FPM slowlog / xdebug / Blackfire, py-spy for Python, jstack/jcmd for Java, clinic for Node.js
  • Containment: renice, cpulimit, systemd-run/CPUQuota, firewall (iptables/nft), CDN/WAF

Final checklist (what we do, in order)

  • Triage: uptime, top, ps (the first 90 seconds).
  • Confirm cause class: CPU user vs system vs iowait vs steal.
  • Identify process(es) with ps, pidstat, docker stats.
  • Short term: renice/limit/restart or CDN block.
  • Forensic: perf / flamegraph / slowlogs / strace / DB slow query logs.
  • Fix: code/profile/tune configs/caching/scale.
  • Prevent: monitoring, rate limits, WAF/CDN, resource quotas (systemd/cgroups).

Check out robust instant dedicated servers, instant KVM VPS, premium shared hosting and data center services in New Zealand.