Diagnosing and Fixing High CPU Usage: A Practical Guide for Busy Web Hosts
When a server suddenly starts chewing CPU like it’s training for a contest, we need a fast, repeatable workflow that finds the culprit and calms things down. This guide walks our ops/dev team through a step-by-step troubleshooting routine: fast triage, isolation, evidence collection, short-term mitigation, and long-term fixes. No fluff. Just the commands, what to look for, and what to do next.
Check overall load and uptime
uptime
cat /proc/loadavg
If load average is much higher than the number of CPU cores, the system runqueue is long — that’s a red flag.
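As a scripted version of that rule of thumb (a sketch; nproc and /proc/loadavg are standard on Linux):

```shell
# Compare the 1-minute load average against the core count; warn when
# the runqueue is longer than the CPUs can drain.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "WARNING: load $load1 exceeds $cores cores"
else
  echo "OK: load $load1 within $cores cores"
fi
```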
See live CPU usage and which processes dominate:
top -o %CPU # or htop for interactive view
ps aux --sort=-%cpu | head -n 10
top shows %user, %system, %iowait and %st (steal). High %iowait means I/O, not CPU. High %st usually means VM hypervisor contention.
CPU per core and quick histograms:
mpstat -P ALL 1 3
From the sysstat package: shows per-core usage samples (here, one per second for three seconds)
If the server is forking a million processes or the load is dominated by kernel work/interrupts, proceed to the isolation steps below.
Step 1 — Identify the offending process
List top CPU consumers (persistent view):
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 15
Track per-process CPU over time (helps separate spikes from steady usage):
pidstat -u -p ALL 1 10 # shows CPU usage for processes every second
If the workload is containerized:
docker stats --no-stream
docker top <container>
Key signs:
- One PID at 90% CPU → single runaway process (can be killed/limited).
- Many PIDs at small CPU → possible cron storms, threaded app, or I/O/lock issues.
- High kernel % (top shows %system) → kernel work: interrupts, network stack, or filesystem.
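To tell these patterns apart at a glance, we can aggregate CPU by command name; a sketch using only standard ps and awk:

```shell
# Sum %CPU per command name: one big entry suggests a runaway process,
# many small entries under one name suggests a worker or cron storm.
ps -eo comm,%cpu --no-headers \
  | awk '{ cpu[$1] += $2; n[$1]++ }
         END { for (c in cpu) printf "%-20s %6.1f%%  (%d procs)\n", c, cpu[c], n[c] }' \
  | sort -k2 -nr | head
```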
Step 2 — Is it CPU work or waiting on I/O / network / steal?
Use top/mpstat to cross-check:
- High %user → userland code (app, interpreter, JIT).
- High %system → kernel activity (network, interrupts).
- High %iowait → disk I/O bottleneck; the CPU appears idle while waiting for I/O.
- High %st (steal) → hypervisor is starving our VM of CPU time; move/resize the VM.
Also inspect interrupts and softirqs:
cat /proc/interrupts
watch -n1 cat /proc/softirqs
Excessive interrupts on a NIC can point to noisy network traffic or DDoS.
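To see which softirq types are actually climbing, a one-second delta is easier to read than the raw counters. A sketch that sums the per-CPU columns of /proc/softirqs:

```shell
# Sample /proc/softirqs twice, 1s apart, and print the per-type rate.
# NET_RX climbing fast points at network load; TIMER/SCHED is normal
# background activity.
a=$(mktemp); b=$(mktemp)
cat /proc/softirqs > "$a"; sleep 1; cat /proc/softirqs > "$b"
awk 'NR==FNR { if (FNR > 1) { s=0; for (i=2; i<=NF; i++) s+=$i; base[$1]=s }
               next }
     FNR > 1 { s=0; for (i=2; i<=NF; i++) s+=$i;
               printf "%-10s %d/s\n", $1, s - base[$1] }' "$a" "$b" \
  | sort -k2 -nr | head -5
rm -f "$a" "$b"
```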
Step 3 — Quick mitigation
When traffic is crashing the host, act fast:
If one process is runaway, reduce its priority immediately:
sudo renice +19 -p <PID>
Temporarily limit CPU use (soft cap):
sudo cpulimit --pid <PID> --limit 40 --background
cpulimit pauses and resumes the process to keep its average CPU under the limit. For a more robust cap, use systemd/cgroups CPUQuota.
If traffic or a DDoS is suspected, inspect connection counts:
ss -Htn state established '( sport = :http or sport = :https )' | awk '{print $4}' | cut -d: -f1 | sort | uniq -c | sort -nr | head
Apply immediate IP blocks / rate-limits (short term): use firewall rules (nftables/iptables) or a WAF/CDN like Cloudflare to absorb volumetric attacks. Cloud/CDN rate limiting and WAF rules are primary defenses for large DDoS.
If VM steal time is high
Contact the host provider or migrate the VM. High %st means the physical host is overcommitted; increasing vCPUs or moving to a different physical host is the fix.
Step 4 — Collect forensic evidence
Snapshot CPU profile for the process (sampling profiler):
For native C/C++/Go/Python/Java processes: use perf to see where CPU cycles go:
sudo perf top -p <PID> # live, quick look
sudo perf record -F 99 -p <PID> -g -- sleep 60
sudo perf script > out.perf
# convert to flamegraph (FlameGraph scripts from Brendan Gregg)
Flame graphs expose hot call paths and are the fastest way to find the hotspots that matter.
For interpreted languages:
PHP: enable the FPM slowlog and check backtraces of slow requests (request_slowlog_timeout + slowlog). The PHP manual covers FPM pool directives. Use the slowlog before killing processes so we can fix the root cause.
Example FPM pool config:
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/www-slow.log
Node.js: use 0x or clinic flame for flamegraphs, or profile via node --inspect.
Java: jstack, jcmd, and profilers (async-profiler).
When a process is spinning with syscalls: trace it (non-invasive):
sudo strace -ttt -p <PID> -o /tmp/strace.pid.log
# watch for repeated syscalls (poll, futex, write, accept)
strace points to blocking syscalls and can quickly reveal tight loops or repeated network calls.
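Once a log exists, a frequency count shows which syscall dominates. A sketch that parses the -ttt log captured above (same path as the strace command; adjust as needed):

```shell
# Each -ttt line starts "<timestamp> <syscall>(args...) = ret"; strip
# the timestamp, keep the syscall name, and count repeats.
log=/tmp/strace.pid.log    # written by the strace command above
if [ -f "$log" ]; then
  awk -F'(' '{ sub(/.* /, "", $1); print $1 }' "$log" \
    | sort | uniq -c | sort -nr | head
fi
```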
DB servers: check process lists and slow logs (MySQL SHOW PROCESSLIST, Postgres pg_stat_activity); sometimes the CPU burn is expensive queries.
Step 5 — Interpret the evidence, common causes and fixes
Match the symptom to likely causes and the pragmatic fix:
Single interpreted process high %user
Likely: inefficient code, infinite loop, expensive regex, unbounded CPU work.
Fix: profile (flamegraph), patch the hotspot, add caching, add timeouts, restart service if urgent.
Many worker processes each consuming CPU
Likely: legitimate traffic spike, misconfigured worker counts, cache miss storm.
Fix: tune web server/PHP-FPM worker counts to match CPU cores and memory, enable opcache (PHP), add object/page cache, put static assets behind CDN.
High %system or interrupts
Likely: network storm, driver issues, kernel processing (e.g., heavy packet processing).
Fix: offload to firewall/CDN, update NIC drivers, enable hardware interrupt balancing (irqbalance).
High %iowait
Likely: saturated disk IO causing processes to wait (looks like low CPU but system sluggish).
Fix: inspect iostat -x, iotop. Move DB to faster storage, add RAID, increase IOPS, or tune queries/indices.
High steal time (%st)
Likely: host oversubscription in virtualized environment; not our code.
Fix: migrate VM, increase instance size, or contact provider.
DDoS / abusive traffic
Likely: many connections, spike in SYNs, or HTTP floods.
Fix: enable CDN/WAF rate limits, block offending IPs temporarily, scale out, and implement long-term WAF rules.
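For the %iowait case, if sysstat is not installed we can approximate per-device utilization straight from /proc/diskstats. A sketch, assuming the standard diskstats layout where field 13 is milliseconds spent doing I/O:

```shell
# Two samples 1s apart; delta of io_ticks (ms) over a 1000ms window
# gives a rough %util per device, like iostat's %util column.
a=$(mktemp); b=$(mktemp)
cat /proc/diskstats > "$a"; sleep 1; cat /proc/diskstats > "$b"
awk 'NR==FNR { base[$3] = $13; next }
     { printf "%-12s %5.1f%%\n", $3, ($13 - base[$3]) / 10 }' "$a" "$b" \
  | sort -k2 -nr | head -5
rm -f "$a" "$b"
```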
Step 6 — Short-term containment commands (examples)
Graceful restart of web service (if stuck):
sudo systemctl restart php-fpm
sudo systemctl restart nginx
Pause or limit CPU while keeping service alive:
sudo systemd-run --scope -p CPUQuota=40% -p MemoryMax=500M <command>
# or set property for an existing service:
sudo systemctl set-property php-fpm.service CPUQuota=50%
Kill runaway immediately (only when necessary):
sudo kill -TERM <PID>
sleep 5; sudo kill -KILL <PID> # escalate if it refuses to die
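The same escalation can be wrapped in a small helper that only sends KILL if the process is genuinely stuck; kill_gracefully is our own name, not a standard tool (run it as root for processes we don't own):

```shell
# Send TERM, poll for exit, and escalate to KILL only as a last resort.
kill_gracefully() {
  pid="$1"
  kill -TERM "$pid" 2>/dev/null
  for _ in 1 2 3 4 5; do
    kill -0 "$pid" 2>/dev/null || return 0   # already gone
    sleep 1
  done
  kill -KILL "$pid" 2>/dev/null              # refused to die
}
```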
Block heavy attackers (one-liner; be careful):
sudo iptables -I INPUT -s 1.2.3.4 -j DROP
For web-scale attacks, prefer upstream mitigations (CDN/WAF) over local firewall rules.
Step 7 — Long-term fixes (prevent recurrence)
Profiling and code changes
Use perf/flamegraphs or language profilers to remove hotspots; flame graphs are the standard visual tool for finding where CPU time goes.
Right-size service concurrency
Tune PHP-FPM pm.* settings, Nginx worker_processes/worker_connections, database connection pools to match CPU and memory resources.
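A common sizing heuristic: pm.max_children ≈ available RAM divided by average per-worker resident memory, minus headroom. A sketch that measures worker RSS with plain ps (the php-fpm process name is an assumption; substitute the actual pool binary):

```shell
# Average RSS per php-fpm worker in MB ($6 in `ps aux` is RSS in KB);
# the [p] in the pattern stops awk from matching its own command line.
ps aux | awk '/[p]hp-fpm/ { sum += $6; n++ }
              END { if (n) printf "avg worker RSS: %.0f MB over %d workers\n", sum/n/1024, n;
                    else print "no php-fpm workers found" }'
```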
Add caching
Page cache (Varnish), object cache (Redis/Memcached), opcode cache (PHP OPcache). Caching reduces CPU by avoiding repeated work.
Observability and alerts
Export CPU metrics (node_exporter), per-process metrics, and application traces to Prometheus/Grafana or a managed APM (New Relic, Datadog, etc.). Alert on anomalous CPU, load, and %st.
Safeguards
Use systemd resource controls or cgroups to cap known non-critical processes. Use autoscaling for stateless tiers and capacity planning for DB/stateful tiers.
DDoS readiness
Put public sites behind a CDN/WAF, implement rate limiting, and have an incident playbook. Cloud/CDN rate limiting is often the difference between a short incident and full outage.
Tools at our disposal (cheat sheet)
- Monitoring/metrics: top, htop, mpstat, pidstat, iostat, netstat/ss, sar
- Profiling: perf, Brendan Gregg’s FlameGraph scripts.
- Tracing: strace, ltrace, bcc/eBPF tools for advanced tracing
- Language profilers: PHP-FPM slowlog / xdebug / Blackfire, py-spy for Python, jstack/jcmd for Java, clinic for Node.js
- Containment: renice, cpulimit, systemd-run/CPUQuota, firewall (iptables/nft), CDN/WAF
Final checklist (what we do, in order)
- Triage: uptime, top, ps (90s).
- Confirm cause class: CPU user vs system vs iowait vs steal.
- Identify process(es) with ps, pidstat, docker stats.
- Short term: renice/limit/restart or CDN block.
- Forensic: perf / flamegraph / slowlogs / strace / DB slow query logs.
- Fix: code/profile/tune configs/caching/scale.
- Prevent: monitoring, rate limits, WAF/CDN, resource quotas (systemd/cgroups).