Diagnosing and Fixing High CPU Usage

By Raman Kumar

Updated on Sep 22, 2025

In this tutorial, we're learning how to diagnose and fix high CPU usage: a practical guide for busy web hosts.

When a server suddenly starts chewing CPU like it’s training for a contest, we need a fast, repeatable workflow that finds the culprit and calms things down. This guide walks our ops/dev team through a step-by-step troubleshooting routine: fast triage, isolation, evidence collection, short-term mitigation, and long-term fixes. No fluff. Just the commands, what to look for, and what to do next.

Check overall load and uptime

uptime
cat /proc/loadavg

If load average is much higher than the number of CPU cores, the system runqueue is long — that’s a red flag.
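
As a quick sanity check, put the two numbers side by side (a minimal sketch; thresholds depend on the workload):

nproc                            # number of CPU cores
cut -d' ' -f1-3 /proc/loadavg    # 1-, 5- and 15-minute load averages
# rule of thumb: a sustained 1-minute load well above the core count means the runqueue is backed up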

See live CPU usage and which processes dominate:

top -o %CPU        # or htop for interactive view
ps aux --sort=-%cpu | head -n 10

top shows %user, %system, %iowait and %st (steal). High %iowait means I/O, not CPU. High %st usually means VM hypervisor contention.

CPU usage per core (quick samples):

mpstat -P ALL 1 3   # from sysstat: per-core usage, three 1-second samples

If the server is forking processes uncontrollably or the load is dominated by kernel work and interrupts, proceed to the isolation steps below.

Step 1 — Identify the offending process

List top CPU consumers (persistent view):

ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -n 15

Track per-process CPU over time (helps separate spikes from steady usage):

pidstat -u -p ALL 1 10   # per-process CPU, 1-second samples, 10 iterations

If the workload is containerized:

docker stats --no-stream
docker top <container>

Key signs:

  • One PID at 90% CPU → single runaway process (can be killed/limited).
  • Many PIDs at small CPU → possible cron storms, threaded app, or I/O/lock issues.
  • High kernel % (top shows %system) → kernel work: interrupts, network stack, or filesystem.

Step 2 — Is it CPU work or waiting on I/O / network / steal?

Use top/mpstat to cross-check:

  • High %user → userland code (app, interpreter, JIT).
  • High %system → kernel activity (network, interrupts).
  • High %iowait → disk I/O bottleneck — CPU appears idle waiting for I/O.
  • High %st (steal) → hypervisor is starving our VM of CPU time; move/resize VM.

Also inspect interrupts and softirqs:

cat /proc/interrupts
watch -n1 cat /proc/softirqs

Excessive interrupts on a NIC can point to noisy network traffic or DDoS.
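
To check whether a single NIC queue is dominating, something like this works (assumes the interface is named eth0; substitute the real device name):

watch -n1 "grep -E 'CPU|eth0' /proc/interrupts"   # per-core interrupt counts for the NIC queues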

Step 3 — Quick mitigation 

When traffic is crashing the host, act fast:

If one process is runaway, reduce its priority immediately:

sudo renice +19 -p <PID>

Temporarily limit CPU use (soft cap):

sudo cpulimit --pid <PID> --limit 40 --background

cpulimit pauses and resumes the process to keep its average CPU under the limit. For a more robust cap, use systemd/cgroups CPUQuota (see Step 6).

If traffic or a DDoS is suspected, inspect connection counts:

ss -Htn state established '( sport = :http or sport = :https )' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head
# counts established connections per client IP; adjust the awk column ($4 vs $5) if your ss version drops the State column when filtering by state

Apply immediate IP blocks / rate-limits (short term): use firewall rules (nftables/iptables) or a WAF/CDN like Cloudflare to absorb volumetric attacks. Cloud/CDN rate limiting and WAF rules are primary defenses for large DDoS.
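
As a stopgap on the host itself, a per-source connection cap can take the edge off; a sketch only, with an arbitrary threshold of 50 and assuming the connlimit match is available:

sudo iptables -I INPUT -p tcp --syn --dport 443 -m connlimit --connlimit-above 50 -j REJECT --reject-with tcp-reset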

If VM steal time is high

Contact the host provider or migrate the VM. High %st means the physical host is overcommitted; the fix is a larger instance or a move to a less contended physical host.
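
To confirm the steal is sustained rather than a blip, sample over a longer window with sysstat's sar (assuming the package is installed):

sar -u 1 30    # watch the %steal column; values persistently above a few percent point to host contention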

Step 4 — Collect forensic evidence

Snapshot CPU profile for the process (sampling profiler):

For native C/C++/Go/Python/Java processes: use perf to see where CPU cycles go:

sudo perf top -p <PID>          # live, quick look
sudo perf record -F 99 -p <PID> -g -- sleep 60
sudo perf script > out.perf

# convert to flamegraph (FlameGraph scripts from Brendan Gregg)

Flame graphs expose hot call paths and are the fastest way to find the hotspots that matter.
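
A rough sketch of the conversion, assuming Brendan Gregg's FlameGraph repository is cloned alongside the capture:

git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flame.svg   # open the SVG in a browser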

For interpreted languages:

PHP: enable the FPM slowlog and check backtraces of slow requests (request_slowlog_timeout + slowlog). The PHP manual covers the FPM pool directives. Use the slowlog before killing processes so we can fix the root cause.

Example FPM pool config:

request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/www-slow.log

Node.js: use 0x or clinic.js to capture flame graphs, or attach a profiler via node --inspect.

Java: jstack, jcmd, and profilers (async-profiler).
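
For example (illustrative only; server.js is a placeholder, and 0x is installed separately via npm):

0x -- node server.js                          # Node.js: writes a flame graph when the process exits
jstack <PID> > /tmp/jstack.$(date +%s).txt    # Java: thread dump; repeat a few times to spot spinning threads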

When a process is spinning on syscalls, trace it (note: strace adds overhead on a busy process, so keep the window short):

sudo strace -ttt -p <PID> -o /tmp/strace.pid.log

# watch for repeated syscalls (poll, futex, write, accept)

strace points to blocking syscalls and can quickly reveal tight loops or repeated network calls.

DB servers: check DB processlists and slow logs (MySQL SHOW PROCESSLIST;, Postgres pg_stat_activity) — sometimes CPU is from expensive queries.
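
A quick look from the shell, assuming local access and sufficient privileges (the Postgres query is a generic example):

mysql -e "SHOW FULL PROCESSLIST;"
psql -c "SELECT pid, state, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC;"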

Step 5 — Interpret the evidence, common causes and fixes

Match the symptom to likely causes and the pragmatic fix:

Single interpreted process high %user

Likely: inefficient code, infinite loop, expensive regex, unbounded CPU work.
Fix: profile (flamegraph), patch the hotspot, add caching, add timeouts, restart service if urgent.

Many worker processes each consuming CPU

Likely: legitimate traffic spike, misconfigured worker counts, cache miss storm.
Fix: tune web server/PHP-FPM worker counts to match CPU cores and memory, enable opcache (PHP), add object/page cache, put static assets behind CDN.
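
An illustrative PHP-FPM pool sizing sketch (the numbers are placeholders; derive pm.max_children from available RAM divided by the measured per-worker memory):

pm = dynamic
pm.max_children = 20
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 10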

High %system or interrupts

Likely: network storm, driver issues, kernel processing (e.g., heavy packet processing).
Fix: offload to firewall/CDN, update NIC drivers, enable hardware interrupts balancing (irqbalance).
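
On most systemd-based distributions, spreading NIC interrupts across cores is a one-liner (assuming the irqbalance package is installed):

sudo systemctl enable --now irqbalance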

High %iowait

Likely: saturated disk I/O causing processes to wait (CPU looks idle but the system is sluggish).
Fix: inspect iostat -x, iotop. Move DB to faster storage, add RAID, increase IOPS, or tune queries/indices.
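
For example (iostat is part of sysstat; iotop usually needs root):

iostat -xz 1 5     # extended per-device stats; watch %util and await
sudo iotop -oPa    # only processes actually doing I/O, per process, accumulated totals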

High steal time (%st)

Likely: host oversubscription in virtualized environment; not our code.
Fix: migrate VM, increase instance size, or contact provider.

DDoS / abusive traffic

Likely: many connections, spike in SYNs, or HTTP floods.
Fix: enable CDN/WAF rate limits, block offending IPs temporarily, scale out, and implement long-term WAF rules.

Step 6 — Short-term containment commands (examples)

Graceful restart of web service (if stuck):

sudo systemctl restart php-fpm
sudo systemctl restart nginx

Pause or limit CPU while keeping service alive:

sudo systemd-run --scope -p CPUQuota=40% -p MemoryMax=500M <command>
# or set property for an existing service:
sudo systemctl set-property php-fpm.service CPUQuota=50%

Kill runaway immediately (only when necessary):

sudo kill -TERM <PID>
sleep 5; sudo kill -KILL <PID>   # escalate if it refuses to die

Block heavy attackers (one-liner; be careful):

sudo iptables -I INPUT -s 1.2.3.4 -j DROP

For web-scale attacks, prefer upstream mitigations (CDN/WAF) over local firewall rules.

Step 7 — Long-term fixes (prevent recurrence)

Profiling and code changes

Use perf/flamegraphs or language profilers to remove hotspots. Flamegraphs are the standard visual tool for CPU hotspots.

Right-size service concurrency

Tune PHP-FPM pm.* settings, Nginx worker_processes/worker_connections, database connection pools to match CPU and memory resources.
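
A quick way to see what Nginx is actually running with, and compare against the core count (nginx -T dumps the effective configuration):

sudo nginx -T 2>/dev/null | grep -E 'worker_processes|worker_connections'
nproc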

Add caching

Page cache (Varnish), object cache (Redis/Memcached), opcode cache (PHP OPcache). Caching reduces CPU by avoiding repeated work.
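
A quick check that OPcache is actually loaded (php -m reflects the CLI config; FPM may load a separate ini):

php -m | grep -i opcache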

Observability and alerts

Export CPU metrics (node_exporter), per-process metrics, and application traces to Prometheus/Grafana or a managed APM (New Relic, Datadog, etc.). Alert on anomalous CPU, load, and %st.

Safeguards

Use systemd resource controls or cgroups to cap known non-critical processes. Use autoscaling for stateless tiers and capacity planning for DB/stateful tiers.

DDoS readiness

Put public sites behind a CDN/WAF, implement rate limiting, and have an incident playbook. Cloud/CDN rate limiting is often the difference between a short incident and full outage.

Tools at our disposal (cheat sheet)

  • Monitoring/metrics: top, htop, mpstat, pidstat, iostat, netstat/ss, sar
  • Profiling: perf, Brendan Gregg’s FlameGraph scripts.
  • Tracing: strace, ltrace, bcc/eBPF tools for advanced tracing
  • Language profilers: PHP-FPM slowlog / xdebug / Blackfire, py-spy for Python, jstack/jcmd for Java, clinic for Node.js
  • Containment: renice, cpulimit, systemd-run/CPUQuota, firewall (iptables/nft), CDN/WAF

Final checklist (what we do, in order)

  • Triage: uptime, top, ps (the first 90 seconds).
  • Confirm cause class: CPU user vs system vs iowait vs steal.
  • Identify process(es) with ps, pidstat, docker stats.
  • Short term: renice/limit/restart or CDN block.
  • Forensic: perf / flamegraph / slowlogs / strace / DB slow query logs.
  • Fix: code/profile/tune configs/caching/scale.
  • Prevent: monitoring, rate limits, WAF/CDN, resource quotas (systemd/cgroups).

Check out robust instant dedicated servers, instant KVM VPS, premium shared hosting and data center services in New Zealand.