Advanced Linux Troubleshooting in Cloud

By Raman Kumar

Updated on Nov 23, 2024

In this tutorial, we'll learn advanced Linux troubleshooting techniques for cloud servers.

In a DevOps and cloud environment, Linux is a core component, and efficient troubleshooting skills are essential for ensuring reliability and uptime. This tutorial covers advanced Linux troubleshooting techniques with practical use cases, scenarios, and detailed commands. The goal is to equip you with the skills and knowledge to diagnose and resolve common and complex Linux-related issues in modern DevOps and cloud environments.

Understanding System Logs and Monitoring Tools

Logs and monitoring tools are the first place to look when encountering an issue. They provide crucial insight into what's happening on the system.

Access and Analyze Logs Using journalctl

journalctl is a command for querying and displaying logs generated by the systemd service manager. It provides detailed information about system activities, services, and user logs, making it indispensable for diagnostics.

Key Commands:

View the entire system log:

journalctl

This command displays all logs maintained by systemd. It's the primary method to see everything happening on the system.

View logs related to a specific service:

journalctl -u nginx.service

Adding the -u flag followed by the service name filters the logs specific to that service. For instance, nginx.service displays logs relevant to the Nginx web server.

Filter logs by severity level:

journalctl -p err

The -p flag followed by a priority (like err for error) limits logs to specific severity levels, helping you quickly find errors.

View the latest logs in real-time:

journalctl -f

The -f option keeps the display open and shows new log entries as they are generated, similar to tail -f.

Example: Debugging a Memory Leak

A web application is slowing down, and you suspect a memory leak. Use journalctl to inspect system logs and identify related memory warnings or errors.

journalctl -xe | grep -i memory

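The -x option augments log entries with explanatory text from the message catalog where available, and -e jumps to the end of the journal, which limits output to the most recent entries. grep -i then keeps only the lines that mention "memory" (case-insensitively), helping pinpoint the issue quickly.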
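A broad match on "memory" can bury the relevant lines among unrelated ones. As a minimal sketch, the filter below narrows log text to lines the kernel's OOM killer typically writes. The function name oom_events and the sample lines are illustrative assumptions, not output captured from a real host, and the exact kernel wording varies between kernel versions.

```shell
# oom_events: keep only lines that look like kernel OOM-killer activity.
# Reads log text on stdin, so it works with journalctl or a saved log file.
oom_events() {
    grep -Ei 'out of memory|oom-killer|killed process'
}

# Illustrative sample lines (assumed, simplified; not from a real system).
sample_log='kernel: myapp invoked oom-killer: gfp_mask=0x100cca, order=0
kernel: Out of memory: Killed process 4242 (myapp)
kernel: eth0: link up'

# Prints the first two sample lines and drops the unrelated third one.
printf '%s\n' "$sample_log" | oom_events

# On a real systemd host (-k limits output to kernel messages):
#   journalctl -k --no-pager | oom_events
```

If this filter returns matches around the time the application slowed down, the process named in those lines is a reasonable first suspect.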

Network Monitoring with iftop and nload

Monitoring network activity is crucial for identifying potential bottlenecks or malicious traffic.

Key Commands:

Monitor bandwidth usage by host:

sudo iftop

iftop displays real-time bandwidth usage per host. Use it to see which hosts are consuming the most bandwidth.

Monitor network traffic per interface:

nload eth0

nload provides a graphical representation of incoming and outgoing network traffic on a specified interface (eth0 in this case).

Example: Diagnosing Network Bandwidth Issues

If the server is experiencing slow network speeds, run iftop to identify which IP addresses are using excessive bandwidth.

sudo iftop -i eth0

The -i option specifies the network interface to monitor. This command is helpful to identify if a specific IP is consuming too much bandwidth, which might indicate a DoS attack or other network problems.

Networking Issues and Resolution

Networking problems can impact services and applications, making it essential to have a reliable set of tools to diagnose them.

Diagnosing Network Connectivity

Basic network tools like ping, traceroute, and mtr are foundational for troubleshooting connectivity.

Key Commands:

Check if a host is reachable:

ping example.com

ping sends ICMP packets to a target host to check if it's reachable. A successful response indicates the target is online and reachable.

Trace the route packets take to a destination:

traceroute example.com

traceroute maps the route packets take to reach a destination. It’s useful for detecting where a network failure occurs along the route.

Diagnose packet loss and latency issues:

mtr example.com

mtr combines ping and traceroute into a single diagnostic tool that provides continuous updates, showing real-time packet loss and latency changes.

Example: Troubleshooting DNS Misconfiguration

If users report that your web application is slow or unreachable, DNS issues might be the cause. Use dig to check DNS settings.

dig example.com

dig retrieves DNS information for the domain example.com. It shows you DNS server responses and helps verify that DNS records are correctly configured.

Analyzing Ports and Services with netstat and ss

These commands help investigate active connections, listening ports, and socket statistics.

Key Commands:

Display all listening ports:

netstat -tuln

-tuln shows TCP (-t) and UDP (-u) listening (-l) ports in numeric form (-n).

Analyze open sockets with ss:

ss -tulnp

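ss is a modern alternative to netstat and is generally faster on systems with many sockets. The -tuln options have the same meaning as in the netstat example above, and -p adds the process that owns each socket (run it with sudo to see processes owned by other users).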

Example: Resolving Service Port Conflicts

If a web server fails to start, there may be a port conflict. Use ss to check which services are using a specific port.

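sudo ss -tulnp 'sport = :80'

This command shows all listening sockets on port 80 together with the processes that own them, helping you identify if another service is occupying the web server’s port. Filtering inside ss avoids the false matches that a plain grep for :80 would produce on ports such as 8080.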
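When the socket list has been saved to a file or is being post-processed in a script, the port can also be matched exactly in awk. The sketch below is one way to do that; the function name port_listeners and the sample rows are assumptions for illustration, and it assumes the column layout ss prints when -t and -u are combined (local address in the fifth column).

```shell
# port_listeners PORT: print rows whose local address (5th column) ends in
# exactly :PORT, so a search for 80 does not match 8080.
port_listeners() {
    awk -v port="$1" '$5 ~ (":" port "$")'
}

# Illustrative rows in the layout of `ss -tulnp` (made-up values).
sample_ss='tcp LISTEN 0 511 0.0.0.0:80 0.0.0.0:* users:(("nginx",pid=1234,fd=6))
tcp LISTEN 0 128 127.0.0.1:8080 0.0.0.0:* users:(("java",pid=2345,fd=9))
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=900,fd=3))'

# Prints only the nginx row.
printf '%s\n' "$sample_ss" | port_listeners 80

# On a real system:
#   sudo ss -tulnp | port_listeners 80
```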

Disk I/O and Filesystem Troubleshooting

Disk I/O problems can cause significant performance degradation. Monitoring tools help identify bottlenecks.

Monitoring Disk Usage with iostat and iotop

Disk monitoring tools provide insights into disk activity and performance.

Key Commands:

Show CPU and disk I/O statistics:

iostat -xz 5

iostat reports CPU and I/O statistics. The -x flag adds extended per-device statistics, -z omits devices that had no activity during the interval, and 5 refreshes the output every 5 seconds.

Monitor top I/O-consuming processes:

sudo iotop

iotop displays a list of processes using the most disk I/O, allowing you to identify and address high disk activity.

Example: Diagnosing High I/O Wait Times

High I/O wait times indicate the CPU is waiting on disk operations. Use iostat to analyze which devices are causing the delay.

iostat -x 1 10

The -x option provides a detailed report for each device, refreshing every second for ten iterations. This data helps locate underperforming disks: look for devices with a consistently high %util value or long wait times.

CPU and Memory Performance Issues

Performance issues often relate to CPU or memory usage, requiring monitoring and analysis tools.

Analyzing CPU and Memory with mpstat and vmstat

These commands give a deeper understanding of CPU and memory behavior.

Key Commands:

View CPU usage per core:

mpstat -P ALL 5

mpstat displays CPU statistics per processor. The -P ALL flag shows details for each CPU core, refreshing every 5 seconds.

Display memory usage statistics:

vmstat 5

vmstat reports memory, CPU, and I/O activity. The 5 argument refreshes the output every 5 seconds.

Example: Debugging High CPU Usage

To investigate a high CPU load, use mpstat to identify which CPU cores are being overutilized.

mpstat -P ALL 1 10

This command refreshes CPU usage every second for 10 intervals. Look for consistently high values in specific cores that may indicate a misbehaving application.
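mpstat is part of the sysstat package and derives its figures from the counters in /proc/stat. On a minimal cloud image where sysstat is not installed, a rough busy percentage can be computed from two samples of those counters. The sketch below is an approximation under that assumption, not a reimplementation of mpstat; the function name cpu_busy_pct and the example counters are made up for illustration.

```shell
# cpu_busy_pct BEFORE AFTER: given two "cpu ..." lines from /proc/stat,
# print the percentage of time between the samples that the CPU was
# neither idle nor waiting on I/O.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2-9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9; i++) total += $i
            idle = $5 + $6                  # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Deterministic example with made-up counters: prints 15.0
cpu_busy_pct "cpu 100 0 50 800 50 0 0 0" "cpu 200 0 100 1500 200 0 0 0"

# On a live Linux system, sample one second apart:
#   a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

The same function works for a single core by passing that core's lines (for example the ones starting with cpu0) instead of the aggregate cpu line.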

Service and Application Debugging

System services and applications often require careful monitoring, especially in a cloud environment.

Investigating Systemd Services with systemctl and journalctl

systemctl and journalctl are essential for managing and troubleshooting systemd services.

Key Commands:

Check the status of a service:

systemctl status apache2.service

This command displays the status of the apache2 service, including whether it is running, stopped, or failed, along with recent log entries.

View service logs:

journalctl -u apache2.service

The -u flag filters logs to show only entries related to apache2, making it easier to identify specific service issues.

Restart a service:

sudo systemctl restart apache2.service

Restarts the apache2 service. If a configuration change was made, this command applies it without rebooting the system.

Example: Troubleshooting a Failed Service

If a service fails to start, examine its status and logs to understand why it’s not working.

systemctl status apache2.service
journalctl -u apache2.service -n 20 --no-pager

This set of commands shows the service's current state and the last 20 log entries, helping you narrow down the problem.

Troubleshooting File Permissions and Ownership Issues

File and directory permissions can cause application errors or security vulnerabilities. Properly managing and diagnosing permission issues is crucial.

Managing File Permissions with chmod, chown, and find

Key Commands:

Change file permissions:

chmod 644 /var/www/html/index.html

chmod changes the permissions of a file. In this example, 644 sets the file to be readable and writable by the owner, and readable by the group and others. This is common for web files that need to be publicly accessible.

Change the owner of a file:

sudo chown www-data:www-data /var/www/html/index.html

chown modifies the owner and group of a file. Here, www-data is set as both the owner and group, which is common for files managed by web servers like Apache or Nginx.

Find files with specific permissions:

find /var/www -type f -perm 777

This command searches for files under /var/www with 777 permissions, which grants read, write, and execute rights to everyone. Files with such permissions can be security risks.

Example: Fixing Broken Permissions in a Web Server

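If a web server displays "403 Forbidden" errors, file permissions or ownership may be misconfigured, typically because the web server user cannot read the file or traverse one of its parent directories. While auditing permissions, also use find to identify overly permissive files; these are a security risk rather than a cause of 403 errors, and a quick "fix" with chmod 777 is a common way they appear.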

find /var/www -type f -perm 777 -exec chmod 644 {} \;

This command finds all files with 777 permissions and changes them to 644, a more secure setting. The -exec flag executes a command (chmod 644) on each found file.
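Running a bulk chmod straight away leaves no record of what was changed. A more cautious sketch lists the matches first, applies the fix, and then verifies the result. The WEBROOT variable and the throwaway directory are assumptions for illustration so the commands can be tried safely; on a real server WEBROOT would point at the document root.

```shell
# Work in a temporary directory so the sketch is safe to run anywhere.
# In practice, set WEBROOT to the real document root (e.g. /var/www).
WEBROOT=$(mktemp -d)
touch "$WEBROOT/index.html" "$WEBROOT/style.css"
chmod 777 "$WEBROOT/index.html"   # simulate an overly permissive file
chmod 640 "$WEBROOT/style.css"    # should be left untouched by the fix

# Step 1: dry run, list what would change.
find "$WEBROOT" -type f -perm 0777 -print

# Step 2: apply the fix; "+" batches many files into one chmod call.
find "$WEBROOT" -type f -perm 0777 -exec chmod 644 {} +

# Step 3: verify; the count of remaining 777 files should be 0.
find "$WEBROOT" -type f -perm 0777 | wc -l
```

Ending -exec with + instead of \; passes many paths to each chmod invocation, which is noticeably faster on large directory trees.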

Identifying and Fixing Memory Leaks

Memory leaks occur when a program fails to release memory, causing the system to run out of memory over time. Tools like top, htop, and valgrind help identify memory-hungry processes.

Analyzing Memory Usage with top and htop

Key Commands:

Monitor real-time system processes and memory usage:

top

top shows system processes, memory, and CPU usage. It’s a default tool for quickly identifying which processes consume the most memory.

Enhanced process monitoring with htop:

htop

htop is similar to top but provides a more user-friendly interface, with color-coding and the ability to filter and search for specific processes. It makes tracking memory leaks easier.

Example: Debugging a Process Consuming Excessive Memory

If a process is causing high memory usage, identify it using htop.

sudo htop

Look for processes with high RES (resident memory) usage. Press F9 in htop to send a signal (such as SIGTERM) to the offending process, then restart it through its service manager if needed.

Diagnosing Memory Leaks with valgrind

valgrind is a sophisticated tool for debugging memory leaks in applications.

Key Commands:

Run a program with memory leak detection:

valgrind --leak-check=yes ./my_application

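This command runs my_application under valgrind with --leak-check=yes, which detects memory leaks and reports the call stack where each leaked block was allocated. Function names with file and line numbers appear only if the application was compiled with debug symbols (for example, with -g).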

Example: Identifying Memory Leaks in a Custom Application

If a custom-built application is consuming memory over time, use valgrind to diagnose leaks.

valgrind --leak-check=full ./my_application

The --leak-check=full option provides a detailed report on memory usage, helping developers fix the issue directly in the code.

Detecting and Resolving Process Crashes

Process crashes can disrupt services, especially in cloud environments where uptime is critical. strace, gdb, and core dumps are valuable tools for debugging.

Tracing System Calls with strace

strace is a powerful diagnostic tool that tracks system calls made by a program, revealing what a process was doing when it crashed.

Key Commands:

Trace a running process:

sudo strace -p <PID>

This command attaches strace to a running process with a specific PID (Process ID). It outputs all system calls the process is making, which is helpful for diagnosing a crash.

Trace a command from the start:

strace -o output.txt ./my_application

This command runs my_application under strace, logging all system calls to output.txt for analysis.

Example: Investigating a Segmentation Fault

If an application crashes with a segmentation fault, use strace to trace its execution.

strace -o debug.log ./my_application

Check the debug.log for the last few system calls before the crash to identify where the problem occurred.
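Reading a long trace by eye is slow, so a short summary of failed calls and fatal signals can point to the right part of the log. The sketch below is one way to produce it; the function name strace_summary and the sample trace are illustrative assumptions, and it assumes a trace written without -f (no PID prefix on each line).

```shell
# strace_summary: from strace output on stdin, print signal/exit marker
# lines and a count of failed system calls grouped by errno name.
strace_summary() {
    awk '
        /^(---|\+\+\+) / { print "signal: " $0 }
        match($0, / = -1 E[A-Z0-9]+/) {
            # " = -1 " is 6 characters; the errno name follows it
            errs[substr($0, RSTART + 6, RLENGTH - 6)]++
        }
        END { for (e in errs) print "failed: " e " x" errs[e] }'
}

# Illustrative trace (made-up paths, in the format strace writes).
sample=$(mktemp)
cat > "$sample" <<'EOF'
openat(AT_FDCWD, "/etc/myapp.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/libfoo.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
read(3, "data", 4096) = 4
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV (core dumped) +++
EOF

strace_summary < "$sample"

# On a real trace:
#   strace_summary < debug.log
```

A missing configuration file or library shortly before the SIGSEGV line is a common lead, although the fault itself happens in user-space code that strace does not show.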

Analyzing Core Dumps with gdb

Core dumps capture the memory state of a process at the time of a crash, useful for post-mortem analysis with gdb (GNU Debugger).

Key Commands:

Enable core dumps:

ulimit -c unlimited

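ulimit -c unlimited removes the core file size limit for the current shell session and the processes started from it, so crashes can produce complete core dumps. The setting does not persist across sessions or apply to services started elsewhere, so run it in the same shell that launches the application.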

Analyze a core dump:

gdb ./my_application core

Opens gdb to analyze the core file, which contains the memory snapshot of my_application at the crash point.

Example: Debugging a Crashed Application

If an application crashes and generates a core dump, analyze it with gdb.

gdb ./my_application core

In gdb, use the bt (backtrace) command to see the stack trace and identify the source of the crash.

Troubleshooting Storage Issues

Disk and storage problems can lead to degraded performance or data loss. Monitoring and diagnostic tools like df, du, ncdu, and lsblk are essential.

Disk Usage Analysis with df and du

Key Commands:

Check available disk space:

df -h

df -h displays disk space usage in a human-readable format, showing the size, used space, and available space of each mounted filesystem.

Analyze directory disk usage:

du -sh /var/log

du -sh shows the total size of a directory. Here, /var/log might contain large log files filling up the disk.

Interactive disk usage analysis with ncdu:

ncdu /

ncdu provides a text-based, interactive view of disk usage, allowing you to navigate through directories and identify large files.
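For unattended checks, the df output can be reduced to the filesystems that are close to full. The sketch below assumes the POSIX output layout of df -P (usage percentage in the fifth column, mount point in the sixth) and mount points without spaces; the function name fs_over_threshold, the 90% limit, and the sample numbers are illustrative choices.

```shell
# fs_over_threshold LIMIT: read `df -P` output on stdin and print every
# filesystem whose usage is at or above LIMIT percent.
fs_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)              # "93%" -> "93"
        if (use + 0 >= limit + 0) print $6 " is " use "% full"
    }'
}

# Illustrative input in `df -P` layout (made-up numbers).
sample_df='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 41152736 38271045 2881691 93% /
tmpfs 1024000 1024 1022976 1% /run'

# Prints: / is 93% full
printf '%s\n' "$sample_df" | fs_over_threshold 90

# On a real system:
#   df -P | fs_over_threshold 90
```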

Example: Freeing Up Disk Space

If a server is running out of disk space, analyze the usage in /var to identify large files.

du -ah /var | sort -rh | head -10

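This command lists the 10 largest files and directories under /var, helping you find data to delete or move. Because directory totals are included, the first entries are usually /var itself and its largest subdirectories.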
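Since du -a reports directories alongside files, a list restricted to regular files is sometimes easier to act on. The sketch below uses GNU find for that; the function name largest_files and the demo directory are assumptions for illustration, and -printf is a GNU find extension that may be missing on non-GNU systems.

```shell
# largest_files DIR [N]: list the N largest regular files under DIR
# (default 10) as "<bytes> <path>", biggest first. Requires GNU find.
# -xdev keeps the search on the filesystem that DIR lives on.
largest_files() {
    find "$1" -xdev -type f -printf '%s %p\n' 2>/dev/null |
        sort -rn | head -n "${2:-10}"
}

# Demo in a throwaway directory with one large and one small file.
demo=$(mktemp -d)
printf '%3000s' ' ' > "$demo/big.log"     # 3000 bytes
printf '%20s'   ' ' > "$demo/small.log"   # 20 bytes

# Prints the 3000-byte file only.
largest_files "$demo" 1

# On a server (run as root to include files other users own):
#   largest_files /var 10
```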

Managing Storage Devices with lsblk and smartctl

lsblk and smartctl provide insights into physical disks and their health.

Key Commands:

List block devices and their partitions:

lsblk

lsblk shows information about block devices like hard drives and their partitions. It’s useful for understanding how disks are organized.

Check the health of a hard drive:

sudo smartctl -a /dev/sda

smartctl checks the health of a drive using S.M.A.R.T. data. It provides detailed information about the disk’s condition, indicating potential failures.

Example: Troubleshooting a Failing Hard Drive

If you suspect a drive is failing, use smartctl to verify its health.

sudo smartctl -a /dev/sda | grep -i "reallocated"

This command checks for reallocated sectors, which can indicate a failing drive. High numbers of reallocated sectors suggest it’s time to replace the disk.
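To act on the value in a script rather than read it by eye, the raw count can be extracted from the attribute table. The sketch below assumes the ATA attribute layout that smartctl -A prints, where the raw value is the tenth column; NVMe drives report health differently and have no such attribute. The function name reallocated_count and the sample row are illustrative.

```shell
# reallocated_count: read `smartctl -A` attribute rows on stdin and print
# the raw value (10th column) of Reallocated_Sector_Ct, if present.
reallocated_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Illustrative attribute row (made-up values).
sample_row='  5 Reallocated_Sector_Ct   0x0033   095   095   010    Pre-fail  Always       -       24'

count=$(printf '%s\n' "$sample_row" | reallocated_count)
if [ "${count:-0}" -gt 0 ]; then
    echo "WARNING: $count reallocated sectors"
else
    echo "OK: no reallocated sectors reported"
fi

# On a real system:
#   sudo smartctl -A /dev/sda | reallocated_count
```

A count that keeps growing between checks is usually a stronger warning sign than a small, stable number.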

Kernel and Driver Troubleshooting

Kernel-related issues can cause crashes, hardware incompatibility, or degraded performance. Tools like dmesg, modprobe, and lsmod help diagnose kernel problems.

Analyzing Kernel Messages with dmesg

dmesg displays the kernel’s message buffer, showing logs related to hardware and system events.

Key Commands:

View the kernel message buffer:

dmesg | less

Displays recent kernel messages. less allows you to scroll through the output for easier reading.

Filter messages by device:

dmesg | grep eth0

Filters dmesg output for messages related to eth0, useful for diagnosing network interface issues.

Example: Diagnosing a Network Interface Problem

If a network interface isn’t working, check kernel messages for errors.

dmesg | grep -i "network"

This command searches for network-related errors, potentially pointing to driver or hardware issues.

Managing Kernel Modules with lsmod and modprobe

Kernel modules are drivers that extend kernel functionality. Managing them can resolve hardware compatibility issues.

Key Commands:

List loaded kernel modules:

lsmod

Shows currently loaded kernel modules. Useful for checking if a required driver is active.

Load a kernel module:

sudo modprobe <module_name>

modprobe loads a kernel module, such as modprobe e1000e for an Intel network driver.

Remove a kernel module:

sudo modprobe -r <module_name>

Unloads a kernel module. This can resolve conflicts with faulty or redundant drivers.

Example: Loading a Missing Network Driver

If a network interface isn’t detected, manually load the required driver.

sudo modprobe e1000e

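If the network interface works after loading the driver, make the change persistent so the module loads automatically at boot: add its name to /etc/modules on Debian-based systems, or to a .conf file under /etc/modules-load.d/ on systemd-based distributions.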

By following this guide, you can enhance your Linux troubleshooting skills, effectively handle complex issues in a DevOps or cloud environment, and maintain system reliability. Use these tools and techniques to diagnose, analyze, and resolve issues systematically. Keep this guide handy for real-time troubleshooting and refer back to the examples for practical scenarios!
