Why Infrastructure Monitoring Matters More Than Ever
Production systems fail in creative ways. A database connection pool exhausts itself. Memory usage slowly creeps upward over weeks. Network latency spikes during peak hours. Without proper infrastructure monitoring, you discover these issues when users start complaining.
Modern applications demand observability that goes beyond simple uptime checks. You need metrics that reveal performance trends, resource utilization patterns, and early warning signs of system stress. Hostperl VPS hosting provides the foundation, but the monitoring stack determines how quickly you spot and resolve issues.
The Prometheus and Grafana combination has become the standard for infrastructure monitoring. Prometheus collects and stores time-series metrics. Grafana visualizes them through customizable dashboards. Together, they create a powerful observability platform that scales from single servers to complex distributed systems.
Setting Up Prometheus for Metrics Collection
Prometheus operates on a pull model, scraping metrics from configured targets at regular intervals. This approach scales better than push-based systems because Prometheus controls the collection rate and can detect when services become unavailable.
Start by creating a dedicated monitoring user:
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
Download and install Prometheus:
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo cp prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
Create the main configuration file at /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
This configuration scrapes Prometheus itself and a Node Exporter instance every 15 seconds. The short interval provides granular data for performance analysis.
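Before starting the service, it is worth validating the file with promtool, which was installed alongside the Prometheus binary. It catches YAML and configuration errors that would otherwise only surface as a failed restart:

```shell
# Validate prometheus.yml syntax before (re)starting the service
promtool check config /etc/prometheus/prometheus.yml
```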
Deploying Node Exporter for System Metrics
Node Exporter collects hardware and operating system metrics from Unix systems. It exports CPU usage, memory consumption, disk I/O, network statistics, and filesystem utilization.
Install Node Exporter on each system you want to monitor:
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvf node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter
Create a systemd service file at /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
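The enable/start commands below also manage a prometheus service, which needs its own unit file. A minimal sketch at /etc/systemd/system/prometheus.service, assuming the binary and directory locations created earlier:

```ini
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```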
Enable and start the services:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter prometheus
sudo systemctl start node_exporter prometheus
Node Exporter provides hundreds of metrics out of the box. Key metrics include node_cpu_seconds_total for CPU usage, node_memory_MemAvailable_bytes for available memory, and node_filesystem_avail_bytes for disk space.
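Once the service is running, a quick spot-check confirms the endpoint is serving data; the metric name here is standard Node Exporter output:

```shell
# Confirm Node Exporter is up and exposing CPU metrics
curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head -n 3
```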
Installing and Configuring Grafana
Grafana transforms raw metrics into visual dashboards that reveal trends and anomalies at a glance. Its query builder makes it accessible to team members who don't write PromQL regularly.
Add the Grafana repository and install:
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
sudo apt-get update
sudo apt-get install grafana
Configure Grafana in /etc/grafana/grafana.ini. Key settings include the HTTP port, security options, and database configuration:
[server]
http_port = 3000
domain = your-domain.com
root_url = https://your-domain.com/
[security]
secret_key = your-secret-key-here
admin_user = admin
admin_password = secure-password
[database]
type = sqlite3
path = grafana.db
Start Grafana and enable it to start on boot:
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
Access Grafana at http://your-server:3000 and log in with your admin credentials. The first task is adding Prometheus as a data source.
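You can add the data source through the web UI, or provision it from a file so the setup is reproducible across rebuilds. A sketch at /etc/grafana/provisioning/datasources/prometheus.yml, assuming Prometheus runs on the same host:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```

Grafana loads provisioning files at startup, so restart grafana-server after adding one.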
Creating Effective Infrastructure Dashboards
Effective dashboards tell a story about system health. They should answer common questions: Is the system overloaded? Are we running out of resources? What changed recently?
Start with a system overview dashboard that displays:
- CPU utilization across all cores
- Memory usage and available memory
- Disk I/O operations per second
- Network throughput
- Load average
Use the following PromQL queries for key metrics:
# CPU Usage Percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage Percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk Usage Percentage
(1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100
# Network Receive Rate
rate(node_network_receive_bytes_total[5m])
Organize panels logically. Place the most critical metrics at the top. Use consistent time ranges and refresh intervals across related dashboards.
For database servers, add panels for connection counts, query execution time, and cache hit rates. Web servers need request rates, response times, and error rates. A systematic approach to building out the production monitoring stack helps keep these dashboards consistent.
Implementing Alerting Rules
Dashboards show you what's happening now. Alerting rules notify you when conditions require attention. Prometheus evaluates alerting rules at regular intervals and sends notifications through Alertmanager.
Create a rules directory and define alerts in /etc/prometheus/rules/alerts.yml:
groups:
  - name: infrastructure.rules
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is {{ $value }}% for more than 5 minutes"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is {{ $value }}% for more than 3 minutes"

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low"
          description: "Disk usage is {{ $value }}% on {{ $labels.device }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
The for clause prevents transient spikes from triggering alerts. Set different thresholds based on your infrastructure's normal operating ranges.
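Rule files have their own validator; running it after every change catches YAML and PromQL mistakes before Prometheus reloads:

```shell
# Validate alerting rules before reloading Prometheus
promtool check rules /etc/prometheus/rules/alerts.yml
```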
Install and configure Alertmanager to handle notifications:
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xvf alertmanager-0.25.0.linux-amd64.tar.gz
sudo cp alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
sudo mkdir /etc/alertmanager
sudo chown prometheus:prometheus /usr/local/bin/alertmanager /etc/alertmanager
Configure Alertmanager in /etc/alertmanager/alertmanager.yml to send notifications via email, Slack, or other channels.
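As a starting point, a minimal alertmanager.yml that routes everything to a single Slack channel might look like the sketch below; the webhook URL and channel name are placeholders you would replace with your own:

```yaml
route:
  receiver: 'team-alerts'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'team-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        send_resolved: true
```

Grouping by alertname and instance, together with repeat_interval, keeps one incident from generating a flood of duplicate notifications.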
Monitoring Application-Specific Metrics
System metrics tell part of the story. Application metrics provide context about business logic, user behavior, and performance bottlenecks specific to your software.
Most modern applications can expose metrics in Prometheus format. Popular options include:
- MySQL Exporter for database performance metrics
- Nginx VTS Exporter for web server statistics
- Redis Exporter for cache performance
- Custom application metrics via client libraries
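For the last option, the official Python client library makes instrumenting an application straightforward. A minimal sketch, where the metric names and endpoint labels are illustrative examples rather than a convention the library requires:

```python
# Sketch: exposing custom application metrics with the official
# prometheus_client library (pip install prometheus-client).
# Metric names (app_requests_total, app_request_duration_seconds) are examples.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "app_requests_total", "Total HTTP requests handled", ["endpoint"]
)
LATENCY = Histogram(
    "app_request_duration_seconds", "Request duration in seconds", ["endpoint"]
)

def handle_request(endpoint):
    # Record latency and a request count; Prometheus scrapes the running totals.
    with LATENCY.labels(endpoint=endpoint).time():
        REQUESTS.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    # Serve metrics at http://localhost:8000/metrics for Prometheus to scrape
    start_http_server(8000)
    handle_request("/api/users")
```

Point Prometheus at the application by adding a scrape job with the app's host and port to prometheus.yml, the same way the node job was configured.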
For web applications, track request duration, error rates, and throughput. These metrics often reveal performance issues before system metrics show stress.
Database monitoring should include connection pool usage, slow query counts, and replication lag. Advanced Redis performance tuning strategies complement infrastructure monitoring by providing application-level insights.
Scaling Monitoring Infrastructure
As your infrastructure grows, monitoring itself becomes a scalability challenge. Prometheus stores metrics locally, which limits retention and creates single points of failure.
Federation allows a higher-level Prometheus instance to scrape aggregated metrics from other Prometheus servers. Configure a global instance to collect high-level metrics from regional instances.
Remote storage backends like Thanos or Cortex provide long-term storage and high availability. They preserve historical data beyond Prometheus's default 15-day retention.
Service discovery integration automatically updates scrape targets as your infrastructure changes. Kubernetes, Consul, and EC2 integrations reduce manual configuration overhead.
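The simplest form of dynamic targets is file-based discovery, which Prometheus re-reads without a restart. A sketch of a scrape job, with a hypothetical targets directory:

```yaml
scrape_configs:
  - job_name: 'file-discovered'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
```

Each JSON file in that directory lists target addresses with optional labels, and tooling or configuration management can rewrite the files as hosts come and go.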
A clear system monitoring strategy provides guidance on architecting monitoring systems that grow with your infrastructure.
Building solid infrastructure monitoring requires reliable hosting that won't introduce performance variables of its own. Hostperl's VPS hosting solutions provide consistent performance baselines for accurate monitoring, while our dedicated server hosting offers the resources needed for comprehensive observability stacks.
Frequently Asked Questions
How much storage does Prometheus require for metrics?
Storage requirements depend on the number of active time series, the scrape interval, and the retention period. Prometheus compresses samples to roughly 1-2 bytes each, so a server with 1,000 active time series scraped every 15 seconds generates on the order of 10-15 MB per day. Plan for about twice your expected usage to account for metric growth.
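As a rough rule of thumb, Prometheus compresses samples to about 1-2 bytes each, so daily growth can be estimated directly; the sketch below assumes 2 bytes per sample and ignores WAL and index overhead:

```python
def daily_storage_bytes(active_series, scrape_interval_s, bytes_per_sample=2.0):
    """Rough Prometheus TSDB growth per day, ignoring WAL and index overhead."""
    samples_per_day = active_series * (86400 / scrape_interval_s)
    return samples_per_day * bytes_per_sample

# 1,000 series scraped every 15 seconds: about 11.5 MB per day
print(daily_storage_bytes(1000, 15) / 1e6)
```

Doubling the series count or halving the scrape interval doubles the estimate, which is why cardinality control matters as much as retention settings.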
What's the optimal scrape interval for infrastructure monitoring?
15 seconds provides good granularity for most use cases without overwhelming the system. Increase to 30-60 seconds for less critical metrics or when monitoring hundreds of targets.
How do I prevent alert fatigue from too many notifications?
Group related alerts, use appropriate severity levels, and implement escalation policies. Set reasonable thresholds based on your infrastructure's normal behavior patterns rather than arbitrary percentages.
Can Prometheus monitor Windows servers?
Yes, the Windows Exporter (formerly the WMI Exporter) provides Windows metrics in Prometheus format. It exports CPU, memory, disk, and network statistics similar to Node Exporter on Linux systems.
How do I secure Prometheus and Grafana in production?
Enable authentication, use TLS encryption, restrict network access with firewalls, and regularly update to the latest versions. Consider placing monitoring infrastructure behind a VPN or bastion host.

