Monitoring Lab
Build dashboards and explore metrics
Challenges
Learn to interpret system health indicators.
○ Identify all metric cards
- Scroll through the metrics grid section
- Read the name and current value of each metric card
- Note the units displayed (%, MB, req/s, etc.)
Examine the metrics grid at the top of the dashboard 💡 Understanding what each metric measures is the foundation of monitoring. CPU, memory, disk, and network are the core resource metrics of system health; note that Google's SRE book reserves the term "four golden signals" for latency, traffic, errors, and saturation.
○ Understand metric units
- Look at the unit label next to each metric value
- Identify which metrics use percentages vs absolute values
- Note the difference between MB/s, Mbps, and req/s
Compare the units across all metric cards 💡 Units matter. MB/s is megabytes per second (disk throughput), Mbps is megabits per second (network bandwidth), and req/s is requests per second (application load).
○ Read trend indicators
- Find the trend arrow on each metric card
- Identify which metrics are trending up, down, or stable
- Consider whether each trend direction is good or bad
Check the trend arrows on every metric card 💡 A trend direction is not inherently good or bad. CPU trending up means more load, which may be fine during a deploy. Error rate trending up is almost always concerning.
○ Observe sparkline patterns
- Look at the small chart at the bottom of each metric card
- Notice if the line is smooth, spiky, or flat
- Compare sparkline shapes across different metrics
Study the sparkline mini-charts on each metric card 💡 Sparklines show recent history at a glance. Spiky patterns suggest variable load, smooth lines suggest steady state, and sudden changes may indicate incidents.
○ Find the highest value metric
- Compare the numerical values across all metric cards
- Identify which metric has the largest absolute number
- Consider whether absolute value or percentage is more meaningful
Compare all metric values to find the highest one 💡 requests_per_second will likely have the highest raw number, but a cpu_usage_percent of 90% would be far more alarming than 1247 req/s. Context matters more than magnitude.
Interpret alert states, severity levels, and thresholds.
○ Read alert severity levels
- Look at the Active Alerts section
- Identify the severity badge on each alert (info, warning, critical)
- Notice the color coding for each severity level
Examine the alert severity badges and their colors 💡 Alert severity follows a standard hierarchy: info (blue) for awareness, warning (yellow) for potential issues, critical (red) for immediate action required.
○ Identify threshold vs current values
- Find the threshold and current values on each alert card
- Calculate how close the current value is to the threshold
- Determine which alerts are closest to triggering
Compare threshold and current values on each alert 💡 The gap between threshold and current tells you how much headroom you have. A current value at 96% of the threshold means you are very close to firing.
○ Understand alert targets
- Read the target field on each alert
- Identify what system or service each alert is monitoring
- Consider why different targets have different alert types
Check which systems the alerts are targeting 💡 Alert targets identify the specific host, service, or component being monitored. drone-Izar is a build worker while Orchestrator manages the build queue.
○ Compare warning vs info alerts
- Find one warning alert and one info alert
- Compare their urgency and required response
- Think about who should be notified for each type
Contrast the warning and info severity alerts 💡 Info alerts are informational and may not need action. Warning alerts suggest a problem is developing. The response playbook differs significantly between them.
○ Predict which metrics might trigger alerts
- Review the current metric values in the grid
- Look at the alert thresholds defined
- Estimate which metrics could cross their thresholds soon based on trends
Cross-reference metric trends with alert thresholds 💡 Metrics trending upward toward a threshold are pre-alert indicators. Combining trend direction with the gap to threshold lets you predict future alerts.
Build and run PromQL queries to aggregate metric data.
○ Build an avg() query
- Open the PromQL Query Builder section
- Select a metric from the dropdown
- Set the aggregation to avg() and choose a time range
- Click Run Query and observe the result
Use the query builder to create: avg_over_time(cpu_usage_percent[5m]) 💡 avg_over_time() computes the arithmetic mean of samples over the time window; in real PromQL, plain avg() aggregates across series at one instant, while the *_over_time functions aggregate a range vector. For cpu_usage_percent[5m], it averages all CPU samples from the last 5 minutes.
○ Compare aggregation functions
- Run the same metric with avg(), then switch to max()
- Run again with min() and then sum()
- Compare how each aggregation changes the result
Run the same metric with all four aggregation types 💡 avg shows typical behavior, max shows peaks (useful for capacity), min shows valleys, and sum is used when combining values across multiple instances.
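The four aggregations above can be sketched over a single window of samples. This is a minimal Python sketch, not a Prometheus API: `aggregate()` is an illustrative helper, and the sample values are hypothetical CPU readings from one 5-minute window.

```python
# Compare the four common aggregations over one window of samples.

def aggregate(samples, fn):
    """Apply an aggregation (avg, max, min, sum) to a sample window."""
    if fn == "avg":
        return sum(samples) / len(samples)
    if fn == "max":
        return max(samples)
    if fn == "min":
        return min(samples)
    if fn == "sum":
        return sum(samples)
    raise ValueError(f"unknown aggregation: {fn}")

# Hypothetical cpu_usage_percent samples over 5 minutes.
cpu_window = [42.0, 45.0, 51.0, 48.0, 44.0]

for fn in ("avg", "max", "min", "sum"):
    print(f"{fn}: {aggregate(cpu_window, fn)}")
# avg shows typical load, max the peak, min the valley;
# sum only makes sense when combining counts across instances.
```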
○ Change time ranges and observe effects
- Run a query with a 5-minute range
- Change to 15 minutes and run again
- Try 1 hour and 6 hours and note how results differ
Run the same query across all four time ranges 💡 Shorter ranges show recent behavior and are more reactive. Longer ranges smooth out spikes and show trends. Choose based on whether you need alerting (short) or capacity planning (long).
○ Understand [5m] range vector meaning
- Look at the query preview showing the [5m] notation
- Consider what happens if you change it to [1h]
- Think about how many data points are included in each range
Study the range vector notation in the query preview 💡 The [5m] is a range vector selector. It tells Prometheus to look back 5 minutes from now and return all samples in that window. More time means more samples to aggregate.
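The sample count inside a range vector is simple arithmetic. The sketch below assumes a 15-second scrape interval (a common Prometheus default, not something this lab specifies):

```python
# Approximate how many samples a range vector selector covers,
# assuming a hypothetical 15-second scrape interval.

SCRAPE_INTERVAL_S = 15

def samples_in_range(range_seconds, scrape_interval=SCRAPE_INTERVAL_S):
    """Approximate sample count in a range vector like [5m] or [1h]."""
    return range_seconds // scrape_interval

print(samples_in_range(5 * 60))    # [5m] -> 20 samples
print(samples_in_range(60 * 60))   # [1h] -> 240 samples
```

Widening the range from [5m] to [1h] gives the aggregation 12 times as many samples to work with, which is why longer ranges look smoother.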
○ Interpret query results
- Run a query and look at the result panel
- Read the numerical result value
- Note the execution time and sample count
- Consider if the result matches what the metric cards show
Run a query and analyze every field in the result 💡 The result value is your aggregated metric. Execution time tells you query cost. Sample count indicates data density. Live vs simulated source affects accuracy.
Read and correlate dashboard panels to assess system state.
○ Read the CPU gauge
- Find the System Overview panel in the Dashboard Preview
- Read the CPU percentage from the gauge
- Note the fill level of the arc relative to the number
Examine the CPU gauge in the Dashboard Preview section 💡 Gauges map a value to an arc or dial. The fill proportion should match the percentage. A half-filled gauge at 50% means linear mapping between value and visual.
○ Interpret the memory chart panel
- Find the Memory panel in the Dashboard Preview
- Observe what visualization type is used
- Consider what a bar chart vs line chart tells you about memory
Study the Memory panel in the dashboard 💡 Memory panels often show current usage as a proportion of total. Bar charts are good for comparing categories (used vs cached vs free), while line charts show changes over time.
○ Understand RX/TX network values
- Find the Network I/O panel
- Read the RX (receive) and TX (transmit) values
- Consider what the ratio between them tells you about traffic patterns
Read the Network I/O panel RX and TX values 💡 RX is incoming traffic (downloads, responses) and TX is outgoing (uploads, requests). A server typically has higher TX than RX. A client is the opposite.
○ Correlate metrics to overall health
- Review CPU, memory, and network values together
- Assess whether the system appears healthy, stressed, or failing
- Identify which single metric would concern you most
Look at all dashboard panels together and assess system health 💡 No single metric tells the full story. High CPU with low memory is different from high CPU with high memory. Correlating metrics gives you the real picture of system health.
○ Identify problem indicators in dashboard
- Look for any values that seem unusually high or low
- Check if any trends are moving in a concerning direction
- Consider what combination of values would indicate an incident
Scan all dashboard panels for potential problems 💡 Problem indicators include: CPU consistently above 80%, memory approaching limits, error rate climbing, and network saturation. Multiple simultaneous issues are more concerning than isolated ones.
Design effective alert rules with thresholds and conditions.
○ Define a CPU threshold alert
- Decide what CPU percentage should trigger a warning
- Choose an appropriate evaluation window (e.g., 5 minutes)
- Consider whether you want instantaneous or sustained thresholds
Design a PromQL alert: avg_over_time(cpu_usage_percent[5m]) > 80 💡 Good CPU alerts use a time window to avoid flapping on brief spikes. avg_over_time(cpu_usage_percent[5m]) > 80 fires only when the 5-minute average exceeds 80%, so a single short spike cannot trigger it on its own.
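The sustained-threshold idea can be sketched in Python. The function and sample values are illustrative, standing in for the alert engine's evaluation of one window:

```python
def cpu_alert_fires(samples, threshold=80.0):
    """Fire when the window average exceeds the threshold,
    mirroring an averaged-over-time CPU alert condition."""
    return sum(samples) / len(samples) > threshold

print(cpu_alert_fires([85, 90, 88, 84, 86]))  # sustained load -> True
print(cpu_alert_fires([95, 40, 35, 30, 25]))  # brief spike    -> False
```

The second window contains a 95% spike, but its average is only 45%, so the alert stays quiet: exactly the flapping protection the hint describes.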
○ Create a compound condition alert
- Think of two metrics that together indicate a problem
- Write a condition that requires both to be true
- Consider: high CPU AND high memory is worse than either alone
Design: avg_over_time(cpu_usage_percent[5m]) > 70 and avg_over_time(memory_used_gb[5m]) > 14 💡 Compound alerts reduce false positives (note PromQL's operator is lowercase `and`). High CPU alone might be a build running. High CPU plus high memory plus a rising error rate strongly suggests a real problem.
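A compound condition is just a conjunction of the individual checks; this hypothetical sketch uses the thresholds from the challenge:

```python
def compound_alert(cpu_avg, mem_avg_gb, cpu_limit=70.0, mem_limit_gb=14.0):
    """Fire only when BOTH conditions hold, mirroring the `and`
    in a compound alert expression (limits are illustrative)."""
    return cpu_avg > cpu_limit and mem_avg_gb > mem_limit_gb

print(compound_alert(75.0, 15.2))  # both high          -> True
print(compound_alert(75.0, 12.4))  # CPU high, mem fine -> False
```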
○ Set error rate thresholds
- Look at the current error_rate_percent value
- Decide what error rate justifies a warning vs a critical alert
- Consider using different thresholds for different time windows
Design tiered alerts: warning at 1% errors, critical at 5% 💡 Error rate thresholds depend on your SLO. If you promise 99.9% success, anything above 0.1% error rate is an SLO violation. Tiered alerts let you escalate gradually.
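Tiering maps one metric to escalating severities. A minimal sketch, using the 1%/5% thresholds from the challenge (your SLO may dictate different numbers):

```python
def error_severity(error_rate_percent, warn_at=1.0, crit_at=5.0):
    """Map an error rate to a severity tier; thresholds are illustrative."""
    if error_rate_percent >= crit_at:
        return "critical"
    if error_rate_percent >= warn_at:
        return "warning"
    return "ok"

print(error_severity(0.12))  # current dashboard value -> "ok"
print(error_severity(2.5))   # -> "warning"
print(error_severity(6.0))   # -> "critical"
```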
○ Design a memory pressure alert
- Determine total available memory for the system
- Calculate what percentage usage should trigger concern
- Factor in memory that might be cached vs truly used
Design: (memory_used_gb / memory_total_gb) * 100 > 90 💡 Memory pressure alerts should account for OS caching. Linux uses free memory for disk cache, so 90% used might be healthy. Focus on available memory, not just used.
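Accounting for reclaimable cache changes the math substantially. This sketch assumes a hypothetical 16 GB host and treats cached pages as available, as the hint suggests:

```python
def memory_pressure_percent(used_gb, cached_gb, total_gb):
    """Pressure based on memory that is truly unavailable: page cache
    can be reclaimed by the OS, so subtract it from 'used'."""
    truly_used = used_gb - cached_gb
    return truly_used / total_gb * 100

# Hypothetical host: 16 GB total, 12.4 GB used, 4 GB of that is cache.
pressure = memory_pressure_percent(12.4, 4.0, 16.0)
print(f"{pressure:.1f}% -> alert: {pressure > 90}")
```

Raw used/total would read 77.5%, but only 52.5% is genuinely unavailable, which is why the hint says to focus on available memory.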
○ Configure notification channels
- List the severity levels in your alert system
- Decide which channels each severity should use (Slack, PagerDuty, email)
- Consider time-of-day routing for different severity levels
Map: info->Slack, warning->Slack+email, critical->PagerDuty 💡 Notification routing prevents alert fatigue. Info goes to a low-priority channel, warnings notify the team, and critical pages the on-call engineer. Never page for info alerts.
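The routing map above can be written as a plain lookup table. The channel names are the hypothetical ones from this challenge, not a real integration:

```python
# Hypothetical severity -> channel routing table.
ROUTING = {
    "info": ["slack"],
    "warning": ["slack", "email"],
    "critical": ["pagerduty"],
}

def notify_channels(severity):
    """Return the channels for a severity; unknown severities fail loudly."""
    return ROUTING[severity]

print(notify_channels("critical"))  # ['pagerduty'] -- only critical pages
```

Raising a KeyError on an unknown severity is deliberate: silently dropping a misconfigured alert is worse than failing at config time.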
Correlate metrics across dimensions for root cause analysis.
○ Relate CPU usage to disk I/O
- Compare cpu_usage_percent with disk_io_mbps trends
- Consider what workloads cause both to spike
- Think about compilation, logging, or database operations
Compare CPU and disk I/O metric cards side by side 💡 CPU and disk I/O correlate during compilation (CPU-bound with disk writes), database queries (reads + processing), and log-heavy operations. Seeing both spike together points to specific workload types.
○ Correlate request rate with error rate
- Compare requests_per_second with error_rate_percent
- Check if errors increase proportionally with traffic
- Consider whether errors are constant or load-dependent
Overlay request rate and error rate trends mentally 💡 If error rate stays flat as requests increase, errors are independent of load (likely bugs). If errors scale with requests, you have a capacity or resource issue.
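The "does error rate scale with load?" question can be approximated with a sign-of-covariance check. A rough sketch on hypothetical data points; real analysis would use proper correlation over many samples:

```python
def errors_scale_with_load(req_rates, err_rates):
    """Crude check: do errors rise together with request rate?
    Positive covariance suggests load-dependent errors."""
    n = len(req_rates)
    mean_r = sum(req_rates) / n
    mean_e = sum(err_rates) / n
    cov = sum((r - mean_r) * (e - mean_e)
              for r, e in zip(req_rates, err_rates)) / n
    return cov > 0

print(errors_scale_with_load([800, 1000, 1247], [0.10, 0.11, 0.12]))  # True
print(errors_scale_with_load([800, 1000, 1247], [0.12, 0.12, 0.12]))  # False
```

A True result points toward a capacity or resource issue; a flat error rate under rising load points toward a load-independent bug.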
○ Explain memory vs disk relationship
- Consider how memory usage affects disk I/O
- Think about swap, page cache, and buffer cache
- Predict what happens when memory fills up
Analyze the relationship between memory_used_gb and disk_io_mbps 💡 When memory fills, the OS starts swapping to disk, causing disk I/O to spike and performance to degrade. High memory plus rising disk I/O is a classic swap storm indicator.
○ Predict cascade failures
- Imagine CPU hits 100% sustained
- Trace the downstream effects on other metrics
- Map the failure chain: CPU -> response time -> error rate -> queue depth
Trace what happens when CPU saturates completely 💡 CPU saturation leads to slower response times, which causes request queuing, which increases memory usage, which triggers swapping, which increases disk I/O, which worsens CPU contention. This is a cascade.
○ Build a composite health score
- Assign weights to each metric (CPU, memory, disk, errors)
- Create a formula that produces a 0-100 health score
- Define what score ranges mean: healthy, degraded, critical
Design: health = 100 - (cpu*0.3 + mem*0.25 + disk*0.15 + errors*30) 💡 Composite health scores condense multiple metrics into one number. Weight by impact: errors heavily because they directly affect users, CPU moderately, and disk less so.
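The composite score formula can be implemented directly. The weights are the illustrative ones from this challenge, not an industry standard, and expressing memory and disk as percentages of capacity is an assumption:

```python
def health_score(cpu_pct, mem_pct, disk_pct, error_pct):
    """Composite 0-100 health score; weights are illustrative.
    Errors are weighted heavily because they hit users directly."""
    penalty = cpu_pct * 0.3 + mem_pct * 0.25 + disk_pct * 0.15 + error_pct * 30
    return max(0.0, min(100.0, 100.0 - penalty))

# Values loosely based on the simulated dashboard (mem/disk % assumed).
score = health_score(cpu_pct=45, mem_pct=77.5, disk_pct=60, error_pct=0.12)
print(round(score, 1))
```

Clamping to the 0-100 range keeps the score interpretable even when several metrics are pegged at once.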
Architect monitoring stacks and understand the three pillars.
○ Compare metrics vs logs vs traces
- Review the Observability Concepts section at the bottom
- List what each pillar captures that the others cannot
- Identify scenarios where you need all three together
Read the Observability Concepts cards and compare all three pillars 💡 Metrics tell you something is wrong (CPU high), logs tell you what happened (error messages), and traces tell you where it happened (which service in the chain). All three together give full observability.
○ Design a monitoring stack
- Choose tools for metrics collection (Prometheus, Datadog, etc.)
- Choose tools for log aggregation (Loki, ELK, etc.)
- Choose tools for tracing (Jaeger, Tempo, etc.)
- Plan how they integrate together
Design a stack: Prometheus + Grafana + Loki + Tempo 💡 The Grafana ecosystem (Prometheus for metrics, Loki for logs, Tempo for traces, Grafana for visualization) is a popular open-source stack with tight integration between all components.
○ Understand Prometheus scrape model
- Consider how Prometheus collects metrics (pull vs push)
- Think about scrape intervals and their tradeoffs
- Understand service discovery and target configuration
Explain how Prometheus discovers and scrapes metric targets 💡 Prometheus uses a pull model: it scrapes HTTP endpoints at regular intervals. This means services expose a /metrics endpoint and Prometheus fetches it. Pull is simpler but requires network access to targets.
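The pull model has two halves: the service exposing text, and the server fetching it. A minimal sketch of the exposition side, assuming a service rendering a /metrics page; `render_metrics` is an illustrative helper, and a real exposition would also include HELP/TYPE comment lines:

```python
def render_metrics(metrics):
    """Format a dict of {metric_name: value} as one-sample-per-line
    text, in the spirit of the Prometheus exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics({"cpu_usage_percent": 45, "error_rate_percent": 0.12})
print(body)
```

Prometheus then issues an HTTP GET against this endpoint on every scrape interval, which is why the pull model requires network reachability from the server to every target.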
○ Plan a Grafana dashboard hierarchy
- Design a top-level overview dashboard
- Plan drill-down dashboards for each service
- Consider how users navigate from high-level to detailed views
Design: Overview -> Service -> Instance -> Debug dashboard flow 💡 Good dashboard hierarchies follow a top-down pattern: fleet overview (all services), service detail (one service, all instances), instance detail (one instance, all metrics), and debug (raw data).
○ Explain service mesh observability
- Consider what a service mesh (Istio, Linkerd) provides automatically
- List the metrics you get without code changes
- Compare mesh observability to manual instrumentation
Describe what observability data a service mesh provides for free 💡 Service meshes inject sidecar proxies that automatically capture request rate, error rate, and latency (RED metrics) for every service-to-service call without any code changes.
Build incident response processes, runbooks, and SLO definitions.
○ Build an on-call runbook
- Choose one alert from the Active Alerts section
- Write step-by-step investigation instructions
- Include commands to run, dashboards to check, and escalation criteria
Write a runbook for the High Memory Usage alert 💡 Good runbooks are specific and actionable: 1) Check which process is using memory (top/htop), 2) Check if swap is active (free -h), 3) Look for memory leaks (process RSS over time), 4) Escalate if OOM killer triggered.
○ Design escalation paths
- Define three escalation levels (L1, L2, L3)
- Assign response time expectations for each level
- Determine criteria for escalating from one level to the next
Map: L1 on-call (15min) -> L2 team lead (30min) -> L3 architect (1hr) 💡 Escalation paths ensure no incident gets stuck. L1 handles known issues with runbooks. L2 handles novel problems needing deeper expertise. L3 handles architecture-level failures requiring system redesign.
○ Create SLI/SLO definitions
- Define a Service Level Indicator for request latency
- Set a Service Level Objective target (e.g., p99 < 200ms)
- Calculate what error budget this SLO allows over 30 days
Define: SLI = proportion of requests served in under 200ms; SLO = 99.9% over 30 days 💡 SLIs are the metrics you measure (latency, availability, throughput). SLOs are the targets you set (99.9% under 200ms). The gap between 100% and your SLO is your error budget for deployments and experiments.
○ Implement error budgets
- Calculate the error budget from a 99.9% SLO over 30 days
- Determine how many minutes of downtime that allows
- Plan what actions to take when the budget is nearly exhausted
Calculate: 30 days * 24h * 60m * 0.001 = 43.2 minutes/month error budget 💡 A 99.9% SLO gives 43.2 minutes of downtime per month. When budget is >50% remaining, deploy freely. At 25% remaining, slow deployments. At 0%, freeze changes until the budget replenishes.
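The budget calculation generalizes to any availability SLO:

```python
def error_budget_minutes(slo, days=30):
    """Downtime allowed by an availability SLO over a window, in minutes."""
    return days * 24 * 60 * (1 - slo)

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes/month
print(round(error_budget_minutes(0.9999), 1))  # 4.3 minutes/month
```

Each extra nine cuts the budget tenfold, which is why "four nines" services need automated failover: 4.3 minutes is not enough time for a human to respond.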
○ Practice incident timeline construction
- Imagine an alert fires at 14:00 for high CPU
- Create a timeline of events: detection, investigation, mitigation, resolution
- Include timestamps, actions taken, and who was involved
Build a timeline: 14:00 alert -> 14:05 ack -> 14:15 root cause -> 14:30 resolved 💡 Incident timelines are critical for post-mortems. Key timestamps: detection (when alert fired), acknowledgment (when human responded), mitigation (when bleeding stopped), resolution (when fully fixed), and follow-up (when post-mortem completed).
What You Can Do
- Interactive simulated metrics dashboard
- PromQL query builder and executor
- Alert rule configuration and preview
- Dashboard gauge and chart visualization
- Observability concept reference material
Learning Objectives
Beginner
- Read and understand system metrics
- Interpret alert states and severity
Intermediate
- Build PromQL queries
- Interpret dashboard panels and gauges
Advanced
- Design effective alert rules
- Correlate metrics for root cause analysis
Expert
- Architect monitoring stacks
- Build incident response processes
Live Metrics (Simulated)
- cpu_usage_percent: 45 % (→ stable)
- memory_used_gb: 12.4 GB (↑ up)
- disk_io_mbps: 156 MB/s (↓ down)
- network_rx_mbps: 89 Mbps (↑ up)
- requests_per_second: 1247 req/s (→ stable)
- error_rate_percent: 0.12 % (↓ down)
PromQL Query Builder
Example query: avg_over_time(cpu_usage_percent[5m])
Active Alerts
- WARNING: High Memory Usage (target: drone-Izar), threshold 85%, current 82%
- INFO: Build Queue Growing (target: Orchestrator), threshold 50, current 47
Dashboard Preview
- System Overview: CPU Usage gauge at 45%
- Memory: usage chart
- Network I/O: RX 89 Mbps, TX 45 Mbps
Observability Concepts
📊 Metrics
Numerical measurements collected at regular intervals. Time-series data for tracking system health.
Example: cpu_usage{host="drone-Izar"} 0.45
📝 Logs
Timestamped records of discrete events. Essential for debugging and audit trails.
Example: [2026-02-02 12:00:00] INFO: Build completed
🔗 Traces
Request flow through distributed systems. Shows latency and dependencies.
Example: trace_id=abc123 span_id=def456