Monitoring Lab
Build dashboards and explore metrics
Challenges
Learn to interpret system health indicators.
○ Identify all metric cards
- Scroll through the metrics grid section
- Read the name and current value of each metric card
- Note the units displayed (%, MB, req/s, etc.)
Examine the metrics grid at the top of the dashboard 💡 Understanding what each metric measures is the foundation of monitoring. CPU, memory, disk, and network are the core resource metrics of system health; note that Google's SRE book reserves the term "four golden signals" for latency, traffic, errors, and saturation.
○ Understand metric units
- Look at the unit label next to each metric value
- Identify which metrics use percentages vs absolute values
- Note the difference between MB/s, Mbps, and req/s
Compare the units across all metric cards 💡 Units matter. MB/s is megabytes per second (disk throughput), Mbps is megabits per second (network bandwidth), and req/s is requests per second (application load).
○ Read trend indicators
- Find the trend arrow on each metric card
- Identify which metrics are trending up, down, or stable
- Consider whether each trend direction is good or bad
Check the trend arrows on every metric card 💡 A trend direction is not inherently good or bad. CPU trending up means more load, which may be fine during a deploy. Error rate trending up is almost always concerning.
○ Observe sparkline patterns
- Look at the small chart at the bottom of each metric card
- Notice if the line is smooth, spiky, or flat
- Compare sparkline shapes across different metrics
Study the sparkline mini-charts on each metric card 💡 Sparklines show recent history at a glance. Spiky patterns suggest variable load, smooth lines suggest steady state, and sudden changes may indicate incidents.
○ Find the highest value metric
- Compare the numerical values across all metric cards
- Identify which metric has the largest absolute number
- Consider whether absolute value or percentage is more meaningful
Compare all metric values to find the highest one 💡 requests_per_second will likely have the highest raw number, but a cpu_usage_percent of 90% would be far more alarming than 1247 req/s. Context matters more than magnitude.
Interpret alert states, severity levels, and thresholds.
○ Read alert severity levels
- Look at the Active Alerts section
- Identify the severity badge on each alert (info, warning, critical)
- Notice the color coding for each severity level
Examine the alert severity badges and their colors 💡 Alert severity follows a standard hierarchy: info (blue) for awareness, warning (yellow) for potential issues, critical (red) for immediate action required.
○ Identify threshold vs current values
- Find the threshold and current values on each alert card
- Calculate how close the current value is to the threshold
- Determine which alerts are closest to triggering
Compare threshold and current values on each alert 💡 The gap between threshold and current tells you how much headroom you have. A current value at 96% of the threshold means you are very close to firing.
○ Understand alert targets
- Read the target field on each alert
- Identify what system or service each alert is monitoring
- Consider why different targets have different alert types
Check which systems the alerts are targeting 💡 Alert targets identify the specific host, service, or component being monitored. drone-Izar is a build worker while Orchestrator manages the build queue.
○ Compare warning vs info alerts
- Find one warning alert and one info alert
- Compare their urgency and required response
- Think about who should be notified for each type
Contrast the warning and info severity alerts 💡 Info alerts are informational and may not need action. Warning alerts suggest a problem is developing. The response playbook differs significantly between them.
○ Predict which metrics might trigger alerts
- Review the current metric values in the grid
- Look at the alert thresholds defined
- Estimate which metrics could cross their thresholds soon based on trends
Cross-reference metric trends with alert thresholds 💡 Metrics trending upward toward a threshold are pre-alert indicators. Combining trend direction with the gap to threshold lets you predict future alerts.
Build and run PromQL queries to aggregate metric data.
○ Build an avg() query
- Open the PromQL Query Builder section
- Select a metric from the dropdown
- Set the aggregation to avg() and choose a time range
- Click Run Query and observe the result
Use the query builder to create: avg_over_time(cpu_usage_percent[5m]) 💡 avg_over_time() computes the arithmetic mean of samples over the time window; in real PromQL, plain avg() aggregates across series at one instant, while the *_over_time functions aggregate a range vector. For cpu_usage_percent[5m], it averages all CPU samples from the last 5 minutes.
○ Compare aggregation functions
- Run the same metric with avg(), then switch to max()
- Run again with min() and then sum()
- Compare how each aggregation changes the result
Run the same metric with all four aggregation types 💡 avg shows typical behavior, max shows peaks (useful for capacity), min shows valleys, and sum is used when combining values across multiple instances.
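The four aggregations above can be sketched over a single window of samples. This is a minimal Python sketch, not a Prometheus API: `aggregate()` is an illustrative helper, and the sample values are hypothetical CPU readings from one 5-minute window.

```python
# Compare the four common aggregations over one window of samples.

def aggregate(samples, fn):
    """Apply an aggregation (avg, max, min, sum) to a sample window."""
    if fn == "avg":
        return sum(samples) / len(samples)
    if fn == "max":
        return max(samples)
    if fn == "min":
        return min(samples)
    if fn == "sum":
        return sum(samples)
    raise ValueError(f"unknown aggregation: {fn}")

# Hypothetical cpu_usage_percent samples over 5 minutes.
cpu_window = [42.0, 45.0, 51.0, 48.0, 44.0]

for fn in ("avg", "max", "min", "sum"):
    print(f"{fn}: {aggregate(cpu_window, fn)}")
# avg shows typical load, max the peak, min the valley;
# sum only makes sense when combining counts across instances.
```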
○ Change time ranges and observe effects
- Run a query with a 5-minute range
- Change to 15 minutes and run again
- Try 1 hour and 6 hours and note how results differ
Run the same query across all four time ranges 💡 Shorter ranges show recent behavior and are more reactive. Longer ranges smooth out spikes and show trends. Choose based on whether you need alerting (short) or capacity planning (long).
○ Understand [5m] range vector meaning
- Look at the query preview showing the [5m] notation
- Consider what happens if you change it to [1h]
- Think about how many data points are included in each range
Study the range vector notation in the query preview 💡 The [5m] is a range vector selector. It tells Prometheus to look back 5 minutes from now and return all samples in that window. More time means more samples to aggregate.
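The sample count inside a range vector is simple arithmetic. The sketch below assumes a 15-second scrape interval (a common Prometheus default, not something this lab specifies):

```python
# Approximate how many samples a range vector selector covers,
# assuming a hypothetical 15-second scrape interval.

SCRAPE_INTERVAL_S = 15

def samples_in_range(range_seconds, scrape_interval=SCRAPE_INTERVAL_S):
    """Approximate sample count in a range vector like [5m] or [1h]."""
    return range_seconds // scrape_interval

print(samples_in_range(5 * 60))    # [5m] -> 20 samples
print(samples_in_range(60 * 60))   # [1h] -> 240 samples
```

Widening the range from [5m] to [1h] gives the aggregation 12 times as many samples to work with, which is why longer ranges look smoother.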
○ Interpret query results
- Run a query and look at the result panel
- Read the numerical result value
- Note the execution time and sample count
- Consider if the result matches what the metric cards show
Run a query and analyze every field in the result 💡 The result value is your aggregated metric. Execution time tells you query cost. Sample count indicates data density. Live vs simulated source affects accuracy.
Read and correlate dashboard panels to assess system state.
○ Read the CPU gauge
- Find the System Overview panel in the Dashboard Preview
- Read the CPU percentage from the gauge
- Note the fill level of the arc relative to the number
Examine the CPU gauge in the Dashboard Preview section 💡 Gauges map a value to an arc or dial. The fill proportion should match the percentage. A half-filled gauge at 50% means linear mapping between value and visual.
○ Interpret the memory chart panel
- Find the Memory panel in the Dashboard Preview
- Observe what visualization type is used
- Consider what a bar chart vs line chart tells you about memory
Study the Memory panel in the dashboard 💡 Memory panels often show current usage as a proportion of total. Bar charts are good for comparing categories (used vs cached vs free), while line charts show changes over time.
○ Understand RX/TX network values
- Find the Network I/O panel
- Read the RX (receive) and TX (transmit) values
- Consider what the ratio between them tells you about traffic patterns
Read the Network I/O panel RX and TX values 💡 RX is incoming traffic (downloads, responses) and TX is outgoing (uploads, requests). A server typically has higher TX than RX. A client is the opposite.
○ Correlate metrics to overall health
- Review CPU, memory, and network values together
- Assess whether the system appears healthy, stressed, or failing
- Identify which single metric would concern you most
Look at all dashboard panels together and assess system health 💡 No single metric tells the full story. High CPU with low memory is different from high CPU with high memory. Correlating metrics gives you the real picture of system health.
○ Identify problem indicators in dashboard
- Look for any values that seem unusually high or low
- Check if any trends are moving in a concerning direction
- Consider what combination of values would indicate an incident
Scan all dashboard panels for potential problems 💡 Problem indicators include: CPU consistently above 80%, memory approaching limits, error rate climbing, and network saturation. Multiple simultaneous issues are more concerning than isolated ones.
Design effective alert rules with thresholds and conditions.
○ Define a CPU threshold alert
- Decide what CPU percentage should trigger a warning
- Choose an appropriate evaluation window (e.g., 5 minutes)
- Consider whether you want instantaneous or sustained thresholds
Design a PromQL alert: avg_over_time(cpu_usage_percent[5m]) > 80 💡 Good CPU alerts use a time window to avoid flapping on brief spikes. avg_over_time(cpu_usage_percent[5m]) > 80 fires only when the 5-minute average exceeds 80%, so a single short spike cannot trigger it on its own.
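The sustained-threshold idea can be sketched in Python. The function and sample values are illustrative, standing in for the alert engine's evaluation of one window:

```python
def cpu_alert_fires(samples, threshold=80.0):
    """Fire when the window average exceeds the threshold,
    mirroring an averaged-over-time CPU alert condition."""
    return sum(samples) / len(samples) > threshold

print(cpu_alert_fires([85, 90, 88, 84, 86]))  # sustained load -> True
print(cpu_alert_fires([95, 40, 35, 30, 25]))  # brief spike    -> False
```

The second window contains a 95% spike, but its average is only 45%, so the alert stays quiet: exactly the flapping protection the hint describes.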
○ Create a compound condition alert
- Think of two metrics that together indicate a problem
- Write a condition that requires both to be true
- Consider: high CPU AND high memory is worse than either alone
Design: avg_over_time(cpu_usage_percent[5m]) > 70 and avg_over_time(memory_used_gb[5m]) > 14 💡 Compound alerts reduce false positives (note PromQL's operator is lowercase `and`). High CPU alone might be a build running. High CPU plus high memory plus a rising error rate strongly suggests a real problem.
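A compound condition is just a conjunction of the individual checks; this hypothetical sketch uses the thresholds from the challenge:

```python
def compound_alert(cpu_avg, mem_avg_gb, cpu_limit=70.0, mem_limit_gb=14.0):
    """Fire only when BOTH conditions hold, mirroring the `and`
    in a compound alert expression (limits are illustrative)."""
    return cpu_avg > cpu_limit and mem_avg_gb > mem_limit_gb

print(compound_alert(75.0, 15.2))  # both high          -> True
print(compound_alert(75.0, 12.4))  # CPU high, mem fine -> False
```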
○ Set error rate thresholds
- Look at the current error_rate_percent value
- Decide what error rate justifies a warning vs a critical alert
- Consider using different thresholds for different time windows
Design tiered alerts: warning at 1% errors, critical at 5% 💡 Error rate thresholds depend on your SLO. If you promise 99.9% success, anything above 0.1% error rate is an SLO violation. Tiered alerts let you escalate gradually.
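Tiering maps one metric to escalating severities. A minimal sketch, using the 1%/5% thresholds from the challenge (your SLO may dictate different numbers):

```python
def error_severity(error_rate_percent, warn_at=1.0, crit_at=5.0):
    """Map an error rate to a severity tier; thresholds are illustrative."""
    if error_rate_percent >= crit_at:
        return "critical"
    if error_rate_percent >= warn_at:
        return "warning"
    return "ok"

print(error_severity(0.12))  # current dashboard value -> "ok"
print(error_severity(2.5))   # -> "warning"
print(error_severity(6.0))   # -> "critical"
```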
○ Design a memory pressure alert
- Determine total available memory for the system
- Calculate what percentage usage should trigger concern
- Factor in memory that might be cached vs truly used
Design: (memory_used_gb / memory_total_gb) * 100 > 90 💡 Memory pressure alerts should account for OS caching. Linux uses free memory for disk cache, so 90% used might be healthy. Focus on available memory, not just used.
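Accounting for reclaimable cache changes the math substantially. This sketch assumes a hypothetical 16 GB host and treats cached pages as available, as the hint suggests:

```python
def memory_pressure_percent(used_gb, cached_gb, total_gb):
    """Pressure based on memory that is truly unavailable: page cache
    can be reclaimed by the OS, so subtract it from 'used'."""
    truly_used = used_gb - cached_gb
    return truly_used / total_gb * 100

# Hypothetical host: 16 GB total, 12.4 GB used, 4 GB of that is cache.
pressure = memory_pressure_percent(12.4, 4.0, 16.0)
print(f"{pressure:.1f}% -> alert: {pressure > 90}")
```

Raw used/total would read 77.5%, but only 52.5% is genuinely unavailable, which is why the hint says to focus on available memory.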
○ Configure notification channels
- List the severity levels in your alert system
- Decide which channels each severity should use (Slack, PagerDuty, email)
- Consider time-of-day routing for different severity levels
Map: info->Slack, warning->Slack+email, critical->PagerDuty 💡 Notification routing prevents alert fatigue. Info goes to a low-priority channel, warnings notify the team, and critical pages the on-call engineer. Never page for info alerts.
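The routing map above can be written as a plain lookup table. The channel names are the hypothetical ones from this challenge, not a real integration:

```python
# Hypothetical severity -> channel routing table.
ROUTING = {
    "info": ["slack"],
    "warning": ["slack", "email"],
    "critical": ["pagerduty"],
}

def notify_channels(severity):
    """Return the channels for a severity; unknown severities fail loudly."""
    return ROUTING[severity]

print(notify_channels("critical"))  # ['pagerduty'] -- only critical pages
```

Raising a KeyError on an unknown severity is deliberate: silently dropping a misconfigured alert is worse than failing at config time.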
Correlate metrics across dimensions for root cause analysis.
○ Relate CPU usage to disk I/O
- Compare cpu_usage_percent with disk_io_mbps trends
- Consider what workloads cause both to spike
- Think about compilation, logging, or database operations
Compare CPU and disk I/O metric cards side by side 💡 CPU and disk I/O correlate during compilation (CPU-bound with disk writes), database queries (reads + processing), and log-heavy operations. Seeing both spike together points to specific workload types.
○ Correlate request rate with error rate
- Compare requests_per_second with error_rate_percent
- Check if errors increase proportionally with traffic
- Consider whether errors are constant or load-dependent
Overlay request rate and error rate trends mentally 💡 If error rate stays flat as requests increase, errors are independent of load (likely bugs). If errors scale with requests, you have a capacity or resource issue.
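The "does error rate scale with load?" question can be approximated with a sign-of-covariance check. A rough sketch on hypothetical data points; real analysis would use proper correlation over many samples:

```python
def errors_scale_with_load(req_rates, err_rates):
    """Crude check: do errors rise together with request rate?
    Positive covariance suggests load-dependent errors."""
    n = len(req_rates)
    mean_r = sum(req_rates) / n
    mean_e = sum(err_rates) / n
    cov = sum((r - mean_r) * (e - mean_e)
              for r, e in zip(req_rates, err_rates)) / n
    return cov > 0

print(errors_scale_with_load([800, 1000, 1247], [0.10, 0.11, 0.12]))  # True
print(errors_scale_with_load([800, 1000, 1247], [0.12, 0.12, 0.12]))  # False
```

A True result points toward a capacity or resource issue; a flat error rate under rising load points toward a load-independent bug.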
○ Explain memory vs disk relationship
- Consider how memory usage affects disk I/O
- Think about swap, page cache, and buffer cache
- Predict what happens when memory fills up
Analyze the relationship between memory_used_gb and disk_io_mbps 💡 When memory fills, the OS starts swapping to disk, causing disk I/O to spike and performance to degrade. High memory plus rising disk I/O is a classic swap storm indicator.
○ Predict cascade failures
- Imagine CPU hits 100% sustained
- Trace the downstream effects on other metrics
- Map the failure chain: CPU -> response time -> error rate -> queue depth
Trace what happens when CPU saturates completely 💡 CPU saturation leads to slower response times, which causes request queuing, which increases memory usage, which triggers swapping, which increases disk I/O, which worsens CPU contention. This is a cascade.
○ Build a composite health score
- Assign weights to each metric (CPU, memory, disk, errors)
- Create a formula that produces a 0-100 health score
- Define what score ranges mean: healthy, degraded, critical
Design: health = 100 - (cpu*0.3 + mem*0.25 + disk*0.15 + errors*30) 💡 Composite health scores condense multiple metrics into one number. Weight by impact: errors heavily because they directly affect users, CPU moderately, and disk less so.
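The composite score formula can be implemented directly. The weights are the illustrative ones from this challenge, not an industry standard, and expressing memory and disk as percentages of capacity is an assumption:

```python
def health_score(cpu_pct, mem_pct, disk_pct, error_pct):
    """Composite 0-100 health score; weights are illustrative.
    Errors are weighted heavily because they hit users directly."""
    penalty = cpu_pct * 0.3 + mem_pct * 0.25 + disk_pct * 0.15 + error_pct * 30
    return max(0.0, min(100.0, 100.0 - penalty))

# Values loosely based on the simulated dashboard (mem/disk % assumed).
score = health_score(cpu_pct=45, mem_pct=77.5, disk_pct=60, error_pct=0.12)
print(round(score, 1))
```

Clamping to the 0-100 range keeps the score interpretable even when several metrics are pegged at once.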
Architect monitoring stacks and understand the three pillars.
○ Compare metrics vs logs vs traces
- Review the Observability Concepts section at the bottom
- List what each pillar captures that the others cannot
- Identify scenarios where you need all three together
Read the Observability Concepts cards and compare all three pillars 💡 Metrics tell you something is wrong (CPU high), logs tell you what happened (error messages), and traces tell you where it happened (which service in the chain). All three together give full observability.
○ Design a monitoring stack
- Choose tools for metrics collection (Prometheus, Datadog, etc.)
- Choose tools for log aggregation (Loki, ELK, etc.)
- Choose tools for tracing (Jaeger, Tempo, etc.)
- Plan how they integrate together
Design a stack: Prometheus + Grafana + Loki + Tempo 💡 The Grafana ecosystem (Prometheus for metrics, Loki for logs, Tempo for traces, Grafana for visualization) is a popular open-source stack with tight integration between all components.
○ Understand Prometheus scrape model
- Consider how Prometheus collects metrics (pull vs push)
- Think about scrape intervals and their tradeoffs
- Understand service discovery and target configuration
Explain how Prometheus discovers and scrapes metric targets 💡 Prometheus uses a pull model: it scrapes HTTP endpoints at regular intervals. This means services expose a /metrics endpoint and Prometheus fetches it. Pull is simpler but requires network access to targets.
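The pull model has two halves: the service exposing text, and the server fetching it. A minimal sketch of the exposition side, assuming a service rendering a /metrics page; `render_metrics` is an illustrative helper, and a real exposition would also include HELP/TYPE comment lines:

```python
def render_metrics(metrics):
    """Format a dict of {metric_name: value} as one-sample-per-line
    text, in the spirit of the Prometheus exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics({"cpu_usage_percent": 45, "error_rate_percent": 0.12})
print(body)
```

Prometheus then issues an HTTP GET against this endpoint on every scrape interval, which is why the pull model requires network reachability from the server to every target.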
○ Plan a Grafana dashboard hierarchy
- Design a top-level overview dashboard
- Plan drill-down dashboards for each service
- Consider how users navigate from high-level to detailed views
Design: Overview -> Service -> Instance -> Debug dashboard flow 💡 Good dashboard hierarchies follow a top-down pattern: fleet overview (all services), service detail (one service, all instances), instance detail (one instance, all metrics), and debug (raw data).
○ Explain service mesh observability
- Consider what a service mesh (Istio, Linkerd) provides automatically
- List the metrics you get without code changes
- Compare mesh observability to manual instrumentation
Describe what observability data a service mesh provides for free 💡 Service meshes inject sidecar proxies that automatically capture request rate, error rate, and latency (RED metrics) for every service-to-service call without any code changes.
Build incident response processes, runbooks, and SLO definitions.
○ Build an on-call runbook
- Choose one alert from the Active Alerts section
- Write step-by-step investigation instructions
- Include commands to run, dashboards to check, and escalation criteria
Write a runbook for the High Memory Usage alert 💡 Good runbooks are specific and actionable: 1) Check which process is using memory (top/htop), 2) Check if swap is active (free -h), 3) Look for memory leaks (process RSS over time), 4) Escalate if OOM killer triggered.
○ Design escalation paths
- Define three escalation levels (L1, L2, L3)
- Assign response time expectations for each level
- Determine criteria for escalating from one level to the next
Map: L1 on-call (15min) -> L2 team lead (30min) -> L3 architect (1hr) 💡 Escalation paths ensure no incident gets stuck. L1 handles known issues with runbooks. L2 handles novel problems needing deeper expertise. L3 handles architecture-level failures requiring system redesign.
○ Create SLI/SLO definitions
- Define a Service Level Indicator for request latency
- Set a Service Level Objective target (e.g., p99 < 200ms)
- Calculate what error budget this SLO allows over 30 days
Define: SLI = proportion of requests served in under 200ms; SLO = 99.9% over 30 days 💡 SLIs are the metrics you measure (latency, availability, throughput). SLOs are the targets you set (99.9% under 200ms). The gap between 100% and your SLO is your error budget for deployments and experiments.
○ Implement error budgets
- Calculate the error budget from a 99.9% SLO over 30 days
- Determine how many minutes of downtime that allows
- Plan what actions to take when the budget is nearly exhausted
Calculate: 30 days * 24h * 60m * 0.001 = 43.2 minutes/month error budget 💡 A 99.9% SLO gives 43.2 minutes of downtime per month. When budget is >50% remaining, deploy freely. At 25% remaining, slow deployments. At 0%, freeze changes until the budget replenishes.
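The budget calculation generalizes to any availability SLO:

```python
def error_budget_minutes(slo, days=30):
    """Downtime allowed by an availability SLO over a window, in minutes."""
    return days * 24 * 60 * (1 - slo)

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes/month
print(round(error_budget_minutes(0.9999), 1))  # 4.3 minutes/month
```

Each extra nine cuts the budget tenfold, which is why "four nines" services need automated failover: 4.3 minutes is not enough time for a human to respond.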
○ Practice incident timeline construction
- Imagine an alert fires at 14:00 for high CPU
- Create a timeline of events: detection, investigation, mitigation, resolution
- Include timestamps, actions taken, and who was involved
Build a timeline: 14:00 alert -> 14:05 ack -> 14:15 root cause -> 14:30 resolved 💡 Incident timelines are critical for post-mortems. Key timestamps: detection (when alert fired), acknowledgment (when human responded), mitigation (when bleeding stopped), resolution (when fully fixed), and follow-up (when post-mortem completed).
What You Can Do
- Interactive simulated metrics dashboard
- PromQL query builder and executor
- Alert rule configuration and preview
- Dashboard gauge and chart visualization
- Observability concept reference material
Learning Objectives
Beginner
- Read and understand system metrics
- Interpret alert states and severity
Intermediate
- Build PromQL queries
- Interpret dashboard panels and gauges
Advanced
- Design effective alert rules
- Correlate metrics for root cause analysis
Expert
- Architect monitoring stacks
- Build incident response processes
Live Metrics (Simulated)
- cpu_usage_percent: 45 % (→ stable)
- memory_used_gb: 12.4 GB (↑ up)
- disk_io_mbps: 156 MB/s (↓ down)
- network_rx_mbps: 89 Mbps (↑ up)
- requests_per_second: 1247 req/s (→ stable)
- error_rate_percent: 0.12 % (↓ down)
PromQL Query Builder
Example query: avg_over_time(cpu_usage_percent[5m])
Active Alerts
- WARNING: High Memory Usage (target: drone-Izar), threshold 85%, current 82%
- INFO: Build Queue Growing (target: Orchestrator), threshold 50, current 47
Dashboard Preview
- System Overview: CPU Usage gauge at 45%
- Memory: usage chart
- Network I/O: RX 89 Mbps, TX 45 Mbps
Observability Concepts
📊 Metrics
Numerical measurements collected at regular intervals. Time-series data for tracking system health.
Example: cpu_usage{host="drone-Izar"} 0.45
📝 Logs
Timestamped records of discrete events. Essential for debugging and audit trails.
Example: [2026-02-02 12:00:00] INFO: Build completed
🔗 Traces
Request flow through distributed systems. Shows latency and dependencies.
Example: trace_id=abc123 span_id=def456