Monitoring & Alerting System Overview
What is the Monitoring System?
The Sampo Monitoring & Alerting System provides real-time visibility into application health, performance, and business metrics with automated notifications when issues are detected.
Architecture Components
┌─────────────┐      ┌──────────────┐      ┌──────────────┐
│ NestJS API  │─────▶│  Prometheus  │─────▶│ Alertmanager │
│  /metrics   │      │  (Metrics)   │      │   (Alerts)   │
└─────────────┘      └──────────────┘      └──────────────┘
                            │                      │
                            ▼                      ▼
                     ┌──────────────┐      ┌───────────────┐
                     │   Grafana    │      │     Email     │
                     │ (Dashboards) │      │(Notifications)│
                     └──────────────┘      └───────────────┘
Components:
- Prometheus - Collects metrics every 60 seconds from API and exporters
- Alertmanager - Evaluates alert rules and sends notifications
- Grafana - Visualizes metrics in dashboards (health + business)
- Exporters - Collect system, database, and Redis metrics
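For orientation, the scrape side of this pipeline can be sketched as a Prometheus configuration excerpt. This is an illustrative sketch, not the deployed file: job names and target addresses are assumptions, while the 60-second interval and the exporter ports come from this document.

```yaml
# prometheus.yml (excerpt) -- illustrative sketch, not the deployed config
global:
  scrape_interval: 60s              # matches the documented 60-second collection
scrape_configs:
  - job_name: api                   # NestJS /metrics endpoint
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:3003"]
  - job_name: node                  # Node Exporter (system metrics)
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: postgres              # PostgreSQL Exporter
    static_configs:
      - targets: ["localhost:9187"]
  - job_name: redis                 # Redis Exporter
    static_configs:
      - targets: ["localhost:9121"]
```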
Key Features
Automated Alert Monitoring (25+ Alerts)
Application Health:
- API availability monitoring (1-minute detection)
- Database connection health
- Redis/queue health
- Memory usage tracking
Performance Monitoring:
- Slow database queries (>100ms)
- High API response times (>2s)
- Error rate tracking (5xx errors)
Infrastructure Monitoring:
- Memory usage (warning at 70%, critical at 90%)
- CPU usage (warning at 80%)
- Container restart detection
Business Metrics:
- Submission tracking (received, converted, rejected)
- Conversion rate monitoring
- Performance degradation detection
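As a concrete illustration of one of the 25+ rules, a Prometheus alerting rule for the memory-usage warning might look as follows. The metric name `app_memory_usage_percent` is a hypothetical placeholder; the 70% threshold and the 10-minute window are the documented values.

```yaml
# alert-rules.yml (excerpt) -- sketch of a single rule; metric name is assumed
groups:
  - name: infrastructure
    rules:
      - alert: MemoryUsageWarning
        expr: app_memory_usage_percent > 70   # hypothetical metric name
        for: 10m                              # must hold for 10 minutes
        labels:
          severity: warning
          deployment: blueline
        annotations:
          description: "Memory usage has exceeded 70% for more than 10 minutes"
```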
Notification System
Email Notifications (Active):
- HTML-formatted alerts with severity color coding
- Alert grouping (multiple similar alerts → 1 email)
- Resolution notifications (when issues resolve)
- Critical alerts sent immediately (10s group wait)
- Warning alerts batched (1m group wait)
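The email behaviour above maps onto an Alertmanager receiver. The excerpt below is a hedged sketch: the receiver name, address, and template name are placeholders, while `send_resolved` is the setting that enables the documented resolution notifications.

```yaml
# alertmanager.yml (excerpt) -- receiver sketch with placeholder values
receivers:
  - name: email-ops                                 # hypothetical receiver name
    email_configs:
      - to: ops@example.com                         # placeholder address
        send_resolved: true                         # emails when an alert resolves
        html: '{{ template "email.custom.html" . }}'  # HTML severity formatting
```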
Severity Levels:
| Severity | Response Time | Examples                                    |
| -------- | ------------- | ------------------------------------------- |
| CRITICAL | Minutes       | API down, database failing, high error rate |
| WARNING  | Hours         | Slow queries, high memory, no submissions   |
Ready for Expansion:
- Slack integration (commented out, ready to enable)
- PagerDuty integration (commented out, ready to enable)
Dashboards
Health Monitoring Dashboard (10 Panels)
Access: Grafana → "Sampo Health Monitoring - BlueLine Alpha"
Panels:
- Overall Health Status - Real-time health check status
- Database Response Time - p50, p95, p99 percentiles
- Memory Usage - Percentage with thresholds
- Database Health - Connection status and latency
- Redis Health - Queue system status
- HTTP Request Rate - Requests per second by route
- HTTP Error Rate - 4xx and 5xx errors
- API Response Time - Request duration by percentile
- Health Check Duration - Component check performance
- Component Status - Database, Redis, Memory health
Business Metrics Dashboard (9 Panels)
Access: Grafana → "Sampo Business Metrics - BlueLine Alpha"
Panels:
- Submissions Received - Last 24h total
- Conversion Rate - Percentage gauge (target: >60%)
- Rejection Rate - Percentage gauge (target: <20%)
- Submissions Reconciled - Order processing count
- Submissions Timeline - Received/converted/rejected trends
- Conversion Duration - p50, p90, p99 processing time
- Submissions by Source - Pie chart breakdown
- Conversion Funnel - Visual pipeline (received → converted)
- Submissions by Deployment - Multi-deployment breakdown
Alert Examples
Critical Alert: API Down
Trigger: API unreachable for more than 1 minute
Email Notification:
Subject: [FIRING:1] APIDown BlueLine Alpha
Alert: APIDown
Severity: CRITICAL
Description: API endpoint has been unreachable for more than 1 minute
Deployment: blueline
Instance: blueline-alpha-api:3001
Started: 2026-02-08 14:23:15 UTC
This requires immediate attention. Service is completely unavailable.
Expected Action: Immediate investigation (see runbook in alert email)
Warning Alert: High Memory Usage
Trigger: Memory usage >70% for more than 10 minutes
Email Notification:
Subject: [FIRING:1] MemoryUsageWarning BlueLine Alpha
Alert: MemoryUsageWarning
Severity: WARNING
Description: Memory usage has exceeded 70% for more than 10 minutes
Deployment: blueline
Current Usage: 74.5%
Started: 2026-02-08 14:30:00 UTC
Monitor for increasing trend. May need investigation or restart.
Expected Action: Monitor trend, investigate if climbing toward 90%
Accessing Monitoring Tools
Grafana Dashboards
URL: http://localhost:3004 (local) or http://<VPS-IP>:3004 (production)
Login:
- Username: admin
- Password: Check GRAFANA_ADMIN_PASSWORD in the deployment.env file
Navigation:
- Log in to Grafana
- Click "Dashboards" (left sidebar)
- Select dashboard:
- "Sampo Health Monitoring - BlueLine Alpha"
- "Sampo Business Metrics - BlueLine Alpha"
Prometheus (Advanced Users)
URL: http://localhost:9090 (requires SSH tunnel for production)
SSH Tunnel Setup:
ssh -L 9090:localhost:9090 root@<VPS-IP>
# Then access: http://localhost:9090
Use Cases:
- Custom metric queries (PromQL)
- Alert rule inspection
- Target health verification
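A few starter PromQL queries for the use cases above. The HTTP metric names here are assumptions following common naming conventions; substitute whatever the API actually exports (check the /metrics endpoint).

```promql
# Is every scrape target healthy? (1 = up, 0 = down)
up

# Per-route request rate over the last 5 minutes (metric name is an assumption)
sum by (route) (rate(http_requests_total[5m]))

# p99 API response time, assuming a standard duration histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```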
Alertmanager (Advanced Users)
URL: http://localhost:9093 (requires SSH tunnel for production)
SSH Tunnel Setup:
ssh -L 9093:localhost:9093 root@<VPS-IP>
# Then access: http://localhost:9093
Use Cases:
- View active alerts
- Silence alerts temporarily
- View alert history
- Test notification receivers
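These tasks can also be done from the command line with amtool, Alertmanager's official CLI, if it is installed. The commands below are a sketch; run them on the VPS or through the SSH tunnel.

```
# List currently firing alerts
amtool --alertmanager.url=http://localhost:9093 alert query

# Silence a noisy warning for 2 hours during planned maintenance
amtool --alertmanager.url=http://localhost:9093 silence add \
  alertname=MemoryUsageWarning \
  --duration=2h --author=ops --comment="Planned load test"
```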
Alert Grouping & Deduplication
How It Works
Alerts with the same alertname, deployment, and severity are grouped into a single notification to prevent spam.
Example:
If 5 containers restart simultaneously, you receive 1 email listing all 5 restarts, not 5 separate emails.
Grouping Configuration
- Group Wait: Time before sending first notification
- Critical: 10 seconds
- Warning: 1 minute
- Repeat Interval: Time before re-sending if alert still firing
- Critical: 30 minutes
- Warning: 2 hours
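Expressed as Alertmanager routing configuration, the timings above would look roughly like this. Receiver names are placeholders; the grouping labels and timings are the documented ones.

```yaml
# alertmanager.yml routing (excerpt) -- sketch matching the timings above
route:
  group_by: ["alertname", "deployment", "severity"]
  routes:
    - match:
        severity: critical
      group_wait: 10s          # first notification after 10 seconds
      repeat_interval: 30m     # re-send every 30 minutes while firing
      receiver: email-critical # placeholder receiver name
    - match:
        severity: warning
      group_wait: 1m
      repeat_interval: 2h
      receiver: email-warning
```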
Inhibition Rules
What are inhibition rules?
Rules that suppress lower-severity alerts when higher-severity alerts are already firing for the same deployment.
Example:
If both HighMemoryUsage (CRITICAL, >90%) and MemoryUsageWarning (WARNING, >70%) fire simultaneously, only the CRITICAL alert is sent.
Benefit: Reduces alert fatigue by focusing on the most urgent issues.
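In Alertmanager terms, this behaviour corresponds to an `inhibit_rules` entry along these lines. This is a sketch; it assumes the alerts carry a `deployment` label, as the grouping section describes.

```yaml
# alertmanager.yml (excerpt) -- inhibition sketch
inhibit_rules:
  - source_match:
      severity: critical       # while a CRITICAL alert is firing...
    target_match:
      severity: warning        # ...suppress WARNING alerts...
    equal: ["deployment"]      # ...for the same deployment
```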
Common Scenarios
Scenario 1: Deployment Causes Alert
What Happens:
- Deploy new code version
- Container restarts (expected)
- ContainerRestarted WARNING alert fires
- Email notification received
Is This Normal?
Yes - Container restarts during deployments are expected. Alert resolves automatically after 5 minutes of uptime.
Action Required: None (unless container keeps restarting)
Scenario 2: Slow Database Queries Alert
What Happens:
- Database queries slow down (p99 >100ms)
- SlowDatabaseQueries WARNING alert fires after 5 minutes
- Email notification received
Action Required:
- Check Grafana "Database Response Time" panel
- Identify affected queries in API logs
- Investigate missing indexes or query optimization
- If p99 >500ms for >15 minutes, escalate to backend team
Scenario 3: No Submissions Received
What Happens:
- Zero submissions received in last hour
- NoSubmissionsReceived WARNING alert fires
- Email notification received
Action Required:
- Check if it's business hours (alert may be expected overnight)
- Verify external integration status (webhook endpoints)
- Check API logs for submission endpoint errors
- Test submission endpoint manually
- If >4 hours during business hours, escalate to integration team
Metrics Collection
Application Metrics (from NestJS API)
Endpoint: http://localhost:3003/metrics
Metrics Collected:
- HTTP requests (total, duration, by route/method/status)
- Health check status (database, redis, memory, overall)
- Database response time
- Memory usage percentage
- Submission business metrics (received, converted, rejected, reconciled)
Collection Frequency: Every 60 seconds
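If you want to spot-check what the API exposes without Prometheus in the loop, the text exposition format is easy to parse by hand. The sketch below is self-contained; the sample payload and metric names are invented for illustration.

```python
# Minimal sketch: parse a Prometheus text-format /metrics payload.
# The sample payload below is invented; in practice you would first fetch
# the text from the /metrics endpoint with curl or an HTTP client.
import re

def parse_metrics(text: str) -> dict:
    """Map 'name{labels}' -> float value, skipping HELP/TYPE comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="x"} 1.23  (timestamp optional)
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*(?:\{[^}]*\})?)\s+(\S+)', line)
        if m:
            samples[m.group(1)] = float(m.group(2))
    return samples

example = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{route="/health",status="200"} 1042
process_resident_memory_bytes 1.23e+08
"""
parsed = parse_metrics(example)
```

Comment lines (`# HELP`, `# TYPE`) are skipped because only sample lines carry values; a full client would also use them to recover metric types.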
System Metrics (from Node Exporter)
Endpoint: http://localhost:9100/metrics
Metrics Collected:
- CPU usage by mode (user, system, idle)
- Memory statistics (used, available, cached)
- Disk I/O and usage
- Network statistics (bytes sent/received)
- Filesystem usage
Collection Frequency: Every 60 seconds
Database Metrics (from PostgreSQL Exporter)
Endpoint: http://localhost:9187/metrics
Metrics Collected:
- Active database connections
- Deadlock count
- Rows inserted/updated/deleted
- Block reads/writes
- Query statistics
Collection Frequency: Every 60 seconds
Redis Metrics (from Redis Exporter)
Endpoint: http://localhost:9121/metrics
Metrics Collected:
- Connected clients
- Memory usage
- Evicted keys (cache pressure)
- Command counts by type
- Keyspace statistics
Collection Frequency: Every 60 seconds
Performance Impact
Resource Usage (monitoring stack):
- Prometheus: ~200MB memory, <5% CPU
- Alertmanager: ~20MB memory, <1% CPU
- Grafana: ~100MB memory, <2% CPU
- Exporters (3 total): ~65MB memory, <4% CPU
Total Overhead: ~385MB memory, <12% CPU (acceptable for production)
Storage:
- Prometheus data: ~6.5GB/month with 30-day retention
- Alertmanager data: ~10MB (alert state)
Next Steps
- Learn Alert Responses → See Alert Response Guide
- Configure Email Notifications → See Monitoring Configuration
- View Dashboards → Access Grafana at http://localhost:3004
- Explore Metrics → Access Prometheus at http://localhost:9090 (via SSH tunnel)
Related Articles
- Alert Response Guide - How to respond to common alerts
- Monitoring Configuration - SMTP setup and notification channels
- Deployment System Overview - Deployment resilience features
- Diagnostics Dashboard - Additional health monitoring tools
Support
For monitoring system issues or questions:
- Documentation: docs/operations/alerting-guide.md (complete reference)
- Alert Runbooks: Included in alert notification emails
- Technical Support: Contact DevOps team
Last Updated: 2026-02-08
System Version: Week 3+ (Alertmanager + Exporters + Business Dashboards)