Monitoring & Alerting System Overview
What is the Monitoring System?
The Sampo Monitoring & Alerting System provides real-time visibility into application health, performance, and business metrics with automated notifications when issues are detected.
Architecture Components
┌─────────────┐      ┌──────────────┐      ┌──────────────┐
│ NestJS API  │─────▶│  Prometheus  │─────▶│ Alertmanager │
│  /metrics   │      │  (Metrics)   │      │   (Alerts)   │
└─────────────┘      └──────────────┘      └──────────────┘
                            │                      │
                            ▼                      ▼
                     ┌──────────────┐      ┌───────────────┐
                     │   Grafana    │      │     Email     │
                     │ (Dashboards) │      │(Notifications)│
                     └──────────────┘      └───────────────┘
Components:
- Prometheus - Collects metrics every 60 seconds from API and exporters
- Alertmanager - Evaluates alert rules and sends notifications
- Grafana - Visualizes metrics in dashboards (health + business)
- Exporters - Collect system, database, and Redis metrics
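For orientation, the scrape side of this pipeline can be sketched as a Prometheus configuration excerpt. This is an illustrative sketch, not the deployed file: job names and target addresses are assumptions, while the 60-second interval and the exporter ports come from this document.

```yaml
# prometheus.yml (excerpt) -- illustrative sketch, not the deployed config
global:
  scrape_interval: 60s              # matches the documented 60-second collection
scrape_configs:
  - job_name: api                   # NestJS /metrics endpoint
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:3003"]
  - job_name: node                  # Node Exporter (system metrics)
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: postgres              # PostgreSQL Exporter
    static_configs:
      - targets: ["localhost:9187"]
  - job_name: redis                 # Redis Exporter
    static_configs:
      - targets: ["localhost:9121"]
```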
Key Features
Automated Alert Monitoring (25+ Alerts)
Application Health:
- API availability monitoring (1-minute detection)
- Database connection health
- Redis/queue health
- Memory usage tracking
Performance Monitoring:
- Slow database queries (>100ms)
- High API response times (>2s)
- Error rate tracking (5xx errors)
Infrastructure Monitoring:
- Memory usage (warning at 70%, critical at 90%)
- CPU usage (warning at 80%)
- Container restart detection
Business Metrics:
- Submission tracking (received, converted, rejected)
- Conversion rate monitoring
- Performance degradation detection
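As a concrete illustration of one of the 25+ rules, a Prometheus alerting rule for the memory-usage warning might look as follows. The metric name `app_memory_usage_percent` is a hypothetical placeholder; the 70% threshold and the 10-minute window are the documented values.

```yaml
# alert-rules.yml (excerpt) -- sketch of a single rule; metric name is assumed
groups:
  - name: infrastructure
    rules:
      - alert: MemoryUsageWarning
        expr: app_memory_usage_percent > 70   # hypothetical metric name
        for: 10m                              # must hold for 10 minutes
        labels:
          severity: warning
          deployment: blueline
        annotations:
          description: "Memory usage has exceeded 70% for more than 10 minutes"
```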
Notification System
Email Notifications (Active):
- HTML-formatted alerts with severity color coding
- Alert grouping (multiple similar alerts → 1 email)
- Resolution notifications (when issues resolve)
- Critical alerts sent immediately (10s group wait)
- Warning alerts batched (1m group wait)
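The email behaviour above maps onto an Alertmanager receiver. The excerpt below is a hedged sketch: the receiver name, address, and template name are placeholders, while `send_resolved` is the setting that enables the documented resolution notifications.

```yaml
# alertmanager.yml (excerpt) -- receiver sketch with placeholder values
receivers:
  - name: email-ops                                 # hypothetical receiver name
    email_configs:
      - to: ops@example.com                         # placeholder address
        send_resolved: true                         # emails when an alert resolves
        html: '{{ template "email.custom.html" . }}'  # HTML severity formatting
```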
Severity Levels:
| Severity | Response Time | Examples                                    |
| -------- | ------------- | ------------------------------------------- |
| CRITICAL | Minutes       | API down, database failing, high error rate |
| WARNING  | Hours         | Slow queries, high memory, no submissions   |
Ready for Expansion:
- Slack integration (commented out, ready to enable)
- PagerDuty integration (commented out, ready to enable)
Dashboards
Health Monitoring Dashboard (10 Panels)
Access: Grafana → "Sampo Health Monitoring - BlueLine Alpha"
Panels:
- Overall Health Status - Real-time health check status
- Database Response Time - p50, p95, p99 percentiles
- Memory Usage - Percentage with thresholds
- Database Health - Connection status and latency
- Redis Health - Queue system status
- HTTP Request Rate - Requests per second by route
- HTTP Error Rate - 4xx and 5xx errors
- API Response Time - Request duration by percentile
- Health Check Duration - Component check performance
- Component Status - Database, Redis, Memory health
Business Metrics Dashboard (9 Panels)
Access: Grafana → "Sampo Business Metrics - BlueLine Alpha"
Panels:
- Submissions Received - Last 24h total
- Conversion Rate - Percentage gauge (target: >60%)
- Rejection Rate - Percentage gauge (target: <20%)
- Submissions Reconciled - Order processing count
- Submissions Timeline - Received/converted/rejected trends
- Conversion Duration - p50, p90, p99 processing time
- Submissions by Source - Pie chart breakdown
- Conversion Funnel - Visual pipeline (received → converted)
- Submissions by Deployment - Multi-deployment breakdown
Alert Examples
Critical Alert: API Down
Trigger: API unreachable for more than 1 minute
Email Notification:
Subject: [FIRING:1] APIDown BlueLine Alpha
Alert: APIDown
Severity: CRITICAL
Description: API endpoint has been unreachable for more than 1 minute
Deployment: blueline
Instance: blueline-alpha-api:3001
Started: 2026-02-08 14:23:15 UTC
This requires immediate attention. Service is completely unavailable.
Expected Action: Immediate investigation (see runbook in alert email)
Warning Alert: High Memory Usage
Trigger: Memory usage >70% for more than 10 minutes
Email Notification:
Subject: [FIRING:1] MemoryUsageWarning BlueLine Alpha
Alert: MemoryUsageWarning
Severity: WARNING
Description: Memory usage has exceeded 70% for more than 10 minutes
Deployment: blueline
Current Usage: 74.5%
Started: 2026-02-08 14:30:00 UTC
Monitor for increasing trend. May need investigation or restart.
Expected Action: Monitor trend, investigate if climbing toward 90%
Accessing Monitoring Tools
Grafana Dashboards
URL: http://localhost:3004 (local) or http://<VPS-IP>:3004 (production)
Login:
- Username: admin
- Password: Check GRAFANA_ADMIN_PASSWORD in the deployment.env file
Navigation:
- Log in to Grafana
- Click "Dashboards" (left sidebar)
- Select dashboard:
- "Sampo Health Monitoring - BlueLine Alpha"
- "Sampo Business Metrics - BlueLine Alpha"
Prometheus (Advanced Users)
URL: http://localhost:9090 (requires SSH tunnel for production)
SSH Tunnel Setup:
ssh -L 9090:localhost:9090 root@<VPS-IP>
# Then access: http://localhost:9090
Use Cases:
- Custom metric queries (PromQL)
- Alert rule inspection
- Target health verification
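A few starter PromQL queries for the use cases above. The HTTP metric names here are assumptions following common naming conventions; substitute whatever the API actually exports (check the /metrics endpoint).

```promql
# Is every scrape target healthy? (1 = up, 0 = down)
up

# Per-route request rate over the last 5 minutes (metric name is an assumption)
sum by (route) (rate(http_requests_total[5m]))

# p99 API response time, assuming a standard duration histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```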
Alertmanager (Advanced Users)
URL: http://localhost:9093 (requires SSH tunnel for production)
SSH Tunnel Setup:
ssh -L 9093:localhost:9093 root@<VPS-IP>
# Then access: http://localhost:9093
Use Cases:
- View active alerts
- Silence alerts temporarily
- View alert history
- Test notification receivers
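These tasks can also be done from the command line with amtool, Alertmanager's official CLI, if it is installed. The commands below are a sketch; run them on the VPS or through the SSH tunnel.

```
# List currently firing alerts
amtool --alertmanager.url=http://localhost:9093 alert query

# Silence a noisy warning for 2 hours during planned maintenance
amtool --alertmanager.url=http://localhost:9093 silence add \
  alertname=MemoryUsageWarning \
  --duration=2h --author=ops --comment="Planned load test"
```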
Alert Grouping & Deduplication
How It Works
Alerts with the same alertname, deployment, and severity are grouped into a single notification to prevent spam.
Example:
If 5 containers restart simultaneously, you receive 1 email listing all 5 restarts, not 5 separate emails.
Grouping Configuration
- Group Wait: Time before sending first notification
- Critical: 10 seconds
- Warning: 1 minute
- Repeat Interval: Time before re-sending if alert still firing
- Critical: 30 minutes
- Warning: 2 hours
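Expressed as Alertmanager routing configuration, the timings above would look roughly like this. Receiver names are placeholders; the grouping labels and timings are the documented ones.

```yaml
# alertmanager.yml routing (excerpt) -- sketch matching the timings above
route:
  group_by: ["alertname", "deployment", "severity"]
  routes:
    - match:
        severity: critical
      group_wait: 10s          # first notification after 10 seconds
      repeat_interval: 30m     # re-send every 30 minutes while firing
      receiver: email-critical # placeholder receiver name
    - match:
        severity: warning
      group_wait: 1m
      repeat_interval: 2h
      receiver: email-warning
```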
Inhibition Rules
What are inhibition rules?
Rules that suppress lower-severity alerts when higher-severity alerts are already firing for the same deployment.
Example:
If both HighMemoryUsage (CRITICAL, >90%) and MemoryUsageWarning (WARNING, >70%) fire simultaneously, only the CRITICAL alert is sent.
Benefit: Reduces alert fatigue by focusing on the most urgent issues.
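In Alertmanager terms, this behaviour corresponds to an `inhibit_rules` entry along these lines. This is a sketch; it assumes the alerts carry a `deployment` label, as the grouping section describes.

```yaml
# alertmanager.yml (excerpt) -- inhibition sketch
inhibit_rules:
  - source_match:
      severity: critical       # while a CRITICAL alert is firing...
    target_match:
      severity: warning        # ...suppress WARNING alerts...
    equal: ["deployment"]      # ...for the same deployment
```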
Common Scenarios
Scenario 1: Deployment Causes Alert
What Happens:
- Deploy new code version
- Container restarts (expected)
- ContainerRestarted WARNING alert fires
- Email notification received
Is This Normal?
Yes - Container restarts during deployments are expected. Alert resolves automatically after 5 minutes of uptime.
Action Required: None (unless container keeps restarting)
Scenario 2: Slow Database Queries Alert
What Happens:
- Database queries slow down (p99 >100ms)
- SlowDatabaseQueries WARNING alert fires after 5 minutes
- Email notification received
Action Required:
- Check Grafana "Database Response Time" panel
- Identify affected queries in API logs
- Investigate missing indexes or query optimization
- If p99 >500ms for >15 minutes, escalate to backend team
Scenario 3: No Submissions Received
What Happens:
- Zero submissions received in last hour
- NoSubmissionsReceived WARNING alert fires
- Email notification received
Action Required:
- Check if it's business hours (alert may be expected overnight)
- Verify external integration status (webhook endpoints)
- Check API logs for submission endpoint errors
- Test submission endpoint manually
- If >4 hours during business hours, escalate to integration team
Metrics Collection
Application Metrics (from NestJS API)
Endpoint: http://localhost:3003/metrics
Metrics Collected:
- HTTP requests (total, duration, by route/method/status)
- Health check status (database, redis, memory, overall)
- Database response time
- Memory usage percentage
- Submission business metrics (received, converted, rejected, reconciled)
Collection Frequency: Every 60 seconds
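If you want to spot-check what the API exposes without Prometheus in the loop, the text exposition format is easy to parse by hand. The sketch below is self-contained; the sample payload and metric names are invented for illustration.

```python
# Minimal sketch: parse a Prometheus text-format /metrics payload.
# The sample payload below is invented; in practice you would first fetch
# the text from the /metrics endpoint with curl or an HTTP client.
import re

def parse_metrics(text: str) -> dict:
    """Map 'name{labels}' -> float value, skipping HELP/TYPE comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="x"} 1.23  (timestamp optional)
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*(?:\{[^}]*\})?)\s+(\S+)', line)
        if m:
            samples[m.group(1)] = float(m.group(2))
    return samples

example = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{route="/health",status="200"} 1042
process_resident_memory_bytes 1.23e+08
"""
parsed = parse_metrics(example)
```

Comment lines (`# HELP`, `# TYPE`) are skipped because only sample lines carry values; a full client would also use them to recover metric types.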
System Metrics (from Node Exporter)
Endpoint: http://localhost:9100/metrics
Metrics Collected:
- CPU usage by mode (user, system, idle)
- Memory statistics (used, available, cached)
- Disk I/O and usage
- Network statistics (bytes sent/received)
- Filesystem usage
Collection Frequency: Every 60 seconds
Database Metrics (from PostgreSQL Exporter)
Endpoint: http://localhost:9187/metrics
Metrics Collected:
- Active database connections
- Deadlock count
- Rows inserted/updated/deleted
- Block reads/writes
- Query statistics
Collection Frequency: Every 60 seconds
Redis Metrics (from Redis Exporter)
Endpoint: http://localhost:9121/metrics
Metrics Collected:
- Connected clients
- Memory usage
- Evicted keys (cache pressure)
- Command counts by type
- Keyspace statistics
Collection Frequency: Every 60 seconds
Performance Impact
Resource Usage (monitoring stack):
- Prometheus: ~200MB memory, <5% CPU
- Alertmanager: ~20MB memory, <1% CPU
- Grafana: ~100MB memory, <2% CPU
- Exporters (3 total): ~65MB memory, <4% CPU
Total Overhead: ~385MB memory, <12% CPU (acceptable for production)
Storage:
- Prometheus data: ~6.5GB/month with 30-day retention
- Alertmanager data: ~10MB (alert state)
Next Steps
- Learn Alert Responses → See Alert Response Guide
- Configure Email Notifications → See Monitoring Configuration
- View Dashboards → Access Grafana at http://localhost:3004
- Explore Metrics → Access Prometheus at http://localhost:9090 (via SSH tunnel)
Related Articles
- Alert Response Guide - How to respond to common alerts
- Monitoring Configuration - SMTP setup and notification channels
- Deployment System Overview - Deployment resilience features
- Diagnostics Dashboard - Additional health monitoring tools
Support
For monitoring system issues or questions:
- Documentation: docs/operations/alerting-guide.md (complete reference)
- Alert Runbooks: Included in alert notification emails
- Technical Support: Contact DevOps team
Last Updated: 2026-02-08
System Version: Week 3+ (Alertmanager + Exporters + Business Dashboards)