Monitoring & Alerting System Overview

Last updated: February 8, 2026

What is the Monitoring System?

The Sampo Monitoring & Alerting System provides real-time visibility into application health, performance, and business metrics with automated notifications when issues are detected.

Architecture Components

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  NestJS API │────▶│  Prometheus  │────▶│ Alertmanager│
│  /metrics   │     │  (Metrics)   │     │  (Alerts)   │
└─────────────┘     └──────────────┘     └─────────────┘
      │                     │                     │
      │                     ▼                     ▼
      │              ┌─────────────┐      ┌───────────────┐
      │              │   Grafana   │      │     Email     │
      └─────────────▶│ (Dashboards)│      │(Notifications)│
                     └─────────────┘      └───────────────┘

Components:

  • Prometheus - Collects metrics every 60 seconds from API and exporters
  • Alertmanager - Evaluates alert rules and sends notifications
  • Grafana - Visualizes metrics in dashboards (health + business)
  • Exporters - Collect system, database, and Redis metrics
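
The collection setup above can be sketched as a Prometheus scrape config. The 60-second interval and the exporter/API endpoints match the ones described later in this article; the job names and `localhost` targets are assumptions, not this deployment's actual config:

```yaml
# prometheus.yml (sketch) -- job names and targets are assumptions; the 60s
# interval and the endpoint ports mirror the "Metrics Collection" section.
global:
  scrape_interval: 60s        # pull every target once a minute

scrape_configs:
  - job_name: api             # NestJS /metrics endpoint
    static_configs:
      - targets: ['localhost:3003']
  - job_name: node            # Node Exporter (system metrics)
    static_configs:
      - targets: ['localhost:9100']
  - job_name: postgres        # PostgreSQL Exporter
    static_configs:
      - targets: ['localhost:9187']
  - job_name: redis           # Redis Exporter
    static_configs:
      - targets: ['localhost:9121']
```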

Key Features

Automated Alert Monitoring (25+ Alerts)

Application Health:

  • API availability monitoring (1-minute detection)
  • Database connection health
  • Redis/queue health
  • Memory usage tracking

Performance Monitoring:

  • Slow database queries (>100ms)
  • High API response times (>2s)
  • Error rate tracking (5xx errors)

Infrastructure Monitoring:

  • Memory usage (warning at 70%, critical at 90%)
  • CPU usage (warning at 80%)
  • Container restart detection

Business Metrics:

  • Submission tracking (received, converted, rejected)
  • Conversion rate monitoring
  • Performance degradation detection

Notification System

Email Notifications (Active):

  • HTML-formatted alerts with severity color coding
  • Alert grouping (multiple similar alerts → 1 email)
  • Resolution notifications (when issues resolve)
  • Critical alerts sent immediately (10s group wait)
  • Warning alerts batched (1m group wait)

Severity Levels:

| Severity | Response Time | Examples                                    |
| -------- | ------------- | ------------------------------------------- |
| CRITICAL | Minutes       | API down, database failing, high error rate |
| WARNING  | Hours         | Slow queries, high memory, no submissions   |

Ready for Expansion:

  • Slack integration (commented out, ready to enable)
  • PagerDuty integration (commented out, ready to enable)

Dashboards

Health Monitoring Dashboard (10 Panels)

Access: Grafana → "Sampo Health Monitoring - BlueLine Alpha"

Panels:

  1. Overall Health Status - Real-time health check status
  2. Database Response Time - p50, p95, p99 percentiles
  3. Memory Usage - Percentage with thresholds
  4. Database Health - Connection status and latency
  5. Redis Health - Queue system status
  6. HTTP Request Rate - Requests per second by route
  7. HTTP Error Rate - 4xx and 5xx errors
  8. API Response Time - Request duration by percentile
  9. Health Check Duration - Component check performance
  10. Component Status - Database, Redis, Memory health

Business Metrics Dashboard (9 Panels)

Access: Grafana → "Sampo Business Metrics - BlueLine Alpha"

Panels:

  1. Submissions Received - Last 24h total
  2. Conversion Rate - Percentage gauge (target: >60%)
  3. Rejection Rate - Percentage gauge (target: <20%)
  4. Submissions Reconciled - Order processing count
  5. Submissions Timeline - Received/converted/rejected trends
  6. Conversion Duration - p50, p90, p99 processing time
  7. Submissions by Source - Pie chart breakdown
  8. Conversion Funnel - Visual pipeline (received → converted)
  9. Submissions by Deployment - Multi-deployment breakdown

Alert Examples

Critical Alert: API Down

Trigger: API unreachable for more than 1 minute

Email Notification:

Subject: [FIRING:1] APIDown BlueLine Alpha

Alert: APIDown
Severity: CRITICAL
Description: API endpoint has been unreachable for more than 1 minute
Deployment: blueline
Instance: blueline-alpha-api:3001
Started: 2026-02-08 14:23:15 UTC

This requires immediate attention. Service is completely unavailable.

Expected Action: Immediate investigation (see runbook in alert email)

Warning Alert: High Memory Usage

Trigger: Memory usage >70% for more than 10 minutes

Email Notification:

Subject: [FIRING:1] MemoryUsageWarning BlueLine Alpha

Alert: MemoryUsageWarning
Severity: WARNING
Description: Memory usage has exceeded 70% for more than 10 minutes
Deployment: blueline
Current Usage: 74.5%
Started: 2026-02-08 14:30:00 UTC

Monitor for increasing trend. May need investigation or restart.

Expected Action: Monitor trend, investigate if climbing toward 90%
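
The two alerts above can be expressed as Prometheus alerting rules. This is a sketch: the alert names, durations, and thresholds come from the examples in this article, but the metric names (`up{job="api"}`, `memory_usage_percent`) and label set are assumptions about this deployment:

```yaml
# Sketch of APIDown and MemoryUsageWarning as Prometheus rules.
groups:
  - name: sampo-alerts
    rules:
      - alert: APIDown
        expr: up{job="api"} == 0          # assumed job label
        for: 1m                           # fire after 1 minute unreachable
        labels:
          severity: critical
        annotations:
          description: API endpoint has been unreachable for more than 1 minute
      - alert: MemoryUsageWarning
        expr: memory_usage_percent > 70   # assumed metric name
        for: 10m                          # fire after 10 minutes above 70%
        labels:
          severity: warning
        annotations:
          description: Memory usage has exceeded 70% for more than 10 minutes
```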


Accessing Monitoring Tools

Grafana Dashboards

URL: http://localhost:3004 (local) or http://<VPS-IP>:3004 (production)

Login:

  • Username: admin
  • Password: Check GRAFANA_ADMIN_PASSWORD in deployment .env file

Navigation:

  1. Log in to Grafana
  2. Click "Dashboards" (left sidebar)
  3. Select dashboard:
    • "Sampo Health Monitoring - BlueLine Alpha"
    • "Sampo Business Metrics - BlueLine Alpha"

Prometheus (Advanced Users)

URL: http://localhost:9090 (requires SSH tunnel for production)

SSH Tunnel Setup:

ssh -L 9090:localhost:9090 root@<VPS-IP>
# Then access: http://localhost:9090

Use Cases:

  • Custom metric queries (PromQL)
  • Alert rule inspection
  • Target health verification
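
A couple of starter queries for the Prometheus UI. Both `up` and `ALERTS` are built-in series that Prometheus maintains for every target and alert rule, so these work regardless of this deployment's metric names:

```promql
# Which scrape targets are healthy? (1 = up, 0 = down)
up

# Alerts currently firing, with their labels
ALERTS{alertstate="firing"}
```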

Alertmanager (Advanced Users)

URL: http://localhost:9093 (requires SSH tunnel for production)

SSH Tunnel Setup:

ssh -L 9093:localhost:9093 root@<VPS-IP>
# Then access: http://localhost:9093

Use Cases:

  • View active alerts
  • Silence alerts temporarily
  • View alert history
  • Test notification receivers

Alert Grouping & Deduplication

How It Works

Alerts with the same alertname, deployment, and severity are grouped into a single notification to prevent spam.

Example:

If 5 containers restart simultaneously, you receive 1 email listing all 5 restarts, not 5 separate emails.

Grouping Configuration

  • Group Wait: Time before sending first notification
    • Critical: 10 seconds
    • Warning: 1 minute
  • Repeat Interval: Time before re-sending if alert still firing
    • Critical: 30 minutes
    • Warning: 2 hours
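
The grouping behavior above maps onto the `route` section of `alertmanager.yml`. A sketch: the `group_by` keys and timings mirror the values listed in this article, while the receiver names are assumptions:

```yaml
# route section of alertmanager.yml (sketch; receiver names are assumed).
route:
  group_by: ['alertname', 'deployment', 'severity']
  receiver: email-warning
  group_wait: 1m            # warnings: batch for 1 minute before first email
  repeat_interval: 2h       # warnings: re-send at most every 2 hours
  routes:
    - matchers:
        - severity="critical"
      receiver: email-critical
      group_wait: 10s       # criticals: first email after 10 seconds
      repeat_interval: 30m  # criticals: re-send every 30 min while firing
```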

Inhibition Rules

What are inhibition rules?

Rules that suppress lower-severity alerts when higher-severity alerts are already firing for the same deployment.

Example:

If both HighMemoryUsage (CRITICAL, >90%) and MemoryUsageWarning (WARNING, >70%) fire simultaneously, only the CRITICAL alert is sent.

Benefit: Reduces alert fatigue by focusing on the most urgent issues.
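
In Alertmanager config this suppression is an `inhibit_rules` entry. A sketch under the assumption that severity is carried in a `severity` label and deployment in a `deployment` label:

```yaml
# inhibit_rules section of alertmanager.yml (sketch; label names assumed).
inhibit_rules:
  - source_matchers:
      - severity="critical"   # while a critical alert is firing...
    target_matchers:
      - severity="warning"    # ...matching warnings are suppressed
    equal: ['deployment']     # only when both concern the same deployment
```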


Common Scenarios

Scenario 1: Deployment Causes Alert

What Happens:

  1. Deploy new code version
  2. Container restarts (expected)
  3. ContainerRestarted WARNING alert fires
  4. Email notification received

Is This Normal?

Yes - Container restarts during deployments are expected. Alert resolves automatically after 5 minutes of uptime.

Action Required: None (unless container keeps restarting)


Scenario 2: Slow Database Queries Alert

What Happens:

  1. Database queries slow down (p99 >100ms)
  2. SlowDatabaseQueries WARNING alert fires after 5 minutes
  3. Email notification received

Action Required:

  1. Check Grafana "Database Response Time" panel
  2. Identify affected queries in API logs
  3. Investigate missing indexes or query optimization
  4. If p99 >500ms for >15 minutes, escalate to backend team

Scenario 3: No Submissions Received

What Happens:

  1. Zero submissions received in last hour
  2. NoSubmissionsReceived WARNING alert fires
  3. Email notification received

Action Required:

  1. Check if it's business hours (alert may be expected overnight)
  2. Verify external integration status (webhook endpoints)
  3. Check API logs for submission endpoint errors
  4. Test submission endpoint manually
  5. If >4 hours during business hours, escalate to integration team

Metrics Collection

Application Metrics (from NestJS API)

Endpoint: http://localhost:3003/metrics

Metrics Collected:

  • HTTP requests (total, duration, by route/method/status)
  • Health check status (database, redis, memory, overall)
  • Database response time
  • Memory usage percentage
  • Submission business metrics (received, converted, rejected, reconciled)

Collection Frequency: Every 60 seconds
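
The HTTP counters and duration histograms above support queries like the following. The metric names (`http_requests_total`, `http_request_duration_seconds_bucket`) and the `status`/`route` labels are assumptions about how the API exports them, not confirmed names:

```promql
# 5xx error rate over the last 5 minutes, as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 API response time by route, assuming a duration histogram
histogram_quantile(0.95,
  sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))
```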

System Metrics (from Node Exporter)

Endpoint: http://localhost:9100/metrics

Metrics Collected:

  • CPU usage by mode (user, system, idle)
  • Memory statistics (used, available, cached)
  • Disk I/O and usage
  • Network statistics (bytes sent/received)
  • Filesystem usage

Collection Frequency: Every 60 seconds

Database Metrics (from PostgreSQL Exporter)

Endpoint: http://localhost:9187/metrics

Metrics Collected:

  • Active database connections
  • Deadlock count
  • Rows inserted/updated/deleted
  • Block reads/writes
  • Query statistics

Collection Frequency: Every 60 seconds

Redis Metrics (from Redis Exporter)

Endpoint: http://localhost:9121/metrics

Metrics Collected:

  • Connected clients
  • Memory usage
  • Evicted keys (cache pressure)
  • Command counts by type
  • Keyspace statistics

Collection Frequency: Every 60 seconds


Performance Impact

Resource Usage (monitoring stack):

  • Prometheus: ~200MB memory, <5% CPU
  • Alertmanager: ~20MB memory, <1% CPU
  • Grafana: ~100MB memory, <2% CPU
  • Exporters (3 total): ~65MB memory, <4% CPU

Total Overhead: ~385MB memory, <12% CPU (acceptable for production)

Storage:

  • Prometheus data: ~6.5GB/month with 30-day retention
  • Alertmanager data: ~10MB (alert state)
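
As a back-of-envelope check of the ~6.5GB/month figure: with 60-second scrapes, that volume is consistent with roughly 90,000 active series at Prometheus's rule-of-thumb ~1.7 bytes per compressed sample. Both inputs are assumptions for illustration (histogram buckets across many routes multiply series counts quickly), not measured values:

```shell
# Rough Prometheus storage estimate (all inputs are assumptions):
# ~90,000 active series scraped every 60s, ~1.7 bytes per compressed sample.
series=90000
samples_per_day=$(( series * 86400 / 60 ))      # samples ingested per day
bytes_per_day=$(( samples_per_day * 17 / 10 ))  # ~1.7 bytes/sample
echo "$(( bytes_per_day * 30 / 1000000 )) MB per 30-day retention window"
```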

Next Steps

  1. Learn Alert Responses → See Alert Response Guide
  2. Configure Email Notifications → See Monitoring Configuration
  3. View Dashboards → Access Grafana at http://localhost:3004
  4. Explore Metrics → Access Prometheus at http://localhost:9090 (via SSH tunnel)

Support

For monitoring system issues or questions:

  • Documentation: docs/operations/alerting-guide.md (complete reference)
  • Alert Runbooks: Included in alert notification emails
  • Technical Support: Contact DevOps team

System Version: Week 3+ (Alertmanager + Exporters + Business Dashboards)
