Alert Response Guide

Last updated: February 8, 2026

This guide provides step-by-step response procedures for common alerts. Each alert includes diagnosis steps, resolution procedures, and escalation criteria.


Critical Alerts (Immediate Action Required)

🔴 APIDown - API Unreachable

Alert Trigger: API endpoint unreachable for more than 1 minute

Impact: Complete service outage, all API requests failing

Immediate Actions:

  1. Check container status:

    ssh root@<VPS-IP>
    docker ps -a --filter "name=blueline-alpha-api"
    
  2. If container stopped, restart:

    docker start blueline-alpha-api
    docker logs blueline-alpha-api --tail 100
    
  3. If container running but health check failing:

    curl http://localhost:3003/health
    docker logs blueline-alpha-api --tail 100
    
  4. Check database connectivity:

    docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha
    

Common Causes:

  • Out of memory (OOM) crash
  • Database connection failure
  • Environment variable misconfiguration
  • Code deployment error

Escalation: If the restart doesn't resolve the issue within 5 minutes, roll back to the previous deployment:

./scripts/rollback-deployment.sh root@<VPS-IP> previous

🔴 DatabaseHealthCheckFailing - Database Unavailable

Alert Trigger: Database health check failing for more than 2 minutes

Impact: Cannot read/write to database, data operations blocked

Immediate Actions:

  1. Check database container:

    docker ps --filter "name=blueline-alpha-db"
    docker logs blueline-alpha-db --tail 100
    
  2. Test database connectivity:

    docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha
    
  3. Check connection count:

    docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
      "SELECT count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha';"
    
  4. If there are too many connections (>80), kill idle ones:

    docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
    
  5. If database unresponsive, restart:

    docker restart blueline-alpha-db
    sleep 30
    docker logs blueline-alpha-db --tail 50
    

Common Causes:

  • Connection pool exhausted
  • Deadlocks
  • Disk space full
  • PostgreSQL crash

🔴 HighMemoryUsage - Memory Critical

Alert Trigger: Memory usage above 90% for more than 5 minutes

Impact: Risk of OOM crash, service restart imminent

Immediate Actions:

  1. Check current memory:

    docker stats blueline-alpha-api --no-stream
    
  2. If >95%, restart immediately:

    docker restart blueline-alpha-api
    sleep 30
    docker logs blueline-alpha-api --tail 50
    
  3. Monitor post-restart:

    watch -n 5 'curl -s http://localhost:3003/health | jq .memory.percentUsed'
    

Prevention:

  • Increase container memory limit in docker-compose file
  • Add memory leak detection to CI/CD pipeline
  • Schedule daily restarts during off-hours (temporary mitigation)
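The restart decision in the steps above can be scripted. A minimal POSIX-shell sketch using this alert's thresholds (90% critical, 95% immediate restart); `mem_action` is an illustrative helper, not part of the deployed tooling:

```shell
#!/bin/sh
# Map a memory-usage percentage, as printed by
#   docker stats blueline-alpha-api --no-stream --format '{{.MemPerc}}'
# to an action, using the thresholds from this alert.
mem_action() {
  pct=${1%\%}    # strip a trailing '%' if present
  pct=${pct%.*}  # drop the fractional part for integer comparison
  if [ "$pct" -ge 95 ]; then
    echo restart
  elif [ "$pct" -ge 90 ]; then
    echo monitor-critical
  else
    echo ok
  fi
}

# Example wiring (commented out so the sketch runs without Docker):
# mem_action "$(docker stats blueline-alpha-api --no-stream --format '{{.MemPerc}}')"
```

The `restart` result corresponds to step 2 above; `monitor-critical` means the alert condition holds but an immediate restart is not yet required.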

🔴 HighErrorRate - Service Degraded

Alert Trigger: 5xx error rate exceeds 5% for more than 5 minutes

Impact: High failure rate, users experiencing errors

Immediate Actions:

  1. Identify error types:

    docker logs blueline-alpha-api --tail 500 | grep "500\|501\|502\|503\|504"
    
  2. Check for common error patterns:

    docker logs blueline-alpha-api --tail 500 | grep -i "error" | sort | uniq -c | sort -nr
    
  3. Check database connectivity:

    curl http://localhost:3003/health | jq .database
    
  4. If recent deployment (<10 minutes uptime), rollback:

    docker ps --filter "name=blueline-alpha-api" --format "{{.Status}}"
    ./scripts/rollback-deployment.sh root@<VPS-IP> previous
    

Common Causes:

  • Recent code deployment bug
  • Database connection failure
  • Missing environment variables
  • External service outage

Warning Alerts (Monitor & Investigate)

⚠️ SlowDatabaseQueries - Performance Degraded

Alert Trigger: p99 database query time exceeds 100ms for more than 5 minutes

Impact: Slow user experience, increased API response times

Investigation Actions:

  1. Check Grafana for query patterns:

    • Navigate to Grafana → "Sampo Health Monitoring"
    • Panel: "Database Response Time"
    • Identify spike timing
  2. View slow queries in PostgreSQL (requires pg_stat_statements):

    docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
      "SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
    
  3. Check API logs for slow query warnings:

    docker logs blueline-alpha-api --tail 500 | grep "slow query"
    
  4. Check database connections:

    docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
      "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
    

Common Causes:

  • Missing database indexes
  • Full table scans
  • Connection pool exhaustion
  • Unoptimized queries

Escalation: If p99 >500ms for >15 minutes, engage backend team for query optimization.


⚠️ MemoryUsageWarning - Memory Pressure

Alert Trigger: Memory usage above 70% for more than 10 minutes

Impact: System may slow down, risk of OOM crashes

Investigation Actions:

  1. Check current memory usage:

    curl http://localhost:3003/health | jq .memory
    docker stats blueline-alpha-api --no-stream
    
  2. Identify memory consumers:

    docker exec blueline-alpha-api ps aux --sort=-%mem | head -10
    
  3. Check for memory leaks in logs:

    docker logs blueline-alpha-api --tail 200 | grep -i "heap\|memory"
    
  4. Monitor trend in Grafana:

    • Navigate to "Sampo Health Monitoring" dashboard
    • Check "Memory Usage" panel (last 6 hours)

Mitigation:

  • Memory climbing steadily → Likely memory leak, schedule restart during off-hours
  • Memory spikes temporarily → Normal operation, no action needed
  • Memory >85% → Consider restarting API container
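The mitigation rules above can be expressed as a small helper. A sketch assuming two integer `memory.percentUsed` samples taken a few minutes apart; the 5-point climb threshold is an assumption for illustration, not a documented limit:

```shell
#!/bin/sh
# Classify a memory trend from two percentUsed samples (older sample first),
# following the mitigation table: >85% => restart candidate, a steady climb
# => likely leak, otherwise normal operation.
mem_trend() {
  old=${1%.*} new=${2%.*}   # integer parts only
  if [ "$new" -ge 85 ]; then
    echo consider-restart
  elif [ $((new - old)) -ge 5 ]; then
    echo likely-leak        # assumed threshold: +5 points between samples
  else
    echo normal
  fi
}
```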

⚠️ HighCPUUsage - Processing Load High

Alert Trigger: CPU usage above 80% for more than 10 minutes

Impact: System slowing down, may affect response times

Investigation Actions:

  1. Check CPU usage:

    docker stats blueline-alpha-api --no-stream
    
  2. Identify CPU-intensive processes:

    docker exec blueline-alpha-api ps aux --sort=-%cpu | head -10
    
  3. Check for infinite loops or runaway processes:

    docker logs blueline-alpha-api --tail 500 | grep -i "timeout\|loop"
    
  4. Check for high request volume:

    • Grafana → "Sampo Health Monitoring"
    • Panel: "HTTP Request Rate"

Common Causes:

  • Traffic spike
  • Inefficient algorithm (e.g., nested loops)
  • External service timeout causing retries
  • CPU-intensive background jobs

⚠️ ContainerRestarted - Service Restarted

Alert Trigger: Container uptime less than 5 minutes

Impact: Potential service disruption, may indicate crash loop

Investigation Actions:

  1. Check why container restarted:

    docker inspect blueline-alpha-api | jq '.[0].State'
    
  2. Check for OOM (Out of Memory) kill:

    dmesg | grep -i "out of memory"
    docker logs blueline-alpha-api --tail 100 | grep -i "killed\|oom"
    
  3. Check for crash:

    docker logs blueline-alpha-api --tail 200 | grep -i "error\|fatal\|crash"
    
  4. If recurring restarts, check restart count:

    docker inspect blueline-alpha-api | jq '.[0].RestartCount'
    

Common Causes:

  • OOM crash (see HighMemoryUsage alert)
  • Unhandled exception
  • Health check failure triggering Docker restart
  • Manual restart during deployment

Note: Restarts during deployments are expected; the alert resolves automatically within about 5 minutes.


Business Alerts

⚠️ NoSubmissionsReceived - Integration Issue

Alert Trigger: Zero submissions received in the last hour

Impact: Potential integration failure, revenue loss

Investigation Actions:

  1. Check if it's business hours:

    • Alert may be expected overnight or on weekends
    • Verify historical submission patterns in Grafana
  2. Check Grafana business dashboard:

    • Navigate to "Sampo Business Metrics"
    • Panel: "Submissions Timeline"
    • Identify when submissions stopped
  3. Check submission endpoint health:

    curl -X POST http://localhost:3003/api/v1/listing-submissions \
      -H "Content-Type: application/json" \
      -d '{"test": "data"}' \
      -w "\nHTTP Status: %{http_code}\n"
    
  4. Check API logs for submission errors:

    docker logs blueline-alpha-api --tail 500 | grep -i "submission"
    
  5. Check external integrations:

    • Verify webhook endpoints are reachable
    • Check API keys/credentials validity
    • Test external service status page
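One way to triage the external-integration checks in step 5 is to bucket the HTTP status returned by a reachability probe. A sketch; `classify_code`, its buckets, and the `WEBHOOK_URL` variable are illustrative, not part of the deployed tooling:

```shell
#!/bin/sh
# Bucket the status code from a webhook reachability probe, e.g.:
#   code=$(curl -s -o /dev/null -w '%{http_code}' "$WEBHOOK_URL")
classify_code() {
  case $1 in
    2??)      echo reachable ;;
    401|403)  echo auth-failure ;;   # possible expired/rotated API key
    000)      echo unreachable ;;    # curl could not connect at all
    5??)      echo remote-error ;;   # endpoint up but failing
    *)        echo check-manually ;;
  esac
}
```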

Common Causes:

  • External integration failure (e.g., order system down)
  • API key expired/rotated
  • Network connectivity issue
  • Weekend/holiday (normal business cycle)

Escalation: If no submissions for >4 hours during business hours, engage integration team.


⚠️ HighSubmissionRejectionRate - Data Quality Issue

Alert Trigger: Submission rejection rate exceeds 30% for more than 15 minutes

Impact: Data quality issues, customer dissatisfaction

Investigation Actions:

  1. Check Grafana business dashboard:

    • Navigate to "Sampo Business Metrics"
    • Panel: "Submission Rejection Rate"
    • Check "Conversion Funnel"
  2. Check API logs for rejection reasons:

    docker logs blueline-alpha-api --tail 500 | grep "submission.*reject"
    
  3. Identify common rejection reasons:

    docker logs blueline-alpha-api --tail 1000 | grep "rejection_reason" | \
      jq -r '.rejection_reason' | sort | uniq -c | sort -nr
    
  4. Check for data quality issues:

    • Missing required fields
    • Invalid formats (dates, emails, phone numbers)
    • Duplicate submissions

Common Causes:

  • Data quality issue from submission source
  • Validation rule changed (recent deployment)
  • External data provider sending bad data
  • System configuration change

Escalation: If rejection rate >50%, contact business team to review validation rules.


⚠️ SlowSubmissionConversion - Processing Delayed

Alert Trigger: p90 submission conversion time exceeds 60 seconds for more than 15 minutes

Impact: Slow order processing, customer complaints

Investigation Actions:

  1. Check Grafana business dashboard:

    • Navigate to "Sampo Business Metrics"
    • Panel: "Submission Conversion Duration"
    • Identify performance degradation timing
  2. Check for slow database queries:

    • See "SlowDatabaseQueries" alert response above
    • Correlate timing with database performance
  3. Check for external service delays:

    docker logs blueline-alpha-api --tail 500 | grep "external.*timeout\|external.*slow"
    
  4. Check queue processing backlog:

    docker exec redis redis-cli LLEN submission_processing_queue
    

Common Causes:

  • Slow database queries
  • External service timeout (e.g., payment gateway)
  • High queue backlog
  • Missing database indexes
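The queue check in step 4 can be turned into a pass/fail signal. A sketch where the 100-item threshold is an assumed value, not a documented alert limit:

```shell
#!/bin/sh
# Flag a backlog from the queue length reported by
#   docker exec redis redis-cli LLEN submission_processing_queue
queue_status() {
  if [ "$1" -gt 100 ]; then   # assumed backlog threshold
    echo backlog
  else
    echo ok
  fi
}
```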

Infrastructure Alerts

⚠️ PostgreSQLExporterDown - Metrics Loss

Alert Trigger: PostgreSQL exporter unreachable for more than 2 minutes

Impact: Loss of database metrics visibility (monitoring only)

Actions:

  1. Check exporter container:

    docker ps --filter "name=blueline-alpha-postgres-exporter"
    docker logs blueline-alpha-postgres-exporter --tail 100
    
  2. If container stopped, restart:

    docker start blueline-alpha-postgres-exporter
    sleep 10
    docker logs blueline-alpha-postgres-exporter --tail 50
    
  3. Test exporter metrics endpoint:

    curl http://localhost:9187/metrics | head -20
    

Common Causes:

  • Container stopped/crashed
  • Database connection failure
  • Incorrect DATA_SOURCE_NAME environment variable

Note: This does not affect application functionality, only metrics collection.


⚠️ RedisExporterDown - Metrics Loss

Alert Trigger: Redis exporter unreachable for more than 2 minutes

Impact: Loss of Redis/queue metrics visibility (monitoring only)

Actions:

  1. Check exporter container:

    docker ps --filter "name=blueline-alpha-redis-exporter"
    docker logs blueline-alpha-redis-exporter --tail 100
    
  2. If container stopped, restart:

    docker start blueline-alpha-redis-exporter
    sleep 10
    docker logs blueline-alpha-redis-exporter --tail 50
    
  3. Test exporter metrics endpoint:

    curl http://localhost:9121/metrics | head -20
    

Common Causes:

  • Container stopped/crashed
  • Redis connection failure
  • Incorrect REDIS_ADDR environment variable

Database Alerts (Advanced)

⚠️ TooManyDatabaseConnections - Connection Pressure

Alert Trigger: Active database connections exceed 80 for more than 5 minutes

Impact: Connection pool exhaustion risk, may block new requests

Actions:

  1. Check current connection count:

    docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
      "SELECT count(*) as total_connections, state FROM pg_stat_activity WHERE datname='sampo_blueline_alpha' GROUP BY state;"
    
  2. Identify connection sources:

    docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
      "SELECT client_addr, count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha' GROUP BY client_addr ORDER BY count DESC;"
    
  3. Kill idle connections (if >20 idle):

    docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
    

Common Causes:

  • Connection pool misconfiguration
  • Connection leak (application not closing connections)
  • Increased traffic
  • Long-running queries holding connections

Alert Escalation Matrix

| Alert | Severity | Initial Response | Escalate If... | Escalate To |
| --------------------------- | -------- | ---------------- | --------------------------------------------- | ---------------- |
| APIDown | CRITICAL | Immediate | Not resolved in 5 minutes | On-call DevOps |
| DatabaseHealthCheckFailing | CRITICAL | Immediate | Not resolved in 5 minutes | Database Admin |
| HighMemoryUsage | CRITICAL | Immediate | Memory climbs again after restart | Backend Team |
| HighErrorRate | CRITICAL | Immediate | Error rate >10% or not resolved in 10 minutes | Backend Team |
| SlowDatabaseQueries | WARNING | Investigate | p99 >500ms for >15 minutes | Backend Team |
| HighSubmissionRejectionRate | WARNING | Investigate | Rejection rate >50% | Business Team |
| NoSubmissionsReceived | WARNING | Investigate | >4 hours during business hours | Integration Team |


Quick Reference: Common Commands

Container Management

# Check container status
docker ps -a --filter "name=blueline-alpha"

# Restart API container
docker restart blueline-alpha-api

# View API logs
docker logs blueline-alpha-api --tail 100

# Check health endpoint
curl http://localhost:3003/health | jq

Database Management

# Check database connectivity
docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha

# Count active connections
docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha';"

# Kill idle connections
docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"

Monitoring Access

# SSH tunnel for Prometheus (port 9090)
ssh -L 9090:localhost:9090 root@<VPS-IP>

# SSH tunnel for Alertmanager (port 9093)
ssh -L 9093:localhost:9093 root@<VPS-IP>

# Access Grafana directly (port 3004)
http://<VPS-IP>:3004

Understanding Alert Emails

Email Subject Format

[FIRING:1] AlertName Deployment
[RESOLVED] AlertName Deployment

  • FIRING - Alert is currently active
  • RESOLVED - Alert has been resolved
  • Number (e.g., 1) - Count of grouped alerts
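The subject format above can be parsed mechanically, for example when routing alert mail. A sed-based sketch; `parse_subject` is an illustrative helper:

```shell
#!/bin/sh
# Extract the status and alert name from a subject such as
#   "[FIRING:1] APIDown blueline-alpha"  or  "[RESOLVED] APIDown blueline-alpha"
parse_subject() {
  # status: the uppercase word inside the brackets, before ':' or ']'
  status=$(printf '%s' "$1" | sed -n 's/^\[\([A-Z]*\)[]:].*/\1/p')
  # alert: the first word after the closing bracket
  alert=$(printf '%s' "$1" | sed -n 's/^\[[^]]*\] *\([^ ]*\).*/\1/p')
  echo "$status $alert"
}
```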

Email Body Contents

  • Alert Name - Unique alert identifier
  • Severity - CRITICAL or WARNING
  • Description - What the alert means
  • Deployment - Which deployment is affected
  • Instance - Specific service/container
  • Started At - When alert first fired
  • Runbook - Link to response procedures (if available)


Support

For alert response questions or escalation:

  • Complete Runbooks: docs/operations/alerting-guide.md (50KB reference)
  • Alert Emails: Include direct links to relevant runbooks
  • On-Call Support: Contact DevOps team for critical alerts

Alert Count: 25+ alerts across 5 categories
