Alert Response Guide
This guide provides step-by-step response procedures for common alerts. Each alert includes diagnosis steps, resolution procedures, and escalation criteria.
Critical Alerts (Immediate Action Required)
🔴 APIDown - API Unreachable
Alert Trigger: API endpoint unreachable for more than 1 minute
Impact: Complete service outage, all API requests failing
Immediate Actions:
1. Check container status:

   ```shell
   ssh root@<VPS-IP> docker ps -a --filter "name=blueline-alpha-api"
   ```

2. If container stopped, restart:

   ```shell
   docker start blueline-alpha-api
   docker logs blueline-alpha-api --tail 100
   ```

3. If container running but health check failing:

   ```shell
   curl http://localhost:3003/health
   docker logs blueline-alpha-api --tail 100
   ```

4. Check database connectivity:

   ```shell
   docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha
   ```
Common Causes:
- Out of memory (OOM) crash
- Database connection failure
- Environment variable misconfiguration
- Code deployment error
Escalation: If restart doesn't resolve within 5 minutes, roll back to the previous deployment:

```shell
./scripts/rollback-deployment.sh root@<VPS-IP> previous
```
🔴 DatabaseHealthCheckFailing - Database Unavailable
Alert Trigger: Database health check failing for more than 2 minutes
Impact: Cannot read/write to database, data operations blocked
Immediate Actions:
1. Check database container:

   ```shell
   docker ps --filter "name=blueline-alpha-db"
   docker logs blueline-alpha-db --tail 100
   ```

2. Test database connectivity:

   ```shell
   docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha
   ```

3. Check connection count:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha';"
   ```

4. If too many connections (>80), kill idle ones:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
   ```

5. If database unresponsive, restart:

   ```shell
   docker restart blueline-alpha-db
   sleep 30
   docker logs blueline-alpha-db --tail 50
   ```
Common Causes:
- Connection pool exhausted
- Deadlocks
- Disk space full
- PostgreSQL crash
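Of the causes above, a full disk is the quickest to rule out. A minimal check using standard commands (not specific to this stack):

```shell
# Host-level disk usage; a full filesystem commonly breaks PostgreSQL writes.
df -h /

# Docker's own space accounting (images, containers, volumes, build cache).
# '|| true' keeps the check non-fatal on hosts without the Docker CLI.
docker system df || true
```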
🔴 HighMemoryUsage - Memory Critical
Alert Trigger: Memory usage above 90% for more than 5 minutes
Impact: Risk of OOM crash, service restart imminent
Immediate Actions:
1. Check current memory:

   ```shell
   docker stats blueline-alpha-api --no-stream
   ```

2. If >95%, restart immediately:

   ```shell
   docker restart blueline-alpha-api
   sleep 30
   docker logs blueline-alpha-api --tail 50
   ```

3. Monitor post-restart:

   ```shell
   watch -n 5 'curl -s http://localhost:3003/health | jq .memory.percentUsed'
   ```
Prevention:
- Increase the container memory limit in the `docker-compose` file
- Add memory leak detection to the CI/CD pipeline
- Schedule daily restarts during off-hours (temporary mitigation)
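In the compose file, the limit is set on the service definition. A hypothetical fragment — the service name and the `512m` value are illustrative, not this deployment's actual settings:

```yaml
services:
  blueline-alpha-api:
    mem_limit: 512m          # hard cap; the container is OOM-killed above this
    restart: unless-stopped  # Docker restarts it automatically after a crash
```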
🔴 HighErrorRate - Service Degraded
Alert Trigger: 5xx error rate exceeds 5% for more than 5 minutes
Impact: High failure rate, users experiencing errors
Immediate Actions:
1. Identify error types:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep "500\|501\|502\|503\|504"
   ```

2. Check for common error patterns:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep -i "error" | sort | uniq -c | sort -nr
   ```

3. Check database connectivity:

   ```shell
   curl http://localhost:3003/health | jq .database
   ```

4. If recent deployment (<10 minutes uptime), roll back:

   ```shell
   docker ps --filter "name=blueline-alpha-api" --format "{{.Status}}"
   ./scripts/rollback-deployment.sh root@<VPS-IP> previous
   ```
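The `sort | uniq -c | sort -nr` pipeline used above ranks identical messages by frequency. A self-contained illustration on made-up log lines:

```shell
# Made-up log lines; the pipeline counts duplicates and lists the
# most frequent message first.
printf '%s\n' \
  'ERROR: database connection refused' \
  'ERROR: database connection refused' \
  'ERROR: request timeout' \
  'ERROR: database connection refused' \
  | sort | uniq -c | sort -nr
# The refused-connection line comes out first, with a count of 3.
```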
Common Causes:
- Recent code deployment bug
- Database connection failure
- Missing environment variables
- External service outage
Warning Alerts (Monitor & Investigate)
⚠️ SlowDatabaseQueries - Performance Degraded
Alert Trigger: p99 database query time exceeds 100ms for more than 5 minutes
Impact: Slow user experience, increased API response times
Investigation Actions:
1. Check Grafana for query patterns:

   - Navigate to Grafana → "Sampo Health Monitoring"
   - Panel: "Database Response Time"
   - Identify spike timing

2. View slow queries in PostgreSQL (requires `pg_stat_statements`):

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
   ```

3. Check API logs for slow query warnings:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep "slow query"
   ```

4. Check database connections:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
   ```
Common Causes:
- Missing database indexes
- Full table scans
- Connection pool exhaustion
- Unoptimized queries
Escalation: If p99 >500ms for >15 minutes, engage backend team for query optimization.
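When the investigation points to a missing index, the usual fix is adding one without blocking writes. A sketch with hypothetical table and column names (not this schema's actual objects):

```sql
-- Hypothetical: speed up lookups filtered by status and ordered by time.
-- CONCURRENTLY avoids locking out writes while the index builds.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_submissions_status_created
  ON submissions (status, created_at);
```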
⚠️ MemoryUsageWarning - Memory Pressure
Alert Trigger: Memory usage above 70% for more than 10 minutes
Impact: System may slow down, risk of OOM crashes
Investigation Actions:
1. Check current memory usage:

   ```shell
   curl http://localhost:3003/health | jq .memory
   docker stats blueline-alpha-api --no-stream
   ```

2. Identify memory consumers:

   ```shell
   docker exec blueline-alpha-api ps aux --sort=-%mem | head -10
   ```

3. Check for memory leaks in logs:

   ```shell
   docker logs blueline-alpha-api --tail 200 | grep -i "heap\|memory"
   ```

4. Monitor trend in Grafana:

   - Navigate to "Sampo Health Monitoring" dashboard
   - Check "Memory Usage" panel (last 6 hours)
Mitigation:
- Memory climbing steadily → Likely memory leak, schedule restart during off-hours
- Memory spikes temporarily → Normal operation, no action needed
- Memory >85% → Consider restarting API container
⚠️ HighCPUUsage - Processing Load High
Alert Trigger: CPU usage above 80% for more than 10 minutes
Impact: System slowing down, may affect response times
Investigation Actions:
1. Check CPU usage:

   ```shell
   docker stats blueline-alpha-api --no-stream
   ```

2. Identify CPU-intensive processes:

   ```shell
   docker exec blueline-alpha-api ps aux --sort=-%cpu | head -10
   ```

3. Check for infinite loops or runaway processes:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep -i "timeout\|loop"
   ```

4. Check for high request volume:

   - Grafana → "Sampo Health Monitoring"
   - Panel: "HTTP Request Rate"
Common Causes:
- Traffic spike
- Inefficient algorithm (e.g., nested loops)
- External service timeout causing retries
- CPU-intensive background jobs
⚠️ ContainerRestarted - Service Restarted
Alert Trigger: Container uptime less than 5 minutes
Impact: Potential service disruption, may indicate crash loop
Investigation Actions:
1. Check why the container restarted:

   ```shell
   docker inspect blueline-alpha-api | jq '.[0].State'
   ```

2. Check for OOM (Out of Memory) kill:

   ```shell
   dmesg | grep -i "out of memory"
   docker logs blueline-alpha-api --tail 100 | grep -i "killed\|oom"
   ```

3. Check for crash:

   ```shell
   docker logs blueline-alpha-api --tail 200 | grep -i "error\|fatal\|crash"
   ```

4. If restarts recur, check the restart count:

   ```shell
   docker inspect blueline-alpha-api | jq '.[0].RestartCount'
   ```
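The `jq` paths above pull fields out of `docker inspect`'s JSON array. A self-contained example against a hypothetical, abridged inspect document — exit code 137 (128 + SIGKILL) is the usual signature of an OOM kill:

```shell
# Hypothetical, abridged 'docker inspect' output (not real container state).
cat > /tmp/inspect-sample.json <<'EOF'
[{"State": {"Status": "running", "OOMKilled": true, "ExitCode": 137}, "RestartCount": 4}]
EOF

# Same jq paths as above: was the container OOM-killed, and how many
# times has Docker restarted it?
jq '.[0].State.OOMKilled' /tmp/inspect-sample.json   # prints: true
jq '.[0].RestartCount' /tmp/inspect-sample.json      # prints: 4
```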
Common Causes:
- OOM crash (see HighMemoryUsage alert)
- Unhandled exception
- Health check failure triggering Docker restart
- Manual restart during deployment
Note: Restarts during deployments are expected and resolve automatically after 5 minutes.
Business Alerts
⚠️ NoSubmissionsReceived - Integration Issue
Alert Trigger: Zero submissions received in the last hour
Impact: Potential integration failure, revenue loss
Investigation Actions:
1. Check if it's business hours:

   - Alert may be expected overnight or on weekends
   - Verify historical submission patterns in Grafana

2. Check Grafana business dashboard:

   - Navigate to "Sampo Business Metrics"
   - Panel: "Submissions Timeline"
   - Identify when submissions stopped

3. Check submission endpoint health:

   ```shell
   curl -X POST http://localhost:3003/api/v1/listing-submissions \
     -H "Content-Type: application/json" \
     -d '{"test": "data"}' \
     -w "\nHTTP Status: %{http_code}\n"
   ```

4. Check API logs for submission errors:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep -i "submission"
   ```

5. Check external integrations:

   - Verify webhook endpoints are reachable
   - Check API keys/credentials validity
   - Check the external service's status page
Common Causes:
- External integration failure (e.g., order system down)
- API key expired/rotated
- Network connectivity issue
- Weekend/holiday (normal business cycle)
Escalation: If no submissions for >4 hours during business hours, engage integration team.
⚠️ HighSubmissionRejectionRate - Data Quality Issue
Alert Trigger: Submission rejection rate exceeds 30% for more than 15 minutes
Impact: Data quality issues, customer dissatisfaction
Investigation Actions:
1. Check Grafana business dashboard:

   - Navigate to "Sampo Business Metrics"
   - Panel: "Submission Rejection Rate"
   - Check "Conversion Funnel"

2. Check API logs for rejection reasons:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep "submission.*reject"
   ```

3. Identify common rejection reasons:

   ```shell
   docker logs blueline-alpha-api --tail 1000 | grep "rejection_reason" | \
     jq -r '.rejection_reason' | sort | uniq -c | sort -nr
   ```

4. Check for data quality issues:

   - Missing required fields
   - Invalid formats (dates, emails, phone numbers)
   - Duplicate submissions
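The `jq` aggregation above assumes one JSON object per log line. A self-contained illustration on made-up entries:

```shell
# Made-up structured log entries, one JSON object per line; jq extracts
# the reason, then uniq -c ranks the reasons by frequency.
printf '%s\n' \
  '{"rejection_reason":"missing_email"}' \
  '{"rejection_reason":"invalid_date"}' \
  '{"rejection_reason":"missing_email"}' \
  | jq -r '.rejection_reason' | sort | uniq -c | sort -nr
# missing_email comes out first, with a count of 2.
```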
Common Causes:
- Data quality issue from submission source
- Validation rule changed (recent deployment)
- External data provider sending bad data
- System configuration change
Escalation: If rejection rate >50%, contact business team to review validation rules.
⚠️ SlowSubmissionConversion - Processing Delayed
Alert Trigger: p90 submission conversion time exceeds 60 seconds for more than 15 minutes
Impact: Slow order processing, customer complaints
Investigation Actions:
1. Check Grafana business dashboard:

   - Navigate to "Sampo Business Metrics"
   - Panel: "Submission Conversion Duration"
   - Identify performance degradation timing

2. Check for slow database queries:

   - See "SlowDatabaseQueries" alert response above
   - Correlate timing with database performance

3. Check for external service delays:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep "external.*timeout\|external.*slow"
   ```

4. Check queue processing backlog:

   ```shell
   docker exec redis redis-cli LLEN submission_processing_queue
   ```
Common Causes:
- Slow database queries
- External service timeout (e.g., payment gateway)
- High queue backlog
- Missing database indexes
Infrastructure Alerts
⚠️ PostgreSQLExporterDown - Metrics Loss
Alert Trigger: PostgreSQL exporter unreachable for more than 2 minutes
Impact: Loss of database metrics visibility (monitoring only)
Actions:
1. Check exporter container:

   ```shell
   docker ps --filter "name=blueline-alpha-postgres-exporter"
   docker logs blueline-alpha-postgres-exporter --tail 100
   ```

2. If container stopped, restart:

   ```shell
   docker start blueline-alpha-postgres-exporter
   sleep 10
   docker logs blueline-alpha-postgres-exporter --tail 50
   ```

3. Test exporter metrics endpoint:

   ```shell
   curl http://localhost:9187/metrics | head -20
   ```
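A healthy exporter responds with plain Prometheus text; postgres_exporter's `pg_up` gauge is 1 when the last database scrape succeeded. A self-contained illustration of filtering that format (the sample lines are made up):

```shell
# Made-up exporter output in the Prometheus text exposition format.
printf '%s\n' \
  '# HELP pg_up Whether the last scrape of PostgreSQL succeeded.' \
  'pg_up 1' \
  'pg_stat_activity_count{state="idle"} 12' \
  | grep '^pg_up'
# prints: pg_up 1
```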
Common Causes:
- Container stopped/crashed
- Database connection failure
- Incorrect `DATA_SOURCE_NAME` environment variable
Note: This does not affect application functionality, only metrics collection.
⚠️ RedisExporterDown - Metrics Loss
Alert Trigger: Redis exporter unreachable for more than 2 minutes
Impact: Loss of Redis/queue metrics visibility (monitoring only)
Actions:
1. Check exporter container:

   ```shell
   docker ps --filter "name=blueline-alpha-redis-exporter"
   docker logs blueline-alpha-redis-exporter --tail 100
   ```

2. If container stopped, restart:

   ```shell
   docker start blueline-alpha-redis-exporter
   sleep 10
   docker logs blueline-alpha-redis-exporter --tail 50
   ```

3. Test exporter metrics endpoint:

   ```shell
   curl http://localhost:9121/metrics | head -20
   ```
Common Causes:
- Container stopped/crashed
- Redis connection failure
- Incorrect `REDIS_ADDR` environment variable
Database Alerts (Advanced)
⚠️ TooManyDatabaseConnections - Connection Pressure
Alert Trigger: Active database connections exceed 80 for more than 5 minutes
Impact: Connection pool exhaustion risk, may block new requests
Actions:
1. Check current connection count:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT count(*) AS total_connections, state FROM pg_stat_activity WHERE datname='sampo_blueline_alpha' GROUP BY state;"
   ```

2. Identify connection sources:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT client_addr, count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha' GROUP BY client_addr ORDER BY count DESC;"
   ```

3. Kill idle connections (if >20 idle):

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
   ```
Common Causes:
- Connection pool misconfiguration
- Connection leak (application not closing connections)
- Increased traffic
- Long-running queries holding connections
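For leaks that recur, PostgreSQL can reap stuck sessions itself via a server setting. A sketch — the 10-minute value is illustrative, and this setting only covers sessions idle inside an open transaction, so plain idle sessions still need the manual sweep above:

```sql
-- Terminate sessions that sit idle inside a transaction for >10 minutes.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();  -- apply without a server restart
```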
Alert Escalation Matrix
| Alert | Severity | Initial Response | Escalate If... | Escalate To |
| --- | --- | --- | --- | --- |
| APIDown | CRITICAL | Immediate | Not resolved in 5 minutes | On-call DevOps |
| DatabaseHealthCheckFailing | CRITICAL | Immediate | Not resolved in 5 minutes | Database Admin |
| HighMemoryUsage | CRITICAL | Immediate | Memory climbs again after restart | Backend Team |
| HighErrorRate | CRITICAL | Immediate | Error rate >10% or not resolved in 10 minutes | Backend Team |
| SlowDatabaseQueries | WARNING | Investigate | p99 >500ms for >15 minutes | Backend Team |
| HighSubmissionRejectionRate | WARNING | Investigate | Rejection rate >50% | Business Team |
| NoSubmissionsReceived | WARNING | Investigate | >4 hours during business hours | Integration Team |
Quick Reference: Common Commands
Container Management
```shell
# Check container status
docker ps -a --filter "name=blueline-alpha"

# Restart API container
docker restart blueline-alpha-api

# View API logs
docker logs blueline-alpha-api --tail 100

# Check health endpoint
curl http://localhost:3003/health | jq
```
Database Management
```shell
# Check database connectivity
docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha

# Count active connections
docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha';"

# Kill idle connections
docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
```
Monitoring Access
```shell
# SSH tunnel for Prometheus (port 9090)
ssh -L 9090:localhost:9090 root@<VPS-IP>

# SSH tunnel for Alertmanager (port 9093)
ssh -L 9093:localhost:9093 root@<VPS-IP>

# Access Grafana directly (port 3004): http://<VPS-IP>:3004
```
Understanding Alert Emails
Email Subject Format
```
[FIRING:1] AlertName Deployment
[RESOLVED] AlertName Deployment
```

- `FIRING` - Alert is currently active
- `RESOLVED` - Alert has been resolved
- Number (e.g., `1`) - Count of grouped alerts
Email Body Contents
- Alert Name - Unique alert identifier
- Severity - CRITICAL or WARNING
- Description - What the alert means
- Deployment - Which deployment is affected
- Instance - Specific service/container
- Started At - When alert first fired
- Runbook - Link to response procedures (if available)
Related Articles
- Monitoring & Alerting Overview - System architecture and components
- Monitoring Configuration - SMTP setup and notification channels
- Deployment System Overview - Deployment resilience features
Support
For alert response questions or escalation:
- Complete Runbooks:
docs/operations/alerting-guide.md(50KB reference) - Alert Emails: Include direct links to relevant runbooks
- On-Call Support: Contact DevOps team for critical alerts
Last Updated: 2026-02-08
Alert Count: 25+ alerts across 5 categories