Alert Response Guide
This guide provides step-by-step response procedures for common alerts. Each alert includes diagnosis steps, resolution procedures, and escalation criteria.
Critical Alerts (Immediate Action Required)
🔴 APIDown - API Unreachable
Alert Trigger: API endpoint unreachable for more than 1 minute
Impact: Complete service outage, all API requests failing
Immediate Actions:
1. Check container status:

   ```shell
   ssh root@<VPS-IP> docker ps -a --filter "name=blueline-alpha-api"
   ```

2. If container stopped, restart:

   ```shell
   docker start blueline-alpha-api
   docker logs blueline-alpha-api --tail 100
   ```

3. If container running but health check failing:

   ```shell
   curl http://localhost:3003/health
   docker logs blueline-alpha-api --tail 100
   ```

4. Check database connectivity:

   ```shell
   docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha
   ```
Common Causes:
- Out of memory (OOM) crash
- Database connection failure
- Environment variable misconfiguration
- Code deployment error
Escalation: If restart doesn't resolve within 5 minutes, roll back to the previous deployment:

```shell
./scripts/rollback-deployment.sh root@<VPS-IP> previous
```
🔴 DatabaseHealthCheckFailing - Database Unavailable
Alert Trigger: Database health check failing for more than 2 minutes
Impact: Cannot read/write to database, data operations blocked
Immediate Actions:
1. Check database container:

   ```shell
   docker ps --filter "name=blueline-alpha-db"
   docker logs blueline-alpha-db --tail 100
   ```

2. Test database connectivity:

   ```shell
   docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha
   ```

3. Check connection count:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha';"
   ```

4. If too many connections (>80), kill idle ones:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
   ```

5. If database unresponsive, restart:

   ```shell
   docker restart blueline-alpha-db
   sleep 30
   docker logs blueline-alpha-db --tail 50
   ```
Common Causes:
- Connection pool exhausted
- Deadlocks
- Disk space full
- PostgreSQL crash
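Of the causes above, a full disk is the quickest to rule out. A minimal check using standard commands (not specific to this stack):

```shell
# Host-level disk usage; a full filesystem commonly breaks PostgreSQL writes.
df -h /

# Docker's own space accounting (images, containers, volumes, build cache).
# '|| true' keeps the check non-fatal on hosts without the Docker CLI.
docker system df || true
```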
🔴 HighMemoryUsage - Memory Critical
Alert Trigger: Memory usage above 90% for more than 5 minutes
Impact: Risk of OOM crash, service restart imminent
Immediate Actions:
1. Check current memory:

   ```shell
   docker stats blueline-alpha-api --no-stream
   ```

2. If >95%, restart immediately:

   ```shell
   docker restart blueline-alpha-api
   sleep 30
   docker logs blueline-alpha-api --tail 50
   ```

3. Monitor post-restart:

   ```shell
   watch -n 5 'curl -s http://localhost:3003/health | jq .memory.percentUsed'
   ```
Prevention:
- Increase the container memory limit in the `docker-compose` file
- Add memory leak detection to the CI/CD pipeline
- Schedule daily restarts during off-hours (temporary mitigation)
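In the compose file, the limit is set on the service definition. A hypothetical fragment — the service name and the `512m` value are illustrative, not this deployment's actual settings:

```yaml
services:
  blueline-alpha-api:
    mem_limit: 512m          # hard cap; the container is OOM-killed above this
    restart: unless-stopped  # Docker restarts it automatically after a crash
```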
🔴 HighErrorRate - Service Degraded
Alert Trigger: 5xx error rate exceeds 5% for more than 5 minutes
Impact: High failure rate, users experiencing errors
Immediate Actions:
1. Identify error types:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep "500\|501\|502\|503\|504"
   ```

2. Check for common error patterns:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep -i "error" | sort | uniq -c | sort -nr
   ```

3. Check database connectivity:

   ```shell
   curl http://localhost:3003/health | jq .database
   ```

4. If recent deployment (<10 minutes uptime), roll back:

   ```shell
   docker ps --filter "name=blueline-alpha-api" --format "{{.Status}}"
   ./scripts/rollback-deployment.sh root@<VPS-IP> previous
   ```
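The `sort | uniq -c | sort -nr` pipeline used above ranks identical messages by frequency. A self-contained illustration on made-up log lines:

```shell
# Made-up log lines; the pipeline counts duplicates and lists the
# most frequent message first.
printf '%s\n' \
  'ERROR: database connection refused' \
  'ERROR: database connection refused' \
  'ERROR: request timeout' \
  'ERROR: database connection refused' \
  | sort | uniq -c | sort -nr
# The refused-connection line comes out first, with a count of 3.
```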
Common Causes:
- Recent code deployment bug
- Database connection failure
- Missing environment variables
- External service outage
Warning Alerts (Monitor & Investigate)
⚠️ SlowDatabaseQueries - Performance Degraded
Alert Trigger: p99 database query time exceeds 100ms for more than 5 minutes
Impact: Slow user experience, increased API response times
Investigation Actions:
1. Check Grafana for query patterns:

   - Navigate to Grafana → "Sampo Health Monitoring"
   - Panel: "Database Response Time"
   - Identify spike timing

2. View slow queries in PostgreSQL (requires `pg_stat_statements`):

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
   ```

3. Check API logs for slow query warnings:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep "slow query"
   ```

4. Check database connections:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
   ```
Common Causes:
- Missing database indexes
- Full table scans
- Connection pool exhaustion
- Unoptimized queries
Escalation: If p99 >500ms for >15 minutes, engage backend team for query optimization.
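When the investigation points to a missing index, the usual fix is adding one without blocking writes. A sketch with hypothetical table and column names (not this schema's actual objects):

```sql
-- Hypothetical: speed up lookups filtered by status and ordered by time.
-- CONCURRENTLY avoids locking out writes while the index builds.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_submissions_status_created
  ON submissions (status, created_at);
```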
⚠️ MemoryUsageWarning - Memory Pressure
Alert Trigger: Memory usage above 70% for more than 10 minutes
Impact: System may slow down, risk of OOM crashes
Investigation Actions:
1. Check current memory usage:

   ```shell
   curl http://localhost:3003/health | jq .memory
   docker stats blueline-alpha-api --no-stream
   ```

2. Identify memory consumers:

   ```shell
   docker exec blueline-alpha-api ps aux --sort=-%mem | head -10
   ```

3. Check for memory leaks in logs:

   ```shell
   docker logs blueline-alpha-api --tail 200 | grep -i "heap\|memory"
   ```

4. Monitor trend in Grafana:

   - Navigate to "Sampo Health Monitoring" dashboard
   - Check "Memory Usage" panel (last 6 hours)
Mitigation:
- Memory climbing steadily → Likely memory leak, schedule restart during off-hours
- Memory spikes temporarily → Normal operation, no action needed
- Memory >85% → Consider restarting API container
⚠️ HighCPUUsage - Processing Load High
Alert Trigger: CPU usage above 80% for more than 10 minutes
Impact: System slowing down, may affect response times
Investigation Actions:
1. Check CPU usage:

   ```shell
   docker stats blueline-alpha-api --no-stream
   ```

2. Identify CPU-intensive processes:

   ```shell
   docker exec blueline-alpha-api ps aux --sort=-%cpu | head -10
   ```

3. Check for infinite loops or runaway processes:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep -i "timeout\|loop"
   ```

4. Check for high request volume:

   - Grafana → "Sampo Health Monitoring"
   - Panel: "HTTP Request Rate"
Common Causes:
- Traffic spike
- Inefficient algorithm (e.g., nested loops)
- External service timeout causing retries
- CPU-intensive background jobs
⚠️ ContainerRestarted - Service Restarted
Alert Trigger: Container uptime less than 5 minutes
Impact: Potential service disruption, may indicate crash loop
Investigation Actions:
1. Check why the container restarted:

   ```shell
   docker inspect blueline-alpha-api | jq '.[0].State'
   ```

2. Check for OOM (Out of Memory) kill:

   ```shell
   dmesg | grep -i "out of memory"
   docker logs blueline-alpha-api --tail 100 | grep -i "killed\|oom"
   ```

3. Check for crash:

   ```shell
   docker logs blueline-alpha-api --tail 200 | grep -i "error\|fatal\|crash"
   ```

4. If restarts recur, check the restart count:

   ```shell
   docker inspect blueline-alpha-api | jq '.[0].RestartCount'
   ```
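The `jq` paths above pull fields out of `docker inspect`'s JSON array. A self-contained example against a hypothetical, abridged inspect document — exit code 137 (128 + SIGKILL) is the usual signature of an OOM kill:

```shell
# Hypothetical, abridged 'docker inspect' output (not real container state).
cat > /tmp/inspect-sample.json <<'EOF'
[{"State": {"Status": "running", "OOMKilled": true, "ExitCode": 137}, "RestartCount": 4}]
EOF

# Same jq paths as above: was the container OOM-killed, and how many
# times has Docker restarted it?
jq '.[0].State.OOMKilled' /tmp/inspect-sample.json   # prints: true
jq '.[0].RestartCount' /tmp/inspect-sample.json      # prints: 4
```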
Common Causes:
- OOM crash (see HighMemoryUsage alert)
- Unhandled exception
- Health check failure triggering Docker restart
- Manual restart during deployment
Note: Restarts during deployments are expected and resolve automatically after 5 minutes.
Business Alerts
⚠️ NoSubmissionsReceived - Integration Issue
Alert Trigger: Zero submissions received in the last hour
Impact: Potential integration failure, revenue loss
Investigation Actions:
1. Check if it's business hours:

   - Alert may be expected overnight or on weekends
   - Verify historical submission patterns in Grafana

2. Check Grafana business dashboard:

   - Navigate to "Sampo Business Metrics"
   - Panel: "Submissions Timeline"
   - Identify when submissions stopped

3. Check submission endpoint health:

   ```shell
   curl -X POST http://localhost:3003/api/v1/listing-submissions \
     -H "Content-Type: application/json" \
     -d '{"test": "data"}' \
     -w "\nHTTP Status: %{http_code}\n"
   ```

4. Check API logs for submission errors:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep -i "submission"
   ```

5. Check external integrations:

   - Verify webhook endpoints are reachable
   - Check API keys/credentials validity
   - Check the external service's status page
Common Causes:
- External integration failure (e.g., order system down)
- API key expired/rotated
- Network connectivity issue
- Weekend/holiday (normal business cycle)
Escalation: If no submissions for >4 hours during business hours, engage integration team.
⚠️ HighSubmissionRejectionRate - Data Quality Issue
Alert Trigger: Submission rejection rate exceeds 30% for more than 15 minutes
Impact: Data quality issues, customer dissatisfaction
Investigation Actions:
1. Check Grafana business dashboard:

   - Navigate to "Sampo Business Metrics"
   - Panel: "Submission Rejection Rate"
   - Check "Conversion Funnel"

2. Check API logs for rejection reasons:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep "submission.*reject"
   ```

3. Identify common rejection reasons:

   ```shell
   docker logs blueline-alpha-api --tail 1000 | grep "rejection_reason" | \
     jq -r '.rejection_reason' | sort | uniq -c | sort -nr
   ```

4. Check for data quality issues:

   - Missing required fields
   - Invalid formats (dates, emails, phone numbers)
   - Duplicate submissions
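The `jq` aggregation above assumes one JSON object per log line. A self-contained illustration on made-up entries:

```shell
# Made-up structured log entries, one JSON object per line; jq extracts
# the reason, then uniq -c ranks the reasons by frequency.
printf '%s\n' \
  '{"rejection_reason":"missing_email"}' \
  '{"rejection_reason":"invalid_date"}' \
  '{"rejection_reason":"missing_email"}' \
  | jq -r '.rejection_reason' | sort | uniq -c | sort -nr
# missing_email comes out first, with a count of 2.
```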
Common Causes:
- Data quality issue from submission source
- Validation rule changed (recent deployment)
- External data provider sending bad data
- System configuration change
Escalation: If rejection rate >50%, contact business team to review validation rules.
⚠️ SlowSubmissionConversion - Processing Delayed
Alert Trigger: p90 submission conversion time exceeds 60 seconds for more than 15 minutes
Impact: Slow order processing, customer complaints
Investigation Actions:
1. Check Grafana business dashboard:

   - Navigate to "Sampo Business Metrics"
   - Panel: "Submission Conversion Duration"
   - Identify performance degradation timing

2. Check for slow database queries:

   - See "SlowDatabaseQueries" alert response above
   - Correlate timing with database performance

3. Check for external service delays:

   ```shell
   docker logs blueline-alpha-api --tail 500 | grep "external.*timeout\|external.*slow"
   ```

4. Check queue processing backlog:

   ```shell
   docker exec redis redis-cli LLEN submission_processing_queue
   ```
Common Causes:
- Slow database queries
- External service timeout (e.g., payment gateway)
- High queue backlog
- Missing database indexes
Infrastructure Alerts
⚠️ PostgreSQLExporterDown - Metrics Loss
Alert Trigger: PostgreSQL exporter unreachable for more than 2 minutes
Impact: Loss of database metrics visibility (monitoring only)
Actions:
1. Check exporter container:

   ```shell
   docker ps --filter "name=blueline-alpha-postgres-exporter"
   docker logs blueline-alpha-postgres-exporter --tail 100
   ```

2. If container stopped, restart:

   ```shell
   docker start blueline-alpha-postgres-exporter
   sleep 10
   docker logs blueline-alpha-postgres-exporter --tail 50
   ```

3. Test exporter metrics endpoint:

   ```shell
   curl http://localhost:9187/metrics | head -20
   ```
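A healthy exporter responds with plain Prometheus text; postgres_exporter's `pg_up` gauge is 1 when the last database scrape succeeded. A self-contained illustration of filtering that format (the sample lines are made up):

```shell
# Made-up exporter output in the Prometheus text exposition format.
printf '%s\n' \
  '# HELP pg_up Whether the last scrape of PostgreSQL succeeded.' \
  'pg_up 1' \
  'pg_stat_activity_count{state="idle"} 12' \
  | grep '^pg_up'
# prints: pg_up 1
```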
Common Causes:
- Container stopped/crashed
- Database connection failure
- Incorrect `DATA_SOURCE_NAME` environment variable
Note: This does not affect application functionality, only metrics collection.
⚠️ RedisExporterDown - Metrics Loss
Alert Trigger: Redis exporter unreachable for more than 2 minutes
Impact: Loss of Redis/queue metrics visibility (monitoring only)
Actions:
1. Check exporter container:

   ```shell
   docker ps --filter "name=blueline-alpha-redis-exporter"
   docker logs blueline-alpha-redis-exporter --tail 100
   ```

2. If container stopped, restart:

   ```shell
   docker start blueline-alpha-redis-exporter
   sleep 10
   docker logs blueline-alpha-redis-exporter --tail 50
   ```

3. Test exporter metrics endpoint:

   ```shell
   curl http://localhost:9121/metrics | head -20
   ```
Common Causes:
- Container stopped/crashed
- Redis connection failure
- Incorrect `REDIS_ADDR` environment variable
Database Alerts (Advanced)
⚠️ TooManyDatabaseConnections - Connection Pressure
Alert Trigger: Active database connections exceed 80 for more than 5 minutes
Impact: Connection pool exhaustion risk, may block new requests
Actions:
1. Check current connection count:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT count(*) AS total_connections, state FROM pg_stat_activity WHERE datname='sampo_blueline_alpha' GROUP BY state;"
   ```

2. Identify connection sources:

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT client_addr, count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha' GROUP BY client_addr ORDER BY count DESC;"
   ```

3. Kill idle connections (if >20 idle):

   ```shell
   docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
     "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
   ```
Common Causes:
- Connection pool misconfiguration
- Connection leak (application not closing connections)
- Increased traffic
- Long-running queries holding connections
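For leaks that recur, PostgreSQL can reap stuck sessions itself via a server setting. A sketch — the 10-minute value is illustrative, and this setting only covers sessions idle inside an open transaction, so plain idle sessions still need the manual sweep above:

```sql
-- Terminate sessions that sit idle inside a transaction for >10 minutes.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();  -- apply without a server restart
```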
Alert Escalation Matrix
| Alert | Severity | Initial Response | Escalate If... | Escalate To |
| --- | --- | --- | --- | --- |
| APIDown | CRITICAL | Immediate | Not resolved in 5 minutes | On-call DevOps |
| DatabaseHealthCheckFailing | CRITICAL | Immediate | Not resolved in 5 minutes | Database Admin |
| HighMemoryUsage | CRITICAL | Immediate | Memory climbs again after restart | Backend Team |
| HighErrorRate | CRITICAL | Immediate | Error rate >10% or not resolved in 10 minutes | Backend Team |
| SlowDatabaseQueries | WARNING | Investigate | p99 >500ms for >15 minutes | Backend Team |
| HighSubmissionRejectionRate | WARNING | Investigate | Rejection rate >50% | Business Team |
| NoSubmissionsReceived | WARNING | Investigate | >4 hours during business hours | Integration Team |
Quick Reference: Common Commands
Container Management
```shell
# Check container status
docker ps -a --filter "name=blueline-alpha"

# Restart API container
docker restart blueline-alpha-api

# View API logs
docker logs blueline-alpha-api --tail 100

# Check health endpoint
curl http://localhost:3003/health | jq
```
Database Management
```shell
# Check database connectivity
docker exec blueline-alpha-db pg_isready -U postgres -d sampo_blueline_alpha

# Count active connections
docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='sampo_blueline_alpha';"

# Kill idle connections
docker exec blueline-alpha-db psql -U postgres -d sampo_blueline_alpha -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
```
Monitoring Access
```shell
# SSH tunnel for Prometheus (port 9090)
ssh -L 9090:localhost:9090 root@<VPS-IP>

# SSH tunnel for Alertmanager (port 9093)
ssh -L 9093:localhost:9093 root@<VPS-IP>

# Access Grafana directly (port 3004): http://<VPS-IP>:3004
```
Understanding Alert Emails
Email Subject Format
```
[FIRING:1] AlertName Deployment
[RESOLVED] AlertName Deployment
```

- `FIRING` - Alert is currently active
- `RESOLVED` - Alert has been resolved
- Number (e.g., `1`) - Count of grouped alerts
Email Body Contents
- Alert Name - Unique alert identifier
- Severity - CRITICAL or WARNING
- Description - What the alert means
- Deployment - Which deployment is affected
- Instance - Specific service/container
- Started At - When alert first fired
- Runbook - Link to response procedures (if available)
Related Articles
- Monitoring & Alerting Overview - System architecture and components
- Monitoring Configuration - SMTP setup and notification channels
- Deployment System Overview - Deployment resilience features
Support
For alert response questions or escalation:
- Complete Runbooks:
docs/operations/alerting-guide.md(50KB reference) - Alert Emails: Include direct links to relevant runbooks
- On-Call Support: Contact DevOps team for critical alerts
Last Updated: 2026-02-08
Alert Count: 25+ alerts across 5 categories