Event System Dashboard

Last updated: March 25, 2026

What is the Event System Dashboard?

The Event System Dashboard provides a real-time overview of Sampo's event-driven architecture: a system in which different parts of the application communicate by sending and receiving events rather than making direct API calls.

Why Events Matter

Think of events like a notification system:

When a new listing is created, an event is emitted
When a user updates their profile, an event is emitted
When a submission is converted to a listing, an event is emitted

These events trigger automated actions like:

Sending email notifications
Updating search indexes
Logging audit trails
Syncing data to external systems

Dashboard Overview

┌─────────────────────────────────────────────────────────────┐
│                    EVENT SYSTEM DASHBOARD                    │
├─────────────────────────────────────────────────────────────┤
│  [Stats Cards]    [Event Rate Chart]    [Latency Gauge]     │
│  • Total Events   • Events/min over   • P50/P95/P99        │
│  • Error Rate       time              processing times     │
│  • Avg Latency                                              │
├─────────────────────────────────────────────────────────────┤
│  [Top Event Types]          [Recent Failed Events]           │
│  • Most frequent events     • Errors needing attention       │
│  • Volume breakdown         • Quick links to Dead Letter     │
└─────────────────────────────────────────────────────────────┘

Understanding the Metrics

Stats Cards (Top Row)

Total Events

What it shows: Total number of events emitted in the current time period
Normal range: Varies by deployment activity (100-10,000+ per hour)
When to worry: Sudden drops to near zero indicate system issues
Green/Yellow/Red: Based on comparison to historical averages

Error Rate

What it shows: Percentage of events that failed processing
Formula: (Failed Events / Total Events) × 100
Normal: < 1%
Warning: 1-5% (yellow)
Critical: > 5% (red) - investigate immediately
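
The formula and bands above can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual code: the function names are hypothetical, and treating an empty period as a 0% error rate is an assumption.

```python
def error_rate(failed: int, total: int) -> float:
    """Return failed events as a percentage of total events."""
    if total == 0:
        return 0.0  # assumption: no events in the period means nothing failed
    return failed / total * 100


def error_status(rate_percent: float) -> str:
    """Map an error rate to the bands listed above."""
    if rate_percent < 1:
        return "normal"
    if rate_percent <= 5:
        return "warning"   # 1-5%, shown in yellow
    return "critical"      # above 5%, shown in red


rate = error_rate(failed=12, total=2400)
print(f"{rate:.2f}% -> {error_status(rate)}")  # 0.50% -> normal
```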

Common causes of high error rate:

Database connection issues
External API failures (email service, QuickBooks, etc.)
Memory exhaustion
Code bugs in event handlers

Average Latency

What it shows: Average time to process an event (in milliseconds)
Normal: < 100ms for most events
Warning: 100-500ms
Critical: > 500ms

High latency can indicate:

Slow database queries
External API delays
High system load
Inefficient event handlers

Event Rate Chart

What it shows: Events per minute over the last hour

How to read it:

Steady line: Normal operation
Spikes: High activity (e.g., bulk imports, marketing campaigns)
Drops: Potential issues or low activity periods
Pattern recognition: Look for daily/weekly patterns

Example scenarios:

Morning spike at 9 AM: Users starting work
Weekend dips: Lower business activity
Sudden flatline: System outage
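
To make the chart's data concrete, here is a rough sketch of how an events-per-minute series could be derived from raw event timestamps, and how a flatline at the end of the series could be measured. The function names and the sample timestamps are illustrative assumptions; the dashboard's own aggregation may differ.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_per_minute(timestamps, now, window_minutes=60):
    """Count events in each one-minute bucket of the trailing window.

    Returns (minute_start, count) pairs, oldest first. Minutes with no
    events are included with a count of 0 so that a flatline is visible.
    """
    end = now.replace(second=0, microsecond=0)
    start = end - timedelta(minutes=window_minutes - 1)
    counts = Counter(
        ts.replace(second=0, microsecond=0)
        for ts in timestamps
        if start <= ts < end + timedelta(minutes=1)
    )
    buckets = [start + timedelta(minutes=i) for i in range(window_minutes)]
    return [(bucket, counts.get(bucket, 0)) for bucket in buckets]


def trailing_idle_minutes(series):
    """Number of consecutive zero-event minutes at the end of the series."""
    idle = 0
    for _, count in reversed(series):
        if count:
            break
        idle += 1
    return idle


# Sample data (made up for illustration): the last event arrived at 09:24.
now = datetime(2026, 3, 25, 9, 30, 15)
sample = [
    datetime(2026, 3, 25, 9, 20, 5),
    datetime(2026, 3, 25, 9, 20, 40),
    datetime(2026, 3, 25, 9, 24, 59),
]
series = events_per_minute(sample, now)
print(trailing_idle_minutes(series))  # 6
```

A trailing idle count that keeps growing during normal business hours corresponds to the "sudden flatline" scenario above.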

Latency Gauge

What it shows: Processing time percentiles

P50 (50th percentile): Half of events process faster than this
P95 (95th percentile): 95% of events process faster than this
P99 (99th percentile): 99% of events process faster than this

Why percentiles matter:

Average (mean) can be misleading - a few very slow events skew it
P95/P99 show the tail: what the slowest 5% and 1% of events experience
If P99 is 5 seconds, 1% of events take 5 seconds or longer

Target values:

P50: < 50ms
P95: < 200ms
P99: < 500ms
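
The following sketch shows why the mean and the percentiles can tell different stories, and how the target values above might be checked. It uses the nearest-rank percentile definition, which is an assumption (the dashboard may interpolate differently), and the latency sample is invented for illustration.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: most events are fast, a few are very slow.
latencies_ms = [20] * 90 + [150] * 8 + [5000] * 2

targets_ms = {50: 50, 95: 200, 99: 500}  # target values from the list above

print(f"mean: {mean(latencies_ms)} ms")  # mean: 130 ms
for p, target in targets_ms.items():
    value = percentile(latencies_ms, p)
    verdict = "ok" if value < target else "over target"
    print(f"P{p}: {value} ms ({verdict})")
# P50: 20 ms (ok)
# P95: 150 ms (ok)
# P99: 5000 ms (over target)
```

Here two slow events out of 100 pull the mean to 130 ms even though the typical event takes 20 ms, while P99 exposes the 5-second tail directly.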

Top Event Types

What it shows: The 5 most frequently emitted events

Example:

1. listing.created        - 1,234 events (45%)
2. user.updated           - 567 events (21%)
3. submission.converted   - 234 events (9%)
4. order.created          - 123 events (5%)
5. email.sent             - 89 events (3%)

How to use this:

Identify high-volume events that might need optimization
Spot unusual patterns (e.g., 10x normal user.login events = potential attack)
Plan capacity based on event volume
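
As a rough illustration of how such a ranking could be produced, the sketch below counts event type names and reports each type's share of the total. It assumes the events are available as a flat list of type names; the function name and sample data are hypothetical.

```python
from collections import Counter


def top_event_types(event_types, n=5):
    """Return (event_type, count, percent_of_total) for the n most frequent types."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return [
        (name, count, round(count * 100 / total))
        for name, count in counts.most_common(n)
    ]


# Invented sample of emitted event types.
sample = ["listing.created"] * 6 + ["user.updated"] * 3 + ["email.sent"]

for rank, (name, count, share) in enumerate(top_event_types(sample), start=1):
    print(f"{rank}. {name:<20} - {count} events ({share}%)")
```

Comparing the counts against a typical day's figures is one way to spot the "10x normal" pattern mentioned above.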

Recent Failed Events

What it shows: Last 10 events that failed to process

Columns:

Event Type: What kind of event failed
Error: Brief error message
Time: When it failed
Actions: Quick links to retry or view details

When to act:

Any failed event should be investigated
Multiple failures of the same type point to a systemic issue
Click "View in Dead Letter Queue" for detailed analysis

How to Use This Page

Daily Health Check (2 minutes)

Check Error Rate - Should be < 1%
Check Latency - P95 should be < 200ms
Scan Recent Failures - Any new failures?
Review Top Events - Any unusual volumes?

When Investigating Issues

Scenario 1: Users report slow performance

Check Latency Gauge - are P95/P99 high?
Check Event Rate Chart - spike in volume?
Check Top Event Types - which events are slow?
Click through to Event Metrics for detailed analysis

Scenario 2: Error alert triggered

Note the error rate percentage
Check Recent Failed Events section
Click "View in Dead Letter Queue" for full details
Identify pattern (same event type? same error message?)
Retry failed events after fixing root cause
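
The pattern-identification step can be sketched as grouping failures by event type and error message. The record shape used here (dictionaries with "event_type" and "error" keys) is an assumption for illustration and may not match what the Dead Letter Queue actually exposes.

```python
from collections import Counter


def failure_patterns(failed_events, min_count=2):
    """Group failures by (event type, error message) and keep repeated ones,
    most frequent first."""
    counts = Counter((e["event_type"], e["error"]) for e in failed_events)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Invented failures for illustration.
failed = [
    {"event_type": "email.sent", "error": "SMTP timeout"},
    {"event_type": "email.sent", "error": "SMTP timeout"},
    {"event_type": "order.created", "error": "DB connection refused"},
    {"event_type": "email.sent", "error": "SMTP timeout"},
]

for (event_type, error), n in failure_patterns(failed):
    print(f"{n}x {event_type}: {error}")  # 3x email.sent: SMTP timeout
```

A single dominant pair, as in this sample, suggests one root cause to fix before retrying; many unrelated pairs suggest a broader problem such as system load.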

Scenario 3: Unusual activity detected

Check Event Rate Chart for spikes
Review Top Event Types for unexpected volumes
Compare to historical patterns
Investigate source (marketing campaign? bot traffic?)

Common Questions

Q: Why are some events showing as "failed"?

A: Events fail when:

Database is temporarily unavailable
External API (email, QuickBooks) times out
Event handler has a bug
System is under heavy load

Fix: Go to Dead Letter Queue to retry after resolving the issue.

Q: What does "P95 latency" mean?

A: 95% of events process faster than this time. If P95 is 200ms, then 95 out of 100 events complete in under 200ms.

Q: Why is my error rate 0% but users are complaining?

A: Check the Latency Gauge. Events might be succeeding but taking too long, in which case the problem is slow processing rather than failures.

Q: Can I see individual events?

A: This dashboard shows aggregated statistics. For individual events with trace IDs and payloads, use the Event Flow page (Phase 3 will add database persistence for full event history).

Q: How far back does the data go?

A: The dashboard currently shows the last hour of in-memory statistics. For historical trends, use the Event Metrics page with longer time ranges.

Best Practices

✅ Do

Check this page daily as part of system health monitoring
Investigate any error rate above 1%
Use latency metrics to identify performance degradation
Click through to Dead Letter Queue for failed event details
Export metrics before system maintenance for baseline comparison

❌ Don't

Ignore yellow/red indicators - they indicate real issues
Retry failed events without fixing the root cause
Assume zero errors means perfect performance (check latency too)
Panic over single event failures - look for patterns

Related Pages

Dead Letter Queue - Manage failed events
Event Flow - Real-time event stream
Event Metrics - Detailed analytics and trends
System Alerts - Automated alerting setup

Need Help?

If you see:

Error rate > 10%: Contact engineering immediately
Latency P99 > 5 seconds: System under severe stress
Zero events for > 5 minutes: Potential system outage
Same error repeating: Check Dead Letter Queue for details
