Event System Dashboard
What is the Event System Dashboard?
The Event System Dashboard provides a real-time overview of Sampo's event-driven architecture - a system where different parts of the application communicate by sending and receiving events rather than direct API calls.
Why Events Matter
Think of events like a notification system:
- When a new listing is created, an event is emitted
- When a user updates their profile, an event is emitted
- When a submission is converted to a listing, an event is emitted
These events trigger automated actions like:
- Sending email notifications
- Updating search indexes
- Logging audit trails
- Syncing data to external systems
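The pattern above can be sketched as a tiny in-process publish/subscribe bus. This is only an illustration of the idea, not Sampo's actual implementation: the `EventBus` class, its method names, and the payload shape are all assumptions.

```python
from collections import defaultdict
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], None]


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        # Maps an event type such as "listing.created" to its handlers.
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, payload: dict[str, Any]) -> int:
        """Call every handler registered for event_type; return how many ran."""
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = EventBus()
sent_emails: list[str] = []
audit_log: list[str] = []

# Two independent reactions to the same event (hypothetical handlers).
bus.subscribe("listing.created", lambda e: sent_emails.append(f"New listing {e['id']}"))
bus.subscribe("listing.created", lambda e: audit_log.append(f"listing.created id={e['id']}"))

handled = bus.emit("listing.created", {"id": 42})
```

The emitter never calls the email or audit code directly; it only announces that something happened, and each subscriber decides what to do with it.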
Dashboard Overview
┌─────────────────────────────────────────────────────────────┐
│                   EVENT SYSTEM DASHBOARD                    │
├─────────────────────────────────────────────────────────────┤
│ [Stats Cards]      [Event Rate Chart]   [Latency Gauge]     │
│ • Total Events     • Events/min over    • P50/P95/P99       │
│ • Error Rate         time                 processing times  │
│ • Avg Latency                                               │
├─────────────────────────────────────────────────────────────┤
│ [Top Event Types]             [Recent Failed Events]        │
│ • Most frequent events        • Errors needing attention    │
│ • Volume breakdown            • Quick links to Dead Letter  │
└─────────────────────────────────────────────────────────────┘
Understanding the Metrics
Stats Cards (Top Row)
Total Events
- What it shows: Total number of events emitted in the current time period
- Normal range: Varies by deployment activity (100-10,000+ per hour)
- When to worry: A sudden drop to near zero usually points to a system issue
- Green/Yellow/Red: The status color is based on comparison with historical averages
Error Rate
- What it shows: Percentage of events that failed processing
- Formula: (Failed Events / Total Events) × 100
- Normal: < 1%
- Warning: 1-5% (yellow)
- Critical: > 5% (red) - investigate immediately
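As a rough sketch, the error rate formula and the thresholds listed here could be expressed as follows. The function names and the status labels are hypothetical, and treating a period with no events as 0% is an assumption made to avoid dividing by zero.

```python
def error_rate(failed: int, total: int) -> float:
    """Return the error rate as a percentage: (failed / total) * 100."""
    if total == 0:
        # Assumption: with no events there is nothing to fail, so report 0%.
        return 0.0
    return failed / total * 100


def error_rate_status(rate: float) -> str:
    """Map a percentage onto the bands used on this page."""
    if rate < 1:
        return "normal"
    if rate <= 5:
        return "warning"   # yellow
    return "critical"      # red, investigate immediately


# Example: 12 failures out of 1,000 events is about 1.2%, in the warning band.
status = error_rate_status(error_rate(12, 1000))
```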
Common causes of high error rate:
- Database connection issues
- External API failures (email service, QuickBooks, etc.)
- Memory exhaustion
- Code bugs in event handlers
Average Latency
- What it shows: Average time to process an event (in milliseconds)
- Normal: < 100ms for most events
- Warning: 100-500ms
- Critical: > 500ms
High latency indicates:
- Slow database queries
- External API delays
- High system load
- Inefficient event handlers
Event Rate Chart
What it shows: Events per minute over the last hour
How to read it:
- Steady line: Normal operation
- Spikes: High activity (e.g., bulk imports, marketing campaigns)
- Drops: Potential issues or low activity periods
- Pattern recognition: Look for daily/weekly patterns
Example scenarios:
- Morning spike at 9 AM: Users starting work
- Weekend dips: Lower business activity
- Sudden flatline: System outage
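To make the chart's data concrete, here is one way an events-per-minute series could be derived from raw event timestamps. It is a sketch under assumptions: the function name and the use of `datetime` timestamps are illustrative, and the dashboard's real aggregation may differ.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_per_minute(timestamps, now, window_minutes=60):
    """Return (minute_start, count) pairs for the last window, oldest first.

    Minutes with no events are included with a count of 0, so a flatline
    shows up in the series instead of silently disappearing.
    """
    end = now.replace(second=0, microsecond=0)
    start = end - timedelta(minutes=window_minutes - 1)
    # Truncate each timestamp to its minute and count occurrences.
    counts = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    minutes = [start + timedelta(minutes=i) for i in range(window_minutes)]
    return [(minute, counts.get(minute, 0)) for minute in minutes]


now = datetime(2024, 1, 1, 9, 5, 30)
timestamps = [
    datetime(2024, 1, 1, 9, 3, 10),
    datetime(2024, 1, 1, 9, 3, 50),
    datetime(2024, 1, 1, 9, 5, 1),
]
series = events_per_minute(timestamps, now, window_minutes=3)
counts_only = [count for _, count in series]  # [2, 0, 1]
```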
Latency Gauge
What it shows: Processing time percentiles
- P50 (50th percentile): Half of events process faster than this
- P95 (95th percentile): 95% of events process faster than this
- P99 (99th percentile): 99% of events process faster than this
Why percentiles matter:
- The average (mean) can be misleading: a few very slow events skew it
- P95/P99 show the near-worst-case experience
- If P99 is 5 seconds, 1% of events take 5 seconds or longer
Target values:
- P50: < 50ms
- P95: < 200ms
- P99: < 500ms
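The following sketch shows why percentiles tell a different story from the mean. It uses the nearest-rank method, which is one common definition of a percentile; the dashboard may compute its values differently, and the sample latencies are made up for illustration.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Nine fast events and one very slow one (latencies in milliseconds).
latencies_ms = [12, 15, 18, 20, 22, 25, 30, 35, 40, 4000]

mean_ms = statistics.mean(latencies_ms)  # about 421.7, dominated by the outlier
p50_ms = percentile(latencies_ms, 50)    # 22
p95_ms = percentile(latencies_ms, 95)    # 4000
```

Here the mean suggests every event is slow, while P50 shows the typical event finishes in 22ms and P95 exposes the single outlier.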
Top Event Types
What it shows: The 5 most frequently emitted events
Example:
1. listing.created - 1,234 events (45%)
2. user.updated - 567 events (21%)
3. submission.converted - 234 events (9%)
4. order.created - 123 events (5%)
5. email.sent - 89 events (3%)
How to use this:
- Identify high-volume events that might need optimization
- Spot unusual patterns (e.g., 10x normal user.login events = potential attack)
- Plan capacity based on event volume
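A ranking like the example above can be produced with a simple frequency count. This is a minimal sketch, assuming the input is just a list of event type names; the function name and the rounding of percentages are illustrative choices.

```python
from collections import Counter


def top_event_types(event_types, n=5):
    """Return (event_type, count, percent_of_total) for the n most frequent types."""
    counts = Counter(event_types)
    total = sum(counts.values())
    # most_common() yields nothing for empty input, so total is never 0 here.
    return [
        (name, count, round(count / total * 100))
        for name, count in counts.most_common(n)
    ]


events = ["listing.created"] * 6 + ["user.updated"] * 3 + ["order.created"]
ranking = top_event_types(events)
```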
Recent Failed Events
What it shows: Last 10 events that failed to process
Columns:
- Event Type: What kind of event failed
- Error: Brief error message
- Time: When it failed
- Actions: Quick links to retry or view details
When to act:
- Any failed event should be investigated
- Multiple failures of same type = systematic issue
- Click "View in Dead Letter Queue" for detailed analysis
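One way to separate isolated failures from systematic ones is to group failed events by type and error message. The sketch below assumes each failed event is a dictionary with `event_type` and `error` keys; those field names are hypothetical and may not match the actual data.

```python
from collections import Counter


def repeated_failures(failed_events, min_count=2):
    """Return ((event_type, error), count) pairs that occur at least min_count times."""
    counts = Counter((e["event_type"], e["error"]) for e in failed_events)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


failed = [
    {"event_type": "email.sent", "error": "SMTP timeout"},
    {"event_type": "email.sent", "error": "SMTP timeout"},
    {"event_type": "order.created", "error": "DB connection lost"},
    {"event_type": "email.sent", "error": "SMTP timeout"},
]
patterns = repeated_failures(failed)
```

In this made-up sample, the three identical `email.sent` failures stand out as a likely systematic issue, while the single `order.created` failure does not.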
How to Use This Page
Daily Health Check (2 minutes)
- Check Error Rate - Should be < 1%
- Check Latency - P95 should be < 200ms
- Scan Recent Failures - Any new failures?
- Review Top Events - Any unusual volumes?
When Investigating Issues
Scenario 1: Users report slow performance
- Check Latency Gauge - are P95/P99 high?
- Check Event Rate Chart - spike in volume?
- Check Top Event Types - which events are slow?
- Click through to Event Metrics for detailed analysis
Scenario 2: Error alert triggered
- Note the error rate percentage
- Check Recent Failed Events section
- Click "View in Dead Letter Queue" for full details
- Identify pattern (same event type? same error message?)
- Retry failed events after fixing root cause
Scenario 3: Unusual activity detected
- Check Event Rate Chart for spikes
- Review Top Event Types for unexpected volumes
- Compare to historical patterns
- Investigate source (marketing campaign? bot traffic?)
Common Questions
Q: Why are some events showing as "failed"?
A: Events fail when:
- Database is temporarily unavailable
- External API (email, QuickBooks) times out
- Event handler has a bug
- System is under heavy load
Fix: Go to Dead Letter Queue to retry after resolving the issue.
Q: What does "P95 latency" mean?
A: 95% of events process faster than this time. If P95 is 200ms, then 95 out of 100 events complete in under 200ms.
Q: Why is my error rate 0% but users are complaining?
A: Check the Latency Gauge. Events might be succeeding but taking too long (slow performance vs. failures).
Q: Can I see individual events?
A: This dashboard shows aggregated statistics. For individual events with trace IDs and payloads, use the Event Flow page (Phase 3 will add database persistence for full event history).
Q: How far back does the data go?
A: The dashboard currently shows only the last hour of in-memory statistics. For historical trends, use the Event Metrics page with longer time ranges.
Best Practices
✅ Do
- Check this page daily as part of system health monitoring
- Investigate any error rate above 1%
- Use latency metrics to identify performance degradation
- Click through to Dead Letter Queue for failed event details
- Export metrics before system maintenance for baseline comparison
❌ Don't
- Ignore yellow/red indicators - they signal real issues
- Retry failed events without fixing the root cause
- Assume zero errors means perfect performance (check latency too)
- Panic over single event failures - look for patterns
Related Pages
- Dead Letter Queue - Manage failed events
- Event Flow - Real-time event stream
- Event Metrics - Detailed analytics and trends
- System Alerts - Automated alerting setup
Need Help?
If you see:
- Error rate > 10%: Contact engineering immediately
- Latency P99 > 5 seconds: System under severe stress
- Zero events for > 5 minutes: Potential system outage
- Same error repeating: Check Dead Letter Queue for details
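The numeric thresholds above could be checked automatically along these lines. This is a sketch only: the function name and its inputs are assumptions, and in practice the System Alerts page is the place to configure automated alerting.

```python
def escalation_reasons(error_rate_pct, p99_ms, minutes_since_last_event):
    """Return the escalation conditions from this section that currently apply."""
    reasons = []
    if error_rate_pct > 10:
        reasons.append("error rate above 10%: contact engineering immediately")
    if p99_ms > 5000:
        reasons.append("P99 latency above 5 seconds: system under severe stress")
    if minutes_since_last_event > 5:
        reasons.append("no events for more than 5 minutes: potential system outage")
    return reasons


healthy = escalation_reasons(error_rate_pct=0.4, p99_ms=180, minutes_since_last_event=0.5)
unhealthy = escalation_reasons(error_rate_pct=12, p99_ms=6000, minutes_since_last_event=7)
```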