Event Metrics
What is Event Metrics?
The Event Metrics page provides historical analytics and trend analysis for your event-driven architecture. While the Event System Dashboard shows current status and Event Flow shows real-time activity, Event Metrics answers questions like:
- "Has our error rate been increasing over the past week?"
- "What time of day do we process the most events?"
- "Which event types are slowing down?"
- "How did yesterday's outage affect our metrics?"
When to Use This Page
Performance Analysis: "Are events taking longer to process than last month?"
Capacity Planning: "Do we need to scale up before the holiday season?"
Incident Investigation: "What exactly happened during yesterday's outage?"
Reporting: "Generate a weekly report on system performance"
Trend Spotting: "Are we seeing more failures on weekends?"
Understanding the Interface
Time Range Selector
Choose the period to analyze:
| Range | Best For | Data Points |
|---|---|---|
| Last 15m | Immediate post-incident | 15 one-minute points |
| Last 1h | Recent trend check | 60 one-minute points |
| Last 6h | Half-day analysis | 72 five-minute points |
| Last 24h | Daily patterns | 96 fifteen-minute points |
| Last 7d | Weekly trends | 168 one-hour points |
Tip: Longer ranges show patterns; shorter ranges show details.
Metric Cards Explained
Total Emitted
What it shows: Total number of events created in the selected time range
How to use it:
- Baseline: Establish normal daily volume (e.g., "We typically emit 50,000 events/day")
- Spike detection: 3x normal = unusual activity (marketing campaign? attack?)
- Drop detection: Near zero = system outage
Example analysis:
Monday: 52,000 events ← Normal
Tuesday: 48,000 events ← Normal
Wednesday: 12,000 events ← ⚠️ 75% drop - investigate!
Thursday: 155,000 events ← 🚨 3x spike - what happened?
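The spike and drop checks in this example can be sketched in a few lines of Python. This is only an illustration, not something the page runs for you: the function name `classify_volume`, the 50,000 baseline, and the 50% drop threshold are assumptions; only the "3x normal" spike rule comes from the guidance above.

```python
def classify_volume(count: int, baseline: int,
                    spike_factor: float = 3.0,
                    drop_fraction: float = 0.5) -> str:
    """Label a daily event count relative to a baseline.

    spike_factor mirrors the "3x normal" rule of thumb.
    drop_fraction is an assumed threshold: anything at or below
    50% of baseline is flagged as a drop.
    """
    if count >= baseline * spike_factor:
        return "spike"
    if count <= baseline * drop_fraction:
        return "drop"
    return "normal"


baseline = 50_000  # assumed "normal" daily volume
week = {"Monday": 52_000, "Tuesday": 48_000,
        "Wednesday": 12_000, "Thursday": 155_000}
labels = {day: classify_volume(count, baseline) for day, count in week.items()}

for day, label in labels.items():
    print(f"{day}: {week[day]:,} events -> {label}")
```

Tune both thresholds to your own baseline; a system with naturally bursty traffic may need a higher spike factor.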
Total Processed
What it shows: Events that completed successfully
Key metric: Processed / Emitted ratio
- 100%: Perfect - all events succeeded
- < 95%: Concerning - many failures
- < 90%: Critical - investigate immediately
Formula: (Total Processed / Total Emitted) × 100
Total Failed
What it shows: Events that failed to process
Context matters:
- 5 failures out of 100: 5% error rate (at the upper edge of acceptable)
- 5 failures out of 10: 50% error rate (critical)
Always check the ratio, not just the number!
Error Rate
What it shows: Percentage of events that failed
Color coding:
- 🟢 Green (< 1%): Healthy system
- 🟡 Yellow (1-5%): Warning - investigate
- 🔴 Red (> 5%): Critical - immediate action needed
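If you work with exported counts, the same ratio and color bands can be reproduced as follows. This is a minimal sketch: the function names are hypothetical, and treating exactly 1% and exactly 5% as yellow is an assumption about how the boundaries are handled.

```python
def error_rate(failed: int, emitted: int) -> float:
    """Failed events as a percentage of emitted events."""
    if emitted == 0:
        return 0.0  # assumption: no traffic is reported as 0%
    return failed * 100 / emitted


def status_color(rate_percent: float) -> str:
    """Map an error rate to the color bands listed above."""
    if rate_percent < 1:
        return "green"
    if rate_percent <= 5:
        return "yellow"
    return "red"


# Same number of failures, very different health:
for failed, emitted in [(5, 100), (5, 10)]:
    rate = error_rate(failed, emitted)
    print(f"{failed}/{emitted} failed -> {rate:.0f}% ({status_color(rate)})")
```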
Industry benchmarks:
- Excellent: < 0.1%
- Good: 0.1% - 1%
- Acceptable: 1% - 5%
- Poor: > 5%
When error rate spikes:
- Check Dead Letter Queue for error details
- Identify affected event types
- Check system status (database, external APIs)
- Review recent deployments
Average Latency
What it shows: Typical time to process an event. Despite the card's name, the value is the P50 (median), not the arithmetic mean.
Why P50 (median) rather than the mean?
- The mean gets skewed by outliers
- Median shows "typical" experience
- 50% of events faster than this, 50% slower
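A quick way to see the difference is to compute both values for a sample with one extreme outlier. The latency values below are made up for illustration.

```python
from statistics import mean, median

# 99 events at 10 ms plus a single 10-second outlier (values in milliseconds)
latencies_ms = [10] * 99 + [10_000]

avg = mean(latencies_ms)    # pulled up by the single outlier
p50 = median(latencies_ms)  # unaffected by it

print(f"mean: {avg:.1f} ms, median: {p50} ms")
```

One slow event out of a hundred makes the mean roughly eleven times larger than the median, even though 99% of events were fast.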
Performance tiers:
- 🟢 < 50ms: Excellent
- 🟢 50-100ms: Good
- 🟡 100-500ms: Acceptable but watch trends
- 🔴 > 500ms: Poor - optimization needed
Latency trends to watch:
- Gradual increase: System degrading, needs attention
- Sudden spike: Specific issue (database slowdown, API timeout)
- Consistent high latency: Architectural issue
Charts Deep Dive
Event Volume Over Time
What it shows: Events per time bucket (emitted, processed, failed)
How to read it:
Normal pattern:
Volume
│ ╱╲ ╱╲ ╱╲
│ ╱ ╲ ╱ ╲ ╱ ╲
│ ╱ ╲ ╱ ╲ ╱ ╲
│ ╱ ╲╱ ╲╱ ╲
└─────────────────────────────
Morning Noon Evening
- Regular peaks during business hours
- Steady baseline overnight
Problem patterns:
Sudden drop to zero:
Volume
│ ╱╲
│ ╱ ╲
│ ╱ ╲_______
│ ╱ ╲_______
└─────────────────────────────
↑
System outage!
Gradual increase:
Volume
│
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
└─────────────────────────────
Growing load - scale needed
Spike then return:
Volume
│ ╱╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲____
│ ╱ ╲___
└─────────────────────────────
↑
Marketing campaign
Latency Distribution
What it shows: Processing time percentiles (P50, P95, P99)
Understanding percentiles:
P50 (Median):
- Half of events process faster than this
- "Typical" user experience
- Good for overall health
P95:
- 95% of events process faster than this
- "Worst case for most users"
- Good for SLA monitoring
P99:
- 99% of events process faster than this
- "Worst case for almost everyone"
- Good for identifying outliers
Example interpretation:
P50: 45ms ← Most users have fast experience
P95: 120ms ← 95% of users under 120ms (good SLA)
P99: 350ms ← 1% of users wait 350ms+ (investigate)
When to worry:
- P95 > 500ms: At least 5% of events take longer than half a second
- P99 > 2s: At least 1% of events take longer than two seconds
- Gap between P50 and P99 growing: Inconsistent performance
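To check percentiles yourself from raw latency samples, a nearest-rank calculation is enough. This is a sketch under assumptions: the page does not state which percentile method it uses, so results may differ slightly from the chart, and `percentile` is a hypothetical helper.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    p percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # toy data: 1 ms, 2 ms, ..., 100 ms
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

# A growing P99/P50 ratio over time points to inconsistent performance.
tail_ratio = p99 / p50
print(f"P50={p50} ms  P95={p95} ms  P99={p99} ms  P99/P50={tail_ratio:.2f}")
```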
Error Breakdown by Event Type
What it shows: Which event types are failing most
How to use it:
Identify problematic handlers:
1. listing.created ████████████ 45% of errors
2. email.sent ██████ 23% of errors
3. user.updated ███ 12% of errors
→ Focus engineering effort on listing.created handler
Detect patterns:
- One event type dominating errors = specific bug
- Multiple event types failing = systemic issue (database, network)
- New event type appearing = recently introduced bug
Prioritization:
- Fix high-volume, high-error events first
- Low-volume events can wait
- Consider disabling problematic event types temporarily
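The breakdown is simply each event type's share of all failures. A sketch of that calculation, assuming you have a list of failed event types (for example from Dead Letter Queue records); the data and the `other` bucket are hypothetical:

```python
from collections import Counter

# Hypothetical failure log: one entry per failed event, keyed by event type.
failed_events = (["listing.created"] * 45 + ["email.sent"] * 23
                 + ["user.updated"] * 12 + ["other"] * 20)

counts = Counter(failed_events)
total = sum(counts.values())

# most_common() sorts by count, so the first key is the worst offender.
shares = {etype: count * 100 / total for etype, count in counts.most_common()}
worst = next(iter(shares))

for etype, share in shares.items():
    print(f"{etype:<18} {share:5.1f}% of errors")
```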
How to Use This Page
Daily Health Check (5 minutes)
- Set time range: Last 24 hours
- Check Error Rate: Should be < 1%
- Check Latency: P95 should be < 200ms
- Review Error Breakdown: Any new event types failing?
- Check Volume Chart: Normal daily pattern?
Red flags:
- Error rate > 5%
- P99 latency > 1 second
- Volume chart shows unexpected drops/spikes
- New event type in error breakdown
Weekly Analysis (15 minutes)
- Set time range: Last 7 days
- Export data (JSON or CSV)
- Compare to previous week:
- Volume up/down?
- Error rate trending?
- Latency improving/degrading?
- Identify patterns:
- Weekend vs weekday differences
- Peak hours
- Growth trends
- Report findings to team
Questions to answer:
- Are we growing? (volume trend)
- Is reliability improving? (error rate trend)
- Is performance degrading? (latency trend)
- What should we fix first? (error breakdown)
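The week-over-week comparison comes down to a percent change per metric. A minimal sketch, assuming you have already pulled one summary per week out of your exports; the field names and numbers are invented for illustration.

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from previous to current, in percent."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) * 100 / previous


# Hypothetical weekly summaries taken from two exports.
last_week = {"emitted": 350_000, "error_rate": 0.8, "p95_ms": 120}
this_week = {"emitted": 385_000, "error_rate": 1.0, "p95_ms": 150}

changes = {name: percent_change(last_week[name], this_week[name])
           for name in last_week}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

In this made-up example volume grew 10% while error rate and P95 latency both grew about 25%, which suggests the system is degrading faster than it is growing.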
Post-Incident Analysis
Scenario: The system had an outage yesterday between 2 and 3 PM
- Set time range: Last 24 hours
- Focus on incident window: Look at 2-3 PM on chart
- Document the impact:
- Volume dropped to zero?
- Error rate spiked?
- Latency increased?
- Export metrics for incident report
- Calculate recovery time: When did metrics return to normal?
Export for incident report:
- Screenshot of volume chart showing outage
- Error rate during incident
- Events affected count
- Recovery timeline
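Recovery time can be estimated from exported volume buckets by finding the first degraded bucket and the first healthy bucket after it. This is a sketch, not a built-in feature: `recovery_time`, the 80% tolerance, and the sample data are assumptions.

```python
from datetime import datetime, timedelta


def recovery_time(buckets, baseline, tolerance=0.8):
    """Return (outage_start, recovered_at, duration), or None.

    buckets: chronologically ordered (timestamp, emitted_count) pairs.
    A bucket is "degraded" when its volume is below tolerance * baseline
    (the 80% tolerance is an assumption).
    """
    threshold = baseline * tolerance
    outage_start = None
    for timestamp, count in buckets:
        if count < threshold and outage_start is None:
            outage_start = timestamp  # first degraded bucket
        elif count >= threshold and outage_start is not None:
            return outage_start, timestamp, timestamp - outage_start
    return None  # no outage found, or not yet recovered


start = datetime(2024, 1, 15, 13, 30)  # hypothetical incident day
volumes = [500, 510, 0, 0, 0, 0, 120, 490, 505]  # per 15-minute bucket
buckets = [(start + timedelta(minutes=15 * i), v) for i, v in enumerate(volumes)]

result = recovery_time(buckets, baseline=500)
print(result)
```

Note that the partial recovery at 3:00 PM (120 events) still counts as degraded here, so the reported duration is 75 minutes rather than 60.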
Capacity Planning
Question: "Do we need to scale up for holiday season?"
- Set time range: Last 7 days
- Note peak volume: What's our highest event rate?
- Check latency during peaks: Does performance degrade?
- Review error rates under load: More failures during busy times?
- Project growth: If volume doubles, will we handle it?
Decision criteria:
- Latency increases > 50% during peaks → Scale up
- Error rate increases during peaks → Scale up
- Current utilization > 70% → Plan scaling
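The decision criteria above translate directly into a small checklist function. This is a sketch: the function name and parameters are hypothetical, and utilization is assumed to come from your own infrastructure monitoring, since it is not one of the metrics described on this page.

```python
def scaling_signals(baseline_latency_ms: float, peak_latency_ms: float,
                    baseline_error_rate: float, peak_error_rate: float,
                    utilization: float) -> list[str]:
    """Return the decision criteria that are currently triggered."""
    signals = []
    if peak_latency_ms > baseline_latency_ms * 1.5:
        signals.append("latency up more than 50% during peaks: scale up")
    if peak_error_rate > baseline_error_rate:
        signals.append("error rate rises during peaks: scale up")
    if utilization > 0.70:
        signals.append("utilization above 70%: plan scaling")
    return signals


# Hypothetical readings: latency 80 ms -> 130 ms at peak, errors flat, 60% utilized.
print(scaling_signals(80, 130, 0.5, 0.5, 0.60))
```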
Exporting Data
When to Export
JSON Export:
- Feeding into other analytics tools
- Programmatic processing
- Detailed analysis in Python/R
CSV Export:
- Spreadsheets and pivot tables
- Management reports
- Sharing with non-technical stakeholders
- Creating charts in Excel/Google Sheets
Export Contents
Includes:
- Time range and deployment
- All metric values
- Event type breakdowns
- Timestamp of export
Does NOT include:
- Individual event details (use Dead Letter Queue)
- User information
- Payload data
Common Analysis Patterns
Pattern 1: "Are we getting slower over time?"
Method:
- Set range to Last 7 days
- Look at the Latency Distribution chart
- Compare P50, P95, P99
- Export data weekly and track trends
Interpretation:
- All percentiles increasing = systematic slowdown
- Only P99 increasing = outlier problem
- P50 stable, P95/P99 up = inconsistent performance
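If you track percentiles week over week, the interpretation rules above can be applied mechanically. A sketch under assumptions: the function name is hypothetical, and the 10% tolerance for deciding whether a percentile "increased" is an arbitrary noise margin you should adjust.

```python
def classify_latency_trend(previous: dict, current: dict,
                           tolerance: float = 0.10) -> str:
    """Compare two {"p50", "p95", "p99"} snapshots (milliseconds).

    A percentile counts as "up" when it grew by more than `tolerance`
    (10% here, an assumed noise margin).
    """
    up = {name: current[name] > previous[name] * (1 + tolerance)
          for name in ("p50", "p95", "p99")}
    if all(up.values()):
        return "systematic slowdown"
    if up["p99"] and not up["p50"] and not up["p95"]:
        return "outlier problem"
    if not up["p50"] and up["p95"] and up["p99"]:
        return "inconsistent performance"
    return "no clear trend"


last_week = {"p50": 45, "p95": 120, "p99": 350}
this_week = {"p50": 46, "p95": 180, "p99": 600}
print(classify_latency_trend(last_week, this_week))
```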
Pattern 2: "What caused the error spike?"
Method:
- Set range to the shortest option that still covers the time of the incident
- Check Error Rate card
- Review Error Breakdown chart
- Identify which event type spiked
- Cross-reference with Dead Letter Queue
Interpretation:
- Single event type = specific handler bug
- All event types = infrastructure issue
- Correlates with deployment = code regression
Pattern 3: "When should we schedule maintenance?"
Method:
- Set range to Last 7 days
- Review Event Volume chart
- Identify lowest activity periods
- Check the error rate and Error Breakdown during those low-traffic times
Interpretation:
- Lowest volume = safest maintenance window
- Consistent low periods = predictable windows
- Avoid times with high error rates
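Finding the quietest window from hourly volume is a simple sliding-window minimum. The sketch below assumes you have averaged a week of exported data into 24 hourly values; the function name and the sample numbers are hypothetical.

```python
def quietest_window(hourly_volume: list[int], window_hours: int = 2) -> tuple[int, int]:
    """Return (start_hour, total_events) for the quietest contiguous window.

    hourly_volume: one value per hour, index 0 = 00:00-01:00.
    Windows do not wrap past midnight in this simplified version.
    """
    if not 1 <= window_hours <= len(hourly_volume):
        raise ValueError("window_hours must fit inside the data")
    totals = [(sum(hourly_volume[start:start + window_hours]), start)
              for start in range(len(hourly_volume) - window_hours + 1)]
    total, start = min(totals)  # ties resolve to the earliest start hour
    return start, total


# Hypothetical average events per hour across the last 7 days.
hourly = [300, 200, 120, 90, 110, 250, 600, 1500, 2600, 3200, 3400, 3300,
          3500, 3400, 3100, 2900, 2700, 2300, 1800, 1400, 1000, 700, 500, 400]

start_hour, events = quietest_window(hourly, window_hours=2)
print(f"Quietest 2-hour window starts at {start_hour:02d}:00 ({events} events)")
```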
Best Practices
✅ Do
- Check daily with 24-hour view
- Export weekly for trend tracking
- Compare time ranges (this week vs last week)
- Focus on trends not single data points
- Use for capacity planning before growth periods
- Document baselines ("normal" error rate, latency)
- Share reports with engineering team
❌ Don't
- Don't panic over single spikes - look for patterns
- Don't ignore gradual degradation - 10% worse per week adds up
- Don't compare ranges of different lengths - 15m and 7d use different bucket sizes, so compare like with like (this week vs last week)
- Don't forget to export before data ages out
- Don't use for real-time debugging - use Event Flow instead
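To see why gradual degradation deserves attention, compound a 10% weekly increase for a couple of months. The helper names and the 100 ms starting latency are assumptions for illustration.

```python
def projected_latency(current_ms: float, weekly_growth: float, weeks: int) -> float:
    """Latency after compounding a fixed weekly growth rate."""
    return current_ms * (1 + weekly_growth) ** weeks


def weeks_until(current_ms: float, limit_ms: float, weekly_growth: float) -> int:
    """Whole weeks until latency first reaches limit_ms."""
    if weekly_growth <= 0:
        raise ValueError("weekly_growth must be positive")
    weeks = 0
    latency = current_ms
    while latency < limit_ms:
        latency *= 1 + weekly_growth
        weeks += 1
    return weeks


# Starting at 100 ms and getting 10% worse each week:
print(f"after 8 weeks: {projected_latency(100, 0.10, 8):.0f} ms")
print(f"weeks until latency doubles: {weeks_until(100, 200, 0.10)}")
```

At that rate latency more than doubles in about two months, which is easy to miss when each weekly change looks small.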
Troubleshooting
Q: Why are my charts empty?
A:
- Check time range (too short? too long?)
- Verify deployment filter
- System may have been down during selected period
- Try different time range
Q: Error rate shows 0% but users are complaining
A: Check Latency Distribution. Events might be succeeding but taking too long.
Q: Can I see metrics from last month?
A: Currently limited to 7 days (in-memory storage). Phase 3 will add database persistence for longer history.
Q: What's the difference between P50 and average?
A: P50 (median) is more representative. Average gets skewed by outliers. If 99 events take 10ms and 1 event takes 10s, average is ~110ms but P50 is 10ms.
Q: Why does the 7-day view look different than 24-hour view?
A: Different granularity. 7-day view aggregates into hourly buckets, 24-hour view shows 15-minute buckets. Spikes may appear smoothed out in longer views.
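The smoothing effect is easy to reproduce: merging four 15-minute buckets into one hourly bucket dilutes a short burst. The data below is invented, and the helper names are hypothetical.

```python
def rebucket(counts: list[int], factor: int) -> list[int]:
    """Merge every `factor` consecutive buckets into one by summing them."""
    if len(counts) % factor:
        raise ValueError("bucket count must be a multiple of factor")
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


def peak_to_typical(counts: list[int]) -> float:
    """Ratio of the largest bucket to the (upper) median bucket."""
    ordered = sorted(counts)
    return max(counts) / ordered[len(ordered) // 2]


# Three hours of 15-minute buckets with one short burst (hypothetical data).
quarter_hourly = [100, 100, 100, 100, 100, 900, 100, 100, 100, 100, 100, 100]
hourly = rebucket(quarter_hourly, 4)

print(f"15-minute view: peak is {peak_to_typical(quarter_hourly):.0f}x typical")
print(f"hourly view:    peak is {peak_to_typical(hourly):.0f}x typical")
```

The same burst stands out as 9x the typical bucket in the 15-minute view but only 3x in the hourly view, so switch to a shorter range when you need to inspect a spike.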
Metrics Glossary
| Term | Definition | Why It Matters |
|---|---|---|
| Emitted | Event created and sent to queue | Volume indicator |
| Processed | Event handled successfully | Success metric |
| Failed | Event handler threw error | Reliability metric |
| Error Rate | Failed / Emitted × 100 | Health indicator |
| Latency | Time to process event | Performance metric |
| P50 | 50th percentile (median) | Typical experience |
| P95 | 95th percentile | SLA threshold |
| P99 | 99th percentile | Outlier detection |
| Throughput | Events per second | Capacity metric |
Related Pages
- Event System Dashboard - Current status overview
- Event Flow - Real-time event monitoring
- Dead Letter Queue - Failed event management
- System Alerts - Automated monitoring
Quick Reference Card
EVENT METRICS QUICK GUIDE
📊 Purpose: Historical analysis and trends
⏱️ Time Ranges:
• 15m - Post-incident analysis
• 1h - Recent health check
• 6h - Half-day analysis
• 24h - Daily patterns
• 7d - Weekly trends
📈 Key Metrics:
• Error Rate < 1% = Healthy
• P95 Latency < 200ms = Good
• Volume trend = Growth indicator
🎯 Analysis Patterns:
• Compare to baseline
• Look for trends, not spikes
• Export for reporting
• Focus on worst event types
📤 Export: JSON (analysis) | CSV (reports)
Remember: Event Metrics is your "historian" - use it to understand patterns, plan capacity, and report on system health over time!