Event Metrics

Last updated: March 25, 2026

What is Event Metrics?

The Event Metrics page provides historical analytics and trend analysis for your event-driven architecture. While the Event System Dashboard shows current status and Event Flow shows real-time activity, Event Metrics answers questions like:

"Has our error rate been increasing over the past week?"
"What time of day do we process the most events?"
"Which event types are slowing down?"
"How did yesterday's outage affect our metrics?"

When to Use This Page

Performance Analysis: "Are events taking longer to process than last month?"

Capacity Planning: "Do we need to scale up before the holiday season?"

Incident Investigation: "What exactly happened during yesterday's outage?"

Reporting: "Generate a weekly report on system performance"

Trend Spotting: "Are we seeing more failures on weekends?"

Understanding the Interface

Time Range Selector

Choose the period to analyze:

Range	Best For	Data Points
Last 15m	Immediate post-incident	15 one-minute points
Last 1h	Recent trend check	60 one-minute points
Last 6h	Half-day analysis	72 five-minute points
Last 24h	Daily patterns	96 fifteen-minute points
Last 7d	Weekly trends	168 one-hour points

Tip: Longer ranges show patterns; shorter ranges show details.
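The data point counts follow from dividing the range by the bucket size. A minimal sketch of that arithmetic, where `data_points` is a hypothetical helper and not part of the product:

```python
from datetime import timedelta


def data_points(time_range: timedelta, bucket: timedelta) -> int:
    """Number of whole buckets of size `bucket` that fit in `time_range`."""
    return time_range // bucket


points_6h = data_points(timedelta(hours=6), timedelta(minutes=5))     # 72
points_24h = data_points(timedelta(hours=24), timedelta(minutes=15))  # 96
```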

Metric Cards Explained

Total Emitted

What it shows: Total number of events created in the selected time range

How to use it:

Baseline: Establish normal daily volume (e.g., "We typically emit 50,000 events/day")
Spike detection: 3x normal = unusual activity (marketing campaign? attack?)
Drop detection: Near zero = system outage

Example analysis:

Monday:    52,000 events  ← Normal
Tuesday:   48,000 events  ← Normal
Wednesday: 12,000 events  ← ⚠️ 75% drop - investigate!
Thursday:  155,000 events ← 🚨 3x spike - what happened?
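The same check can be scripted against exported daily totals. This is a rough sketch: the 50,000 baseline and the 3x spike rule come from the text above, while the 50% drop threshold and the function name are assumptions chosen for illustration.

```python
def classify_volume(events: int, baseline: int,
                    spike_factor: float = 3.0,
                    drop_fraction: float = 0.5) -> str:
    """Label a daily total relative to the baseline volume."""
    if events >= baseline * spike_factor:
        return "spike"
    if events <= baseline * drop_fraction:  # assumed threshold, tune to taste
        return "drop"
    return "normal"


week = {"Monday": 52_000, "Tuesday": 48_000,
        "Wednesday": 12_000, "Thursday": 155_000}
labels = {day: classify_volume(count, baseline=50_000)
          for day, count in week.items()}
```

With these inputs, Wednesday is labeled a drop and Thursday a spike, matching the annotations in the example.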

Total Processed

What it shows: Events that completed successfully

Key metric: Processed / Emitted ratio

100%: Perfect - all events succeeded
< 95%: Concerning - many failures
< 90%: Critical - investigate immediately

Formula: (Total Processed / Total Emitted) × 100
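As a quick sketch, the formula and thresholds above could be applied like this. The function names are hypothetical, and the "healthy" label for the 95% and above band is an assumption, since the page only names the bands below 95%.

```python
def success_rate(processed: int, emitted: int) -> float:
    """(Total Processed / Total Emitted) x 100."""
    if emitted <= 0:
        raise ValueError("emitted must be positive")
    return processed * 100 / emitted


def classify_success(rate_percent: float) -> str:
    if rate_percent < 90:
        return "critical"
    if rate_percent < 95:
        return "concerning"
    return "healthy"  # assumed label for 95% and above


rate = success_rate(47_000, 50_000)  # 94.0
status = classify_success(rate)      # "concerning"
```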

Total Failed

What it shows: Events that failed to process

Context matters:

5 failures out of 100: 5% error rate (acceptable)
5 failures out of 10: 50% error rate (critical)

Always check the ratio, not just the number!

Error Rate

What it shows: Percentage of events that failed

Color coding:

🟢 Green (< 1%): Healthy system
🟡 Yellow (1-5%): Warning - investigate
🔴 Red (> 5%): Critical - immediate action needed
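The color bands can be reproduced from raw counts, which also shows why the ratio matters more than the failure count. This is an illustrative sketch; treating exactly 5% as yellow is an assumption about the boundary.

```python
def error_rate_percent(failed: int, emitted: int) -> float:
    """Failed / Emitted x 100."""
    if emitted <= 0:
        raise ValueError("emitted must be positive")
    return failed * 100 / emitted


def error_color(rate_percent: float) -> str:
    if rate_percent < 1:
        return "green"
    if rate_percent <= 5:  # assumed: the 5% boundary counts as yellow
        return "yellow"
    return "red"


# The same five failures look very different at different volumes.
small_volume = error_color(error_rate_percent(5, 10))     # 50%  -> "red"
large_volume = error_color(error_rate_percent(5, 1_000))  # 0.5% -> "green"
```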

Industry benchmarks:

Excellent: < 0.1%
Good: 0.1% - 1%
Acceptable: 1% - 5%
Poor: > 5%

When error rate spikes:

Check Dead Letter Queue for error details
Identify affected event types
Check system status (database, external APIs)
Review recent deployments

Average Latency

What it shows: Typical time to process an event. Despite the card's name, the value is the P50 (median), not the arithmetic mean.

Why P50 (median) rather than the mean?

Average gets skewed by outliers
Median shows "typical" experience
50% of events faster than this, 50% slower

Performance tiers:

🟢 < 50ms: Excellent
🟢 50-100ms: Good
🟡 100-500ms: Acceptable but watch trends
🔴 > 500ms: Poor - optimization needed

Latency trends to watch:

Gradual increase: System degrading, needs attention
Sudden spike: Specific issue (database slowdown, API timeout)
Consistent high latency: Architectural issue

Charts Deep Dive

Event Volume Over Time

What it shows: Events per time bucket (emitted, processed, failed)

How to read it:

Normal pattern:

Volume
  │    ╱╲      ╱╲      ╱╲
  │   ╱  ╲    ╱  ╲    ╱  ╲
  │  ╱    ╲  ╱    ╲  ╱    ╲
  │ ╱      ╲╱      ╲╱      ╲
  └─────────────────────────────
    Morning  Noon  Evening

Regular peaks during business hours
Steady baseline overnight

Problem patterns:

Sudden drop to zero:

Volume
  │    ╱╲
  │   ╱  ╲
  │  ╱    ╲_______
  │ ╱              ╲_______
  └─────────────────────────────
                ↑
         System outage!

Gradual increase:

Volume
  │
  │         ╱
  │       ╱
  │     ╱
  │   ╱
  │ ╱
  └─────────────────────────────
    Growing load - scale needed

Spike then return:

Volume
  │       ╱╲
  │      ╱  ╲
  │     ╱    ╲
  │    ╱      ╲____
  │   ╱            ╲___
  └─────────────────────────────
       ↑
    Marketing campaign

Latency Distribution

What it shows: Processing time percentiles (P50, P95, P99)

Understanding percentiles:

P50 (Median):

Half of events process faster than this
"Typical" user experience
Good for overall health

P95:

95% of events process faster than this
"Worst case for most users"
Good for SLA monitoring

P99:

99% of events process faster than this
"Worst case for almost everyone"
Good for identifying outliers

Example interpretation:

P50:  45ms   ← Most users have fast experience
P95:  120ms  ← 95% of users under 120ms (good SLA)
P99:  350ms  ← 1% of users wait 350ms+ (investigate)
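To compute the same percentiles from raw latencies (for example, from your own logs), Python's standard library is enough. This sketch uses the "inclusive" interpolation method; the dashboard's exact percentile method is not documented here, so small differences are possible.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return P50, P95 and P99 from a list of latencies in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


sample = list(range(1, 101))  # hypothetical latencies: 1ms .. 100ms
result = latency_percentiles(sample)
```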

When to worry:

P95 > 500ms: At least 1 in 20 events takes longer than half a second
P99 > 2s: Some users having terrible experience
Gap between P50 and P99 growing: Inconsistent performance

Error Breakdown by Event Type

What it shows: Which event types are failing most

How to use it:

Identify problematic handlers:

1. listing.created     ████████████ 45% of errors
2. email.sent         ██████ 23% of errors
3. user.updated       ███ 12% of errors

→ Focus engineering effort on listing.created handler
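A breakdown like this can be rebuilt from a list of failed event types. In this sketch, the input list is hypothetical sample data (including the `other` bucket), and `error_shares` is an illustrative helper.

```python
from collections import Counter


def error_shares(failed_event_types: list[str]) -> list[tuple[str, float]]:
    """Share of total errors per event type, largest first."""
    counts = Counter(failed_event_types)
    total = sum(counts.values())
    return [(event_type, count * 100 / total)
            for event_type, count in counts.most_common()]


failures = (["listing.created"] * 45 + ["email.sent"] * 23
            + ["other"] * 20 + ["user.updated"] * 12)
shares = error_shares(failures)
top_offender = shares[0][0]  # "listing.created"
```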

Detect patterns:

One event type dominating errors = specific bug
Multiple event types failing = systemic issue (database, network)
New event type appearing = recently introduced bug

Prioritization:

Fix high-volume, high-error events first
Low-volume events can wait
Consider disabling problematic event types temporarily

How to Use This Page

Daily Health Check (5 minutes)

Set time range: Last 24 hours
Check Error Rate: Should be < 1%
Check Latency: P95 should be < 200ms
Review Error Breakdown: Any new event types failing?
Check Volume Chart: Normal daily pattern?

Red flags:

Error rate > 5%
P99 latency > 1 second
Volume chart shows unexpected drops/spikes
New event type in error breakdown
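The numeric parts of this checklist lend themselves to a small script. This is a sketch only: `DailySnapshot` and its field names are assumptions, and the values would have to be filled in by hand or from an export.

```python
from dataclasses import dataclass


@dataclass
class DailySnapshot:
    error_rate_percent: float
    p95_ms: float
    p99_ms: float


def daily_findings(s: DailySnapshot) -> list[str]:
    """Apply the daily targets and red-flag thresholds listed above."""
    findings = []
    if s.error_rate_percent > 5:
        findings.append("RED FLAG: error rate above 5%")
    elif s.error_rate_percent >= 1:
        findings.append("WARN: error rate not below 1%")
    if s.p99_ms > 1000:
        findings.append("RED FLAG: P99 latency above 1 second")
    if s.p95_ms >= 200:
        findings.append("WARN: P95 latency not below 200ms")
    return findings


healthy = daily_findings(DailySnapshot(0.4, 120, 350))     # []
degraded = daily_findings(DailySnapshot(6.2, 240, 1_400))  # three findings
```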

Weekly Analysis (15 minutes)

Set time range: Last 7 days
Export data (JSON or CSV)
Compare to previous week:
- Volume up/down?
- Error rate trending?
- Latency improving/degrading?
Identify patterns:
- Weekend vs weekday differences
- Peak hours
- Growth trends
Report findings to team

Questions to answer:

Are we growing? (volume trend)
Is reliability improving? (error rate trend)
Is performance degrading? (latency trend)
What should we fix first? (error breakdown)
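Week-over-week comparison comes down to a percent change per metric. A minimal sketch, assuming you have noted last week's and this week's values yourself; the numbers below are made up.

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from previous to current, in percent."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) * 100 / previous


last_week = {"volume": 350_000, "error_rate": 0.8, "p95_ms": 180}
this_week = {"volume": 385_000, "error_rate": 1.0, "p95_ms": 171}

changes = {name: percent_change(last_week[name], this_week[name])
           for name in last_week}
# volume is up 10%, error rate is up about 25%, P95 is down 5%
```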

Post-Incident Analysis

Scenario: The system had an outage yesterday from 2 to 3 PM

Set time range: Last 24 hours
Focus on incident window: Look at 2-3 PM on chart
Document the impact:
- Volume dropped to zero?
- Error rate spiked?
- Latency increased?
Export metrics for incident report
Calculate recovery time: When did metrics return to normal?

Export for incident report:

Screenshot of volume chart showing outage
Error rate during incident
Events affected count
Recovery timeline
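Recovery time can be estimated from the volume series. The sketch below assumes 15-minute buckets as `(timestamp, count)` pairs and treats "recovered" as the first bucket back at 90% of baseline after a drop; both the data and that threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta


def recovery_time(buckets, incident_start, baseline, recovered_fraction=0.9):
    """Time from incident_start to the first bucket back near baseline."""
    dropped = False
    for timestamp, volume in buckets:
        if timestamp < incident_start:
            continue
        if volume < baseline * recovered_fraction:
            dropped = True
        elif dropped:
            return timestamp - incident_start
    return None  # never dropped, or never recovered within the data


start = datetime(2026, 3, 24, 13, 45)
volumes = [520, 0, 0, 0, 40, 510, 530]  # hypothetical events per bucket
buckets = [(start + timedelta(minutes=15 * i), v)
           for i, v in enumerate(volumes)]

elapsed = recovery_time(buckets, datetime(2026, 3, 24, 14, 0), baseline=500)
```

Here `elapsed` is one hour: volume drops at 14:00 and the 15:00 bucket is the first one back above 450 events.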

Capacity Planning

Question: "Do we need to scale up for holiday season?"

Set time range: Last 7 days
Note peak volume: What's our highest event rate?
Check latency during peaks: Does performance degrade?
Review error rates under load: More failures during busy times?
Project growth: If volume doubles, will we handle it?

Decision criteria:

Latency increases > 50% during peaks → Scale up
Error rate increases during peaks → Scale up
Current utilization > 70% → Plan scaling
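The decision criteria can be written down as a simple rule check. This is a sketch under assumptions: the parameter names are hypothetical, and utilization is not a metric shown on this page, so it would need to come from your infrastructure monitoring.

```python
def scaling_recommendations(baseline_p95_ms: float, peak_p95_ms: float,
                            baseline_error_rate: float, peak_error_rate: float,
                            utilization_percent: float) -> list[str]:
    """Apply the three decision criteria listed above."""
    reasons = []
    if peak_p95_ms > baseline_p95_ms * 1.5:
        reasons.append("scale up: latency rises more than 50% during peaks")
    if peak_error_rate > baseline_error_rate:
        reasons.append("scale up: error rate rises during peaks")
    if utilization_percent > 70:
        reasons.append("plan scaling: utilization above 70%")
    return reasons


advice = scaling_recommendations(
    baseline_p95_ms=120, peak_p95_ms=200,
    baseline_error_rate=0.5, peak_error_rate=0.5,
    utilization_percent=65,
)
```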

Exporting Data

When to Export

JSON Export:

Feeding into other analytics tools
Programmatic processing
Detailed analysis in Python/R

CSV Export:

Spreadsheets and pivot tables
Management reports
Sharing with non-technical stakeholders
Creating charts in Excel/Google Sheets

Export Contents

Includes:

Time range and deployment
All metric values
Event type breakdowns
Timestamp of export

Does NOT include:

Individual event details (use Dead Letter Queue)
User information
Payload data

Common Analysis Patterns

Pattern 1: "Are we getting slower over time?"

Method:

Set range to Last 7 days
Look at the Latency Distribution chart
Compare P50, P95, P99
Export data weekly and track trends

Interpretation:

All percentiles increasing = systematic slowdown
Only P99 increasing = outlier problem
P50 stable, P95/P99 up = inconsistent performance
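If you track exported percentiles week to week, the interpretation rules can be encoded roughly as follows. The 10% tolerance for calling a percentile "increasing" is an assumption, as are the function and key names.

```python
def interpret_latency_trend(previous: dict, current: dict,
                            tolerance: float = 0.10) -> str:
    """Compare two sets of P50/P95/P99 values (in ms) and name the pattern."""
    rose = {name: current[name] > previous[name] * (1 + tolerance)
            for name in ("p50", "p95", "p99")}
    if all(rose.values()):
        return "systematic slowdown"
    if rose["p99"] and not rose["p50"] and not rose["p95"]:
        return "outlier problem"
    if rose["p95"] and rose["p99"] and not rose["p50"]:
        return "inconsistent performance"
    return "no clear pattern"


last_week = {"p50": 45, "p95": 120, "p99": 350}
verdict = interpret_latency_trend(last_week,
                                  {"p50": 46, "p95": 125, "p99": 600})
```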

Pattern 2: "What caused the error spike?"

Method:

Set range to time of incident
Check Error Rate card
Review Error Breakdown chart
Identify which event type spiked
Cross-reference with Dead Letter Queue

Interpretation:

Single event type = specific handler bug
All event types = infrastructure issue
Correlates with deployment = code regression

Pattern 3: "When should we schedule maintenance?"

Method:

Set range to Last 7 days
Review Event Volume chart
Identify lowest activity periods
Check the error rate during those low-traffic periods

Interpretation:

Lowest volume = safest maintenance window
Consistent low periods = predictable windows
Avoid times with high error rates
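Finding the quietest hour from a week of hourly buckets is a small grouping exercise. The sketch below uses synthetic data and a hypothetical `quietest_hour` helper; real input would come from an export.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def quietest_hour(hourly_buckets) -> int:
    """Hour of day (0-23) with the lowest total volume across all days."""
    totals = defaultdict(int)
    for timestamp, count in hourly_buckets:
        totals[timestamp.hour] += count
    return min(totals, key=totals.get)


def synthetic_volume(hour: int) -> int:
    # Made-up pattern: busy office hours, quiet nights, lowest at 03:00.
    if hour == 3:
        return 50
    return 1_000 if 9 <= hour < 18 else 200


week_start = datetime(2026, 3, 16)
buckets = [(week_start + timedelta(hours=h), synthetic_volume(h % 24))
           for h in range(7 * 24)]
best_hour = quietest_hour(buckets)  # 3
```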

Best Practices

✅ Do

Check daily with 24-hour view
Export weekly for trend tracking
Compare time ranges (this week vs last week)
Focus on trends not single data points
Use for capacity planning before growth periods
Document baselines ("normal" error rate, latency)
Share reports with engineering team

❌ Don't

Don't panic over single spikes - look for patterns
Don't ignore gradual degradation - 10% worse per week adds up
Don't compare ranges of different lengths - 15m and 7d use different bucket sizes and scales
Don't forget to export before data ages out
Don't use for real-time debugging - use Event Flow instead
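To see why gradual degradation matters, here is the compounding arithmetic as a sketch. It assumes the 10% weekly increase stays constant, which is a simplification.

```python
def compounded_factor(weekly_increase_percent: float, weeks: int) -> float:
    """Overall growth factor after `weeks` of a steady weekly increase."""
    return (1 + weekly_increase_percent / 100) ** weeks


after_quarter = compounded_factor(10, 12)  # roughly 3.14
```

Under that assumption, a P95 of 100ms would reach roughly 314ms after 12 weeks.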

Troubleshooting

Q: Why are my charts empty?

A: Work through these checks:

Check the time range (too short? too long?) and try a different one
Verify the deployment filter
Confirm the system was running during the selected period

Q: Error rate shows 0% but users are complaining

A: Check Latency Distribution. Events might be succeeding but taking too long.

Q: Can I see metrics from last month?

A: Currently limited to 7 days (in-memory storage). Phase 3 will add database persistence for longer history.

Q: What's the difference between P50 and average?

A: P50 (median) is more representative. Average gets skewed by outliers. If 99 events take 10ms and 1 event takes 10s, the average is ~110ms but P50 is 10ms.
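The example can be checked with Python's standard library:

```python
import statistics

# 99 events at 10ms plus a single 10-second outlier
latencies_ms = [10] * 99 + [10_000]

mean_ms = statistics.mean(latencies_ms)      # 109.9
median_ms = statistics.median(latencies_ms)  # 10
```

One outlier moves the mean by an order of magnitude while the median stays at 10ms.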

Q: Why does the 7-day view look different than 24-hour view?

A: Different granularity. 7-day view aggregates into hourly buckets, 24-hour view shows 15-minute buckets. Spikes may appear smoothed out in longer views.
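A small illustration of the smoothing effect, assuming buckets are combined by summing event counts (the dashboard's actual aggregation method is not specified here) and using made-up numbers:

```python
def rebucket(counts: list[int], factor: int) -> list[int]:
    """Merge every `factor` consecutive buckets into one by summing."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


fifteen_min = [1_000] * 8
fifteen_min[5] = 4_000           # one 15-minute bucket with a 4x spike

hourly = rebucket(fifteen_min, 4)                   # [4000, 7000]
ratio_15m = max(fifteen_min) / min(fifteen_min)     # 4.0
ratio_hourly = max(hourly) / min(hourly)            # 1.75
```

The same spike that stands out at 4x in 15-minute buckets shows up as only 1.75x once it is merged into an hourly bucket.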

Metrics Glossary

Term	Definition	Why It Matters
Emitted	Event created and sent to queue	Volume indicator
Processed	Event handled successfully	Success metric
Failed	Event handler threw error	Reliability metric
Error Rate	Failed / Emitted × 100	Health indicator
Latency	Time to process event	Performance metric
P50	50th percentile (median)	Typical experience
P95	95th percentile	SLA threshold
P99	99th percentile	Outlier detection
Throughput	Events per second	Capacity metric

Related Pages

Event System Dashboard - Current status overview
Event Flow - Real-time event monitoring
Dead Letter Queue - Failed event management
System Alerts - Automated monitoring

Quick Reference Card

EVENT METRICS QUICK GUIDE

📊 Purpose: Historical analysis and trends

⏱️ Time Ranges:
• 15m - Post-incident analysis
• 1h  - Recent health check
• 6h  - Half-day analysis
• 24h - Daily patterns
• 7d  - Weekly trends

📈 Key Metrics:
• Error Rate < 1% = Healthy
• P95 Latency < 200ms = Good
• Volume trend = Growth indicator

🎯 Analysis Patterns:
• Compare to baseline
• Look for trends, not spikes
• Export for reporting
• Focus on worst event types

📤 Export: JSON (analysis) | CSV (reports)

Remember: Event Metrics is your "historian" - use it to understand patterns, plan capacity, and report on system health over time!
