Event Metrics

Last updated: March 25, 2026

What is Event Metrics?

The Event Metrics page provides historical analytics and trend analysis for your event-driven architecture. While the Event System Dashboard shows current status and Event Flow shows real-time activity, Event Metrics answers questions like:

  • "Has our error rate been increasing over the past week?"
  • "What time of day do we process the most events?"
  • "Which event types are slowing down?"
  • "How did yesterday's outage affect our metrics?"

When to Use This Page

Performance Analysis: "Are events taking longer to process than last month?"

Capacity Planning: "Do we need to scale up before the holiday season?"

Incident Investigation: "What exactly happened during yesterday's outage?"

Reporting: "Generate a weekly report on system performance"

Trend Spotting: "Are we seeing more failures on weekends?"


Understanding the Interface

Time Range Selector

Choose the period to analyze:

Range      Best For                  Data Points
Last 15m   Immediate post-incident   15 one-minute points
Last 1h    Recent trend check        60 one-minute points
Last 6h    Half-day analysis         72 five-minute points
Last 24h   Daily patterns            96 fifteen-minute points
Last 7d    Weekly trends             168 one-hour points

Tip: Longer ranges show patterns; shorter ranges show details.
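
The number of data points for each range is simply the range length divided by the bucket size. A minimal sketch of that arithmetic, assuming the bucket sizes named in the table (one-minute, five-minute, fifteen-minute, one-hour); the `RANGES` mapping and `data_points` function are illustrative names, not part of the product:

```python
from datetime import timedelta

# (range length, bucket size) for each selector option.
# Bucket sizes are taken from the table; the mapping itself is illustrative.
RANGES = {
    "15m": (timedelta(minutes=15), timedelta(minutes=1)),
    "1h": (timedelta(hours=1), timedelta(minutes=1)),
    "6h": (timedelta(hours=6), timedelta(minutes=5)),
    "24h": (timedelta(hours=24), timedelta(minutes=15)),
    "7d": (timedelta(days=7), timedelta(hours=1)),
}


def data_points(range_key: str) -> int:
    """Return how many buckets fit into the selected range."""
    length, bucket = RANGES[range_key]
    return length // bucket  # timedelta // timedelta yields an int


for key in RANGES:
    print(f"Last {key}: {data_points(key)} points")
```

Seven days of one-hour buckets works out to 168 points, which is why short-lived spikes are harder to see in the weekly view.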


Metric Cards Explained

Total Emitted

What it shows: Total number of events created in the selected time range

How to use it:

  • Baseline: Establish normal daily volume (e.g., "We typically emit 50,000 events/day")
  • Spike detection: 3x normal = unusual activity (marketing campaign? attack?)
  • Drop detection: Near zero = system outage

Example analysis:

Monday:    52,000 events  ← Normal
Tuesday:   48,000 events  ← Normal
Wednesday: 12,000 events  ← ⚠️ 75% drop - investigate!
Thursday:  155,000 events ← 🚨 3x spike - what happened?
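
If you track daily totals yourself, the same check can be scripted. The sketch below assumes a baseline of 50,000 events/day and flags anything at or above 3x the baseline as a spike. The 50% drop threshold is an assumption chosen for illustration; pick one that matches your own traffic.

```python
def classify_volume(events: int, baseline: int,
                    spike_factor: float = 3.0,
                    drop_factor: float = 0.5) -> str:
    """Label a day's volume relative to a known baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = events / baseline
    if ratio >= spike_factor:
        return "spike"
    if ratio <= drop_factor:  # assumed threshold, not a product setting
        return "drop"
    return "normal"


BASELINE = 50_000  # "typical" daily volume from the baseline example
week = {
    "Monday": 52_000,
    "Tuesday": 48_000,
    "Wednesday": 12_000,
    "Thursday": 155_000,
}

for day, count in week.items():
    print(f"{day}: {count:,} events -> {classify_volume(count, BASELINE)}")
```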

Total Processed

What it shows: Events that completed successfully

Key metric: Processed / Emitted ratio

  • 100%: Perfect - all events succeeded
  • < 95%: Concerning - many failures
  • < 90%: Critical - investigate immediately

Formula: (Total Processed / Total Emitted) × 100
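
As a rough illustration, the formula and thresholds can be expressed in a few lines of Python. The function names are hypothetical, and the "healthy" label for ratios of 95% and above is an assumption, since the thresholds only name the ranges below 95%.

```python
from typing import Optional


def success_rate(processed: int, emitted: int) -> Optional[float]:
    """(Total Processed / Total Emitted) x 100, or None if nothing was emitted."""
    if emitted == 0:
        return None  # no events in the window; the ratio is undefined
    return processed * 100 / emitted


def classify_success_rate(rate_percent: float) -> str:
    """Map a Processed / Emitted percentage onto the thresholds."""
    if rate_percent < 90:
        return "critical"
    if rate_percent < 95:
        return "concerning"
    return "healthy"  # assumed label for 95% and above


rate = success_rate(48_000, 50_000)
print(rate, classify_success_rate(rate))
```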


Total Failed

What it shows: Events that failed to process

Context matters:

  • 5 failures out of 100: 5% error rate (acceptable)
  • 5 failures out of 10: 50% error rate (critical)

Always check the ratio, not just the number!


Error Rate

What it shows: Percentage of events that failed

Color coding:

  • 🟢 Green (< 1%): Healthy system
  • 🟡 Yellow (1-5%): Warning - investigate
  • 🔴 Red (> 5%): Critical - immediate action needed
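
The color bands are easy to reproduce when post-processing exported numbers. This is a sketch only: the function names are made up, and treating exactly 1% and exactly 5% as yellow is an assumption about how the boundaries are handled.

```python
def error_rate(failed: int, emitted: int) -> float:
    """Failed / Emitted x 100."""
    if emitted == 0:
        raise ValueError("no events emitted in this window")
    return failed * 100 / emitted


def error_color(rate_percent: float) -> str:
    """Map an error rate onto the green / yellow / red bands."""
    if rate_percent < 1:
        return "green"
    if rate_percent <= 5:
        return "yellow"
    return "red"


# The same five failures mean very different things at different volumes.
for failed, emitted in [(5, 100), (5, 10)]:
    rate = error_rate(failed, emitted)
    print(f"{failed}/{emitted}: {rate:.1f}% -> {error_color(rate)}")
```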

Industry benchmarks:

  • Excellent: < 0.1%
  • Good: 0.1% - 1%
  • Acceptable: 1% - 5%
  • Poor: > 5%

When error rate spikes:

  1. Check Dead Letter Queue for error details
  2. Identify affected event types
  3. Check system status (database, external APIs)
  4. Review recent deployments

Average Latency

What it shows: Typical time to process an event, reported as the P50 (median) rather than the arithmetic mean

Why P50 (median) and not the mean?

  • Average gets skewed by outliers
  • Median shows "typical" experience
  • 50% of events faster than this, 50% slower
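
To see how much a single outlier can distort the mean, here is a small example using Python's standard `statistics` module. The sample data is invented: 99 events at 10ms and one event at 10 seconds.

```python
import statistics

# 99 fast events and a single 10-second outlier (illustrative data).
latencies_ms = [10] * 99 + [10_000]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)

print(f"mean:   {mean_ms:.1f} ms")
print(f"median: {median_ms:.1f} ms")
```

The mean lands near 110ms even though 99% of events finished in 10ms, while the median stays at 10ms.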

Performance tiers:

  • 🟢 < 50ms: Excellent
  • 🟢 50-100ms: Good
  • 🟡 100-500ms: Acceptable but watch trends
  • 🔴 > 500ms: Poor - optimization needed

Latency trends to watch:

  • Gradual increase: System degrading, needs attention
  • Sudden spike: Specific issue (database slowdown, API timeout)
  • Consistent high latency: Architectural issue

Charts Deep Dive

Event Volume Over Time

What it shows: Events per time bucket (emitted, processed, failed)

How to read it:

Normal pattern:

Volume
  │    ╱╲      ╱╲      ╱╲
  │   ╱  ╲    ╱  ╲    ╱  ╲
  │  ╱    ╲  ╱    ╲  ╱    ╲
  │ ╱      ╲╱      ╲╱      ╲
  └─────────────────────────────
    Morning  Noon  Evening

  • Regular peaks during business hours
  • Steady baseline overnight

Problem patterns:

Sudden drop to zero:

Volume
  │    ╱╲
  │   ╱  ╲
  │  ╱    ╲_______
  │ ╱              ╲_______
  └─────────────────────────────
                ↑
         System outage!

Gradual increase:

Volume
  │
  │         ╱
  │       ╱
  │     ╱
  │   ╱
  │ ╱
  └─────────────────────────────
    Growing load - scale needed

Spike then return:

Volume
  │       ╱╲
  │      ╱  ╲
  │     ╱    ╲
  │    ╱      ╲____
  │   ╱            ╲___
  └─────────────────────────────
       ↑
    Marketing campaign

Latency Distribution

What it shows: Processing time percentiles (P50, P95, P99)

Understanding percentiles:

P50 (Median):

  • Half of events process faster than this
  • "Typical" user experience
  • Good for overall health

P95:

  • 95% of events process faster than this
  • "Worst case for most users"
  • Good for SLA monitoring

P99:

  • 99% of events process faster than this
  • "Worst case for almost everyone"
  • Good for identifying outliers

Example interpretation:

P50:  45ms   ← Half of events finish in 45ms or less
P95:  120ms  ← 95% of users under 120ms (good SLA)
P99:  350ms  ← 1% of users wait 350ms+ (investigate)

When to worry:

  • P95 > 500ms: At least 1 in 20 events is noticeably slow
  • P99 > 2s: Some users having terrible experience
  • Gap between P50 and P99 growing: Inconsistent performance
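
If you want to reproduce percentiles from raw latency samples, one common approach is the nearest-rank method. The sketch below is illustrative and is not necessarily how the Latency Distribution chart computes its values; other tools interpolate between samples and can report slightly different numbers.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 100 events with latencies of 1ms, 2ms, ..., 100ms.
latencies_ms = list(range(1, 101))

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```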

Error Breakdown by Event Type

What it shows: Which event types are failing most

How to use it:

Identify problematic handlers:

1. listing.created     ████████████ 45% of errors
2. email.sent         ██████ 23% of errors
3. user.updated       ███ 12% of errors

→ Focus engineering effort on listing.created handler
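
The same ranking can be rebuilt from a list of failed events, for example from Dead Letter Queue entries. In this sketch the input is simply a list of event type names; the "other" bucket and its count are invented so that the shares add up to 100%.

```python
from collections import Counter


def error_share(failed_event_types):
    """Return (event_type, failures, share_percent) rows, worst first."""
    counts = Counter(failed_event_types)
    total = sum(counts.values())
    return [
        (event_type, n, n * 100 / total)
        for event_type, n in counts.most_common()
    ]


# Illustrative data matching the breakdown example.
failures = (
    ["listing.created"] * 45
    + ["email.sent"] * 23
    + ["user.updated"] * 12
    + ["other"] * 20  # hypothetical remainder
)

for event_type, n, share in error_share(failures):
    print(f"{event_type:<18} {n:>3} failures  {share:.0f}% of errors")
```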

Detect patterns:

  • One event type dominating errors = specific bug
  • Multiple event types failing = systemic issue (database, network)
  • New event type appearing = recently introduced bug

Prioritization:

  1. Fix high-volume, high-error events first
  2. Low-volume events can wait
  3. Consider disabling problematic event types temporarily

How to Use This Page

Daily Health Check (5 minutes)

  1. Set time range: Last 24 hours
  2. Check Error Rate: Should be < 1%
  3. Check Latency: P95 should be < 200ms
  4. Review Error Breakdown: Any new event types failing?
  5. Check Volume Chart: Normal daily pattern?

Red flags:

  • Error rate > 5%
  • P99 latency > 1 second
  • Volume chart shows unexpected drops/spikes
  • New event type in error breakdown

Weekly Analysis (15 minutes)

  1. Set time range: Last 7 days
  2. Export data (JSON or CSV)
  3. Compare to previous week:
    • Volume up/down?
    • Error rate trending?
    • Latency improving/degrading?
  4. Identify patterns:
    • Weekend vs weekday differences
    • Peak hours
    • Growth trends
  5. Report findings to team

Questions to answer:

  • Are we growing? (volume trend)
  • Is reliability improving? (error rate trend)
  • Is performance degrading? (latency trend)
  • What should we fix first? (error breakdown)
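
A week-over-week comparison mostly comes down to percent change. Here is a minimal sketch, assuming you have already pulled the headline numbers out of two weekly exports. All values and dictionary keys are made up for illustration.

```python
def pct_change(previous: float, current: float) -> float:
    """Percent change from previous to current."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) * 100 / previous


# Hypothetical weekly summaries; the keys are illustrative, not export fields.
last_week = {"emitted": 350_000, "error_rate_pct": 0.8, "p95_ms": 120}
this_week = {"emitted": 385_000, "error_rate_pct": 1.0, "p95_ms": 150}

for metric in last_week:
    change = pct_change(last_week[metric], this_week[metric])
    print(f"{metric}: {last_week[metric]} -> {this_week[metric]} ({change:+.1f}%)")
```

In this invented example volume grew 10% while P95 latency rose 25%, which would be worth raising with the team.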

Post-Incident Analysis

Scenario: The system had an outage yesterday between 2 and 3 PM

  1. Set time range: Last 24 hours
  2. Focus on incident window: Look at 2-3 PM on chart
  3. Document the impact:
    • Volume dropped to zero?
    • Error rate spiked?
    • Latency increased?
  4. Export metrics for incident report
  5. Calculate recovery time: When did metrics return to normal?

Export for incident report:

  • Screenshot of volume chart showing outage
  • Error rate during incident
  • Events affected count
  • Recovery timeline
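
Recovery time can be estimated from bucketed volume data by finding when volume fell below a threshold and when it came back. This is a sketch under stated assumptions: the 15-minute buckets, the sample volumes, and the threshold of 250 events (about half of normal) are all invented.

```python
from datetime import datetime, timedelta


def outage_window(buckets, threshold):
    """Return (start, end) for the first run of buckets below threshold.

    start is the first low bucket; end is the first bucket back at or
    above threshold. Either can be None if not found.
    """
    start = None
    for timestamp, volume in buckets:
        if volume < threshold and start is None:
            start = timestamp
        elif volume >= threshold and start is not None:
            return start, timestamp
    return start, None


# Illustrative 15-minute buckets from 13:00 to 15:45.
base = datetime(2026, 3, 24, 13, 0)
volumes = [520, 510, 530, 500, 0, 0, 0, 0, 120, 480, 515, 525]
buckets = [(base + timedelta(minutes=15 * i), v) for i, v in enumerate(volumes)]

start, end = outage_window(buckets, threshold=250)
print(f"Outage began: {start:%H:%M}")
print(f"Recovered:    {end:%H:%M}")
print(f"Duration:     {end - start}")
```

Note that the partially recovered bucket (120 events) still counts as part of the outage here, so the measured impact runs longer than the hour of zero volume.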

Capacity Planning

Question: "Do we need to scale up for holiday season?"

  1. Set time range: Last 7 days
  2. Note peak volume: What's our highest event rate?
  3. Check latency during peaks: Does performance degrade?
  4. Review error rates under load: More failures during busy times?
  5. Project growth: If volume doubles, will we handle it?

Decision criteria:

  • Latency increases > 50% during peaks → Scale up
  • Error rate increases during peaks → Scale up
  • Current utilization > 70% → Plan scaling
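
These criteria can be turned into a simple checklist function. The sketch below is illustrative: the function and parameter names are hypothetical, and utilization is assumed to come from your own infrastructure monitoring, since it is not one of the metric cards on this page.

```python
def scaling_signals(baseline_p95_ms: float, peak_p95_ms: float,
                    baseline_error_pct: float, peak_error_pct: float,
                    utilization_pct: float) -> list:
    """Return the decision criteria that the supplied numbers trigger."""
    signals = []
    if peak_p95_ms > baseline_p95_ms * 1.5:
        signals.append("latency rises more than 50% during peaks: scale up")
    if peak_error_pct > baseline_error_pct:
        signals.append("error rate rises during peaks: scale up")
    if utilization_pct > 70:
        signals.append("utilization above 70%: plan scaling")
    return signals


# Hypothetical readings for a normal hour versus the busiest hour.
for signal in scaling_signals(baseline_p95_ms=100, peak_p95_ms=160,
                              baseline_error_pct=0.5, peak_error_pct=0.5,
                              utilization_pct=60):
    print(signal)
```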

Exporting Data

When to Export

JSON Export:

  • Feeding into other analytics tools
  • Programmatic processing
  • Detailed analysis in Python/R

CSV Export:

  • Spreadsheets and pivot tables
  • Management reports
  • Sharing with non-technical stakeholders
  • Creating charts in Excel/Google Sheets

Export Contents

Includes:

  • Time range and deployment
  • All metric values
  • Event type breakdowns
  • Timestamp of export

Does NOT include:

  • Individual event details (use Dead Letter Queue)
  • User information
  • Payload data
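
For programmatic processing, a JSON export can be loaded with Python's standard `json` module. The payload below is entirely hypothetical: every field name is an assumption made for illustration, so check a real export file and adjust the keys before relying on this.

```python
import json

# Hypothetical export payload. The field names are NOT the documented
# export schema; they only mirror the kinds of content listed above.
SAMPLE_EXPORT = """
{
  "time_range": "24h",
  "deployment": "production",
  "exported_at": "2026-03-25T09:00:00Z",
  "metrics": {"emitted": 50000, "processed": 49600, "failed": 400},
  "event_types": [
    {"type": "listing.created", "failed": 180},
    {"type": "email.sent", "failed": 92}
  ]
}
"""


def summarize(export_text: str) -> dict:
    """Pull a few headline numbers out of an export document."""
    data = json.loads(export_text)
    metrics = data["metrics"]
    worst = max(data["event_types"], key=lambda row: row["failed"])
    return {
        "time_range": data["time_range"],
        "error_rate_pct": metrics["failed"] * 100 / metrics["emitted"],
        "worst_event_type": worst["type"],
    }


print(summarize(SAMPLE_EXPORT))
```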

Common Analysis Patterns

Pattern 1: "Are we getting slower over time?"

Method:

  1. Set range to Last 7 days
  2. Look at Latency Distribution chart
  3. Compare P50, P95, P99
  4. Export data weekly and track trends

Interpretation:

  • All percentiles increasing = systematic slowdown
  • Only P99 increasing = outlier problem
  • P50 stable, P95/P99 up = inconsistent performance
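
If you keep weekly exports, the interpretation rules can be applied automatically. In this sketch a percentile counts as "increasing" when it rises more than 10% week over week. That tolerance, the function name, and the sample values are all assumptions.

```python
def classify_latency_trend(previous: dict, current: dict,
                           tolerance: float = 0.10) -> str:
    """Compare two sets of P50/P95/P99 values and name the pattern."""
    rose = {
        p: current[p] > previous[p] * (1 + tolerance)
        for p in ("p50", "p95", "p99")
    }
    if all(rose.values()):
        return "systematic slowdown"
    if rose["p99"] and not rose["p50"] and not rose["p95"]:
        return "outlier problem"
    if rose["p95"] and rose["p99"] and not rose["p50"]:
        return "inconsistent performance"
    if not any(rose.values()):
        return "stable"
    return "mixed"


last_week = {"p50": 45, "p95": 120, "p99": 350}  # illustrative values in ms

print(classify_latency_trend(last_week, {"p50": 60, "p95": 160, "p99": 500}))
print(classify_latency_trend(last_week, {"p50": 46, "p95": 125, "p99": 600}))
print(classify_latency_trend(last_week, {"p50": 45, "p95": 200, "p99": 700}))
```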

Pattern 2: "What caused the error spike?"

Method:

  1. Set range to time of incident
  2. Check Error Rate card
  3. Review Error Breakdown chart
  4. Identify which event type spiked
  5. Cross-reference with Dead Letter Queue

Interpretation:

  • Single event type = specific handler bug
  • All event types = infrastructure issue
  • Correlates with deployment = code regression

Pattern 3: "When should we schedule maintenance?"

Method:

  1. Set range to Last 7 days
  2. Review Event Volume chart
  3. Identify lowest activity periods
  4. Check Error Breakdown for low-traffic times

Interpretation:

  • Lowest volume = safest maintenance window
  • Consistent low periods = predictable windows
  • Avoid times with high error rates
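
Finding the quietest window from hourly volume data is a small sliding-window problem. The sketch below assumes you have averaged the 7-day export into 24 hour-of-day values. The numbers are invented, and the window wraps around midnight.

```python
def quietest_window(hourly_volume, window_hours=2):
    """Return (start_hour, total_events) for the lowest-volume window.

    hourly_volume holds one value per hour of day, starting at 00:00.
    The window wraps past midnight, so 23:00-01:00 is a valid candidate.
    """
    n = len(hourly_volume)
    totals = [
        sum(hourly_volume[(start + i) % n] for i in range(window_hours))
        for start in range(n)
    ]
    start = min(range(n), key=totals.__getitem__)
    return start, totals[start]


# Illustrative average events per hour, 00:00 through 23:00.
hourly = [300, 200, 120, 90, 110, 250, 600, 1500, 2800, 3200, 3400, 3600,
          3500, 3300, 3100, 2900, 2600, 2200, 1800, 1400, 1000, 700, 500, 400]

start, total = quietest_window(hourly, window_hours=2)
print(f"Quietest 2-hour window starts at {start:02d}:00 ({total} events)")
```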

Best Practices

✅ Do

  • Check daily with 24-hour view
  • Export weekly for trend tracking
  • Compare time ranges (this week vs last week)
  • Focus on trends not single data points
  • Use for capacity planning before growth periods
  • Document baselines ("normal" error rate, latency)
  • Share reports with engineering team

❌ Don't

  • Don't panic over single spikes - look for patterns
  • Don't ignore gradual degradation - 10% worse per week adds up
  • Don't compare views of different lengths directly - 15m and 7d use different bucket sizes and scales
  • Don't forget to export before data ages out
  • Don't use for real-time debugging - use Event Flow instead
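
To put "10% worse per week adds up" into numbers: degradation compounds, so a steady 10% weekly increase more than doubles a value in eight weeks. A quick illustration, with a made-up starting latency of 100ms:

```python
def compounded(start_value: float, weekly_growth: float, weeks: int) -> float:
    """Value after compounding a fixed weekly growth rate."""
    return start_value * (1 + weekly_growth) ** weeks


for weeks in (1, 4, 8):
    print(f"after {weeks} week(s): {compounded(100, 0.10, weeks):.1f} ms")
```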

Troubleshooting

Q: Why are my charts empty?

A:

  • Check time range (too short? too long?)
  • Verify deployment filter
  • System may have been down during selected period
  • Try different time range

Q: Error rate shows 0% but users are complaining

A: Check Latency Distribution. Events might be succeeding but taking too long.

Q: Can I see metrics from last month?

A: Currently limited to 7 days (in-memory storage). Phase 3 will add database persistence for longer history.

Q: What's the difference between P50 and average?

A: P50 (median) is more representative. Average gets skewed by outliers. If 99 events take 10ms and 1 event takes 10s, the average is ~110ms but P50 is 10ms.

Q: Why does the 7-day view look different than 24-hour view?

A: Different granularity. 7-day view aggregates into hourly buckets, 24-hour view shows 15-minute buckets. Spikes may appear smoothed out in longer views.
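
The smoothing effect is easy to demonstrate. In the sketch below, one 15-minute bucket has a 20% error rate, but once the four buckets are combined into a single hour the rate reads as roughly 5%. The counts are invented, and the exact way the page aggregates buckets is an assumption.

```python
def error_rate(failed: int, emitted: int) -> float:
    """Failed / Emitted x 100."""
    return failed * 100 / emitted


# Four illustrative 15-minute buckets as (emitted, failed).
quarter_hours = [(1000, 5), (1000, 5), (1000, 200), (1000, 5)]

fine_grained = [error_rate(failed, emitted) for emitted, failed in quarter_hours]

hour_emitted = sum(emitted for emitted, _ in quarter_hours)
hour_failed = sum(failed for _, failed in quarter_hours)
hourly = error_rate(hour_failed, hour_emitted)

print("15-minute buckets:", fine_grained)
print("1-hour bucket:    ", hourly)
```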


Metrics Glossary

Term         Definition                        Why It Matters
Emitted      Event created and sent to queue   Volume indicator
Processed    Event handled successfully        Success metric
Failed       Event handler threw error         Reliability metric
Error Rate   Failed / Emitted × 100            Health indicator
Latency      Time to process event             Performance metric
P50          50th percentile (median)          Typical experience
P95          95th percentile                   SLA threshold
P99          99th percentile                   Outlier detection
Throughput   Events per second                 Capacity metric

Quick Reference Card

EVENT METRICS QUICK GUIDE

📊 Purpose: Historical analysis and trends

⏱️ Time Ranges:
• 15m - Post-incident analysis
• 1h  - Recent health check
• 6h  - Half-day analysis
• 24h - Daily patterns
• 7d  - Weekly trends

📈 Key Metrics:
• Error Rate < 1% = Healthy
• P95 Latency < 200ms = Good
• Volume trend = Growth indicator

🎯 Analysis Patterns:
• Compare to baseline
• Look for trends, not spikes
• Export for reporting
• Focus on worst event types

📤 Export: JSON (analysis) | CSV (reports)

Remember: Event Metrics is your "historian" - use it to understand patterns, plan capacity, and report on system health over time!
