Dead Letter Queue
What is the Dead Letter Queue?
The Dead Letter Queue (DLQ) is a holding area for events that failed to process successfully. Think of it like a "problem bin" where failed events go so they don't get lost - you can review them, fix issues, and retry them later.
Why Events Fail
Events end up in the DLQ when:
- Database is temporarily down - Event can't save data
- External API fails - Email service, QuickBooks, or other integrations time out
- Code bugs - Event handler crashes due to unexpected data
- Network issues - Can't reach required services
- Resource exhaustion - System too busy to process
The DLQ Lifecycle
Event Emitted → Processing Attempt → FAILED → Dead Letter Queue
From the Dead Letter Queue, each event takes one of three paths:
- [RETRY]: Try processing again (after fixing issue)
- [RESOLVE]: Mark as handled (won't retry)
- [DISCARD]: Delete permanently (data loss)
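To make the three paths concrete, here is a minimal sketch of the lifecycle as a small in-memory model. The class and field names (`DeadLetter`, `DeadLetterQueue`, `Status`) are hypothetical and chosen for illustration only. The sketch also assumes that a successful retry removes the event from the queue, which this page does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    FAILED = "failed"      # in the queue and needs attention
    RESOLVED = "resolved"  # kept for the record, will not be retried


@dataclass(eq=False)  # compare entries by identity, not by field values
class DeadLetter:
    event_type: str
    error: str
    retries: int = 0
    status: Status = Status.FAILED


class DeadLetterQueue:
    """Hypothetical model of the Retry / Resolve / Discard actions."""

    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def retry(self, entry, handler):
        """Run the handler again; return True if processing succeeded."""
        entry.retries += 1
        try:
            handler(entry)
        except Exception as exc:
            entry.error = str(exc)  # still failing: keep it queued
            return False
        self.entries.remove(entry)  # assumption: success leaves the queue
        return True

    def resolve(self, entry):
        """Mark as handled; the record stays in the queue."""
        entry.status = Status.RESOLVED

    def discard(self, entry):
        """Delete permanently; the record is gone."""
        self.entries.remove(entry)
```

The point of the model is the difference between the last two methods: `resolve` only changes a flag, while `discard` removes the record entirely.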
Understanding the Interface
Main Table Columns
| Column | What It Means | Example |
|---|---|---|
| Event Type | What kind of event failed | listing.created, email.sent |
| Error | Why it failed (short version) | Connection timeout, Null pointer |
| Listener | Which handler failed | EmailNotificationListener |
| Retries | How many times we tried | 2/3 (2 attempts, 3 max) |
| Created | When the event originally happened | Mar 25, 2026 2:30 PM |
| Actions | What you can do | Retry, Resolve, Discard |
Status Indicators
- 🔴 Red badge: Failed and unresolved (needs attention)
- 🟡 Yellow badge: Failed but retry scheduled
- 🟢 Green badge: Resolved (manually marked as handled)
How to Use This Page
Scenario 1: Email Service Was Down
Problem: External email API was down for 30 minutes, and 50 email.sent events failed.
Solution:
- Go to Dead Letter Queue
- Filter by Event Type: email.sent
- Select all 50 events (checkboxes)
- Click "Retry Selected"
- Confirm in dialog
- Events will be re-processed (emails will send)
Result: ✅ All emails sent successfully
Scenario 2: Database Connection Issue
Problem: Database was restarting, multiple event types failed.
Solution:
- Check that database is back online
- Go to Dead Letter Queue
- Filter by "Unresolved Only"
- Click "Retry All Unresolved" button
- Monitor Event System Dashboard for error rate drop
Result: ✅ Events process normally now
Scenario 3: Bug in Event Handler
Problem: listing.created events failing with TypeError: Cannot read property 'id' of null
Solution:
- Don't retry yet! - It will just fail again
- Click on failed event to see full error
- Note the error pattern
- Report to engineering with:
  - Event type: listing.created
  - Error: TypeError: Cannot read property 'id' of null
  - Count: 23 events affected
  - Time range: Last 2 hours
- Wait for code fix
- After fix deployed, retry failed events
Result: ✅ Events process successfully after bug fix
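If you have the failed events available as data (for example from an export), the report fields in this scenario can be gathered automatically. The sketch below is illustrative only: the dictionary keys `event_type`, `error`, and `created` are assumed names, not a documented format.

```python
from collections import defaultdict
from datetime import datetime


def summarize_failures(events):
    """Group failures by (event type, error) and report count and time range."""
    groups = defaultdict(list)
    for ev in events:
        groups[(ev["event_type"], ev["error"])].append(ev["created"])

    report = []
    for (event_type, error), times in groups.items():
        report.append({
            "event_type": event_type,
            "error": error,
            "count": len(times),
            "first_seen": min(times),
            "last_seen": max(times),
        })
    # Largest groups first, so the dominant pattern tops the report
    return sorted(report, key=lambda row: row["count"], reverse=True)


# Made-up sample data for illustration
events = [
    {"event_type": "listing.created", "error": "TypeError",
     "created": datetime(2026, 3, 25, 12, 30)},
    {"event_type": "listing.created", "error": "TypeError",
     "created": datetime(2026, 3, 25, 14, 30)},
    {"event_type": "email.sent", "error": "ETIMEDOUT",
     "created": datetime(2026, 3, 25, 13, 0)},
]
for row in summarize_failures(events):
    print(row["event_type"], row["error"], row["count"])
```

Each row carries the four things engineering needs: event type, error, count, and the time range between `first_seen` and `last_seen`.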
Scenario 4: One-Time Data Issue
Problem: Single event failed due to bad data that can't be fixed.
Solution:
- Review event details (click to expand)
- Confirm it's an isolated issue
- Click "Resolve" to mark as handled
- Event stays in queue but marked resolved
Result: ✅ Queue cleaned up, event won't retry
Action Buttons Explained
🔁 Retry
What it does: Attempts to process the event again
When to use:
- Temporary issue resolved (database back up, API restored)
- Code bug fixed
- Network connectivity restored
When NOT to use:
- Issue not fixed yet (will just fail again)
- Data is permanently bad
Bulk retry: Select multiple events with checkboxes, then "Retry Selected"
✓ Resolve
What it does: Marks event as "handled" - won't retry, stays in queue for record
When to use:
- One-time data issue that can't be fixed
- Event is no longer relevant (expired, superseded)
- You've handled the issue manually outside the system
When NOT to use:
- Issue is fixable (use Retry instead)
- You want to delete the record (use Discard)
🗑️ Discard
What it does: Permanently deletes the event from the queue
⚠️ WARNING: This is irreversible! Data is lost.
When to use:
- Confirmed the event is garbage (test data, duplicate)
- Storage space concerns (very large queue)
- Privacy/GDPR compliance (must delete)
When NOT to use:
- You might need the event later (use Resolve instead)
- Not sure what the event is
- Production events (unless confirmed safe)
Filtering and Search
Filter Options
Event Type: Show only specific events
- Example: listing.created to see only listing creation failures
Date Range: Events from specific time period
- Useful for: "Show me yesterday's failures"
Retry Count: Events by number of attempts
- High retry count = persistent issue
Unresolved Only: Hide already-resolved events
- Default view - shows what needs attention
Search
Search by:
- Error message text
- Trace ID (if you have it from logs)
- Listener name
Example searches:
- timeout - Find all timeout errors
- QuickBooks - Find QuickBooks integration failures
- trace-abc-123 - Find specific event by trace ID
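As a rough mental model of how search behaves, the sketch below matches a term against the three searchable fields. It assumes case-insensitive substring matching and the field names `error`, `trace_id`, and `listener`. Both are assumptions for illustration, and the actual search may behave differently.

```python
def search(entries, term):
    """Return entries whose error, trace ID, or listener contains the term."""
    needle = term.lower()
    fields = ("error", "trace_id", "listener")
    return [
        entry for entry in entries
        if any(needle in str(entry.get(field, "")).lower() for field in fields)
    ]


# Made-up sample entries for illustration
entries = [
    {"error": "Connection timeout", "trace_id": "trace-abc-123",
     "listener": "EmailNotificationListener"},
    {"error": "500 Internal Server Error", "trace_id": "trace-def-456",
     "listener": "QuickBooksSyncListener"},
]
timeouts = search(entries, "timeout")
quickbooks = search(entries, "QuickBooks")
```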
Reading Error Messages
Common Errors and Solutions
| Error | Meaning | Solution |
|---|---|---|
| ECONNREFUSED | Can't connect to service | Check if database/external API is up |
| ETIMEDOUT | Connection timed out | Service slow/overloaded, retry later |
| 404 Not Found | Resource doesn't exist | Data inconsistency, may need manual fix |
| 500 Internal Server Error | External service crashed | Wait for external service to recover |
| TypeError: Cannot read property | Code bug | Report to engineering, don't retry |
| Validation failed | Bad data | Check data format, may need manual correction |
| Rate limit exceeded | Too many requests | Wait and retry, or spread out load |
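The table above can be read as a simple lookup from error text to a next step. The sketch below encodes it that way. The three action labels and the fallback to manual review for unknown errors are assumptions made for this example, not part of the product.

```python
# Error patterns from the table, mapped to a hypothetical action label
ERROR_RULES = [
    ("ECONNREFUSED", "retry_after_recovery"),
    ("ETIMEDOUT", "retry_after_recovery"),
    ("404 Not Found", "manual_review"),
    ("500 Internal Server Error", "retry_after_recovery"),
    ("TypeError: Cannot read property", "report_to_engineering"),
    ("Validation failed", "manual_review"),
    ("Rate limit exceeded", "retry_after_recovery"),
]


def suggest_action(error_message):
    """Return the suggested next step for an error message."""
    text = error_message.lower()
    for pattern, action in ERROR_RULES:
        if pattern.lower() in text:
            return action
    # Unknown errors get a human look rather than a blind retry
    return "manual_review"
```

For example, `suggest_action("connect ECONNREFUSED 127.0.0.1:5432")` returns `"retry_after_recovery"`, while any message containing `TypeError: Cannot read property` returns `"report_to_engineering"`.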
Best Practices
✅ Do
- Check DLQ daily as part of system health routine
- Retry in bulk after known outages (database restart, API maintenance)
- Filter by event type to identify systematic issues
- Export to CSV for analysis or reporting
- Resolve events after handling them to keep queue clean
- Monitor retry counts - events with 3+ retries need investigation
❌ Don't
- Don't ignore the queue - failed events indicate real problems
- Don't retry before fixing - wastes resources, creates noise
- Don't discard without understanding - you might lose important data
- Don't resolve without action - just hides the problem
- Don't panic over single failures - look for patterns and volume
Metrics to Watch
Queue Size
- Normal: 0-10 events
- Warning: 11-50 events (investigate)
- Critical: more than 50 events (urgent action needed)
Age of Oldest Event
- Normal: < 1 hour
- Warning: 1-24 hours
- Critical: > 24 hours (stale events may be irrelevant)
Event Type Distribution
If 90% of failures are one event type:
- That handler likely has a bug
- Prioritize fixing that specific issue
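The three metrics above can be checked with a few small functions. This is a sketch under stated assumptions: a queue of exactly 10 events counts as normal and exactly 50 as warning, an oldest event aged exactly 24 hours counts as warning, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def size_level(queue_size):
    """Classify queue size: up to 10 normal, up to 50 warning, else critical."""
    if queue_size <= 10:
        return "normal"
    if queue_size <= 50:
        return "warning"
    return "critical"


def age_level(oldest_created, now):
    """Classify the age of the oldest event in the queue."""
    age = now - oldest_created
    if age < timedelta(hours=1):
        return "normal"
    if age <= timedelta(hours=24):
        return "warning"
    return "critical"


def dominant_event_type(event_types, threshold=0.9):
    """Return the event type behind at least `threshold` of failures, if any."""
    if not event_types:
        return None
    name, count = Counter(event_types).most_common(1)[0]
    return name if count / len(event_types) >= threshold else None
```

When `dominant_event_type` returns a name, that handler is the first place to look for a bug.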
Exporting Data
Export to CSV when:
- Reporting to management
- Analyzing patterns in spreadsheet
- Sharing with engineering team
- Creating incident documentation
Export includes:
- Event type
- Error message
- Timestamp
- Retry count
- Listener name
Common Workflows
Daily Health Check (2 minutes)
- Open Dead Letter Queue
- Check "Unresolved Only" filter is on
- Note queue size (normal is 10 or fewer)
- Scan error types for patterns
- If events present, decide: Retry, Resolve, or Escalate
Post-Incident Cleanup
- Confirm the system outage is resolved
- Filter by time range (during outage)
- Select all events
- Bulk retry
- Monitor Event System Dashboard for success
- Investigate any that fail again (they need individual attention)
Weekly Analysis
- Export queue to CSV
- Open in spreadsheet
- Create pivot table by event type and error
- Identify top 3 failure patterns
- Report to engineering for prioritization
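If you prefer a script to a spreadsheet pivot table, the same top-3 analysis can be done with Python's standard library. The column headers `event_type` and `error` are assumptions; adjust them to match the headers in your actual export.

```python
import csv
import io
from collections import Counter


def top_failure_patterns(csv_file, limit=3):
    """Count rows per (event type, error) pair and return the most common."""
    reader = csv.DictReader(csv_file)
    counts = Counter((row["event_type"], row["error"]) for row in reader)
    return counts.most_common(limit)


# Made-up sample standing in for an exported file
SAMPLE = """event_type,error,retry_count
email.sent,ETIMEDOUT,2
email.sent,ETIMEDOUT,3
email.sent,ETIMEDOUT,1
listing.created,Validation failed,1
listing.created,Validation failed,1
invoice.synced,Rate limit exceeded,3
"""

patterns = top_failure_patterns(io.StringIO(SAMPLE))
for (event_type, error), count in patterns:
    print(f"{count:>3}  {event_type}  {error}")
```

To run it against a real export, pass an open file instead of the sample, for example `open("dlq_export.csv", newline="")` (the filename is a placeholder).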
Troubleshooting
Q: I retried events but they're still failing
A: The underlying issue isn't fixed. Check:
- Is the database actually back up?
- Did the external API recover?
- Is there a code bug that needs deployment?
Q: Can I see the full event payload?
A: Click the event row to expand. You'll see:
- Full error stack trace
- Event payload (JSON data)
- Metadata (user ID, deployment, timestamp)
Q: What's the difference between Resolve and Discard?
A:
- Resolve: Keeps the record, marks as handled, won't retry
- Discard: Permanently deletes the record (irreversible)
Use Resolve for audit trail, Discard only for garbage cleanup.
Q: How long do events stay in the queue?
A: Until you act on them with Retry, Resolve, or Discard. Resolved events remain in the queue as a record, and there's no automatic expiration (by design - you shouldn't lose failed events).
Q: Can I retry events from last week?
A: Yes, but consider:
- Data might be stale (user already took alternative action)
- Side effects might be unexpected (duplicate emails)
- Review event details before retrying old events
Related Pages
- Event System Dashboard - Overview of event health
- Event Flow - Real-time event monitoring
- Event Metrics - Detailed analytics
- System Alerts - Automated failure notifications
Emergency Procedures
🚨 Queue Growing Rapidly (>100 events/hour)
- Don't panic - events are safely queued
- Check Event System Dashboard for error rate
- Identify affected event types
- Check system status (database, external APIs)
- If widespread outage: wait for recovery, then bulk retry
- If specific event type: escalate to engineering
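To check whether the queue has crossed the 100 events/hour mark, you can count events from the last hour. The sketch below is an approximation: it uses each event's Created time, which is when the event originally happened rather than when it entered the queue. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def failures_in_last_hour(created_times, now):
    """Count events whose Created time falls within the past hour."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for created in created_times if cutoff < created <= now)


def is_growing_rapidly(created_times, now, limit=100):
    """True when more than `limit` events arrived in the past hour."""
    return failures_in_last_hour(created_times, now) > limit
```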
🚨 All Events Failing (100% error rate)
- Treat this as a critical system issue
- Check database connectivity
- Check external service status
- Review recent deployments
- Contact engineering immediately
- Do NOT retry until root cause identified
Remember: The Dead Letter Queue is your safety net. Events here aren't lost - they're waiting for you to help them succeed!