Journey 10: Watchdog & Monitoring

Hourly health checks, daily summaries, alerts, and auto-remediation

Step 1: Cron → Watchdog

Hourly health checks

Every hour, the watchdog fires a comprehensive health check across all infrastructure and tenant systems. The check is queued via BullMQ so it runs reliably even if the previous check is still in progress.

Technical details
Entry point: src/watchdog_runners.js → runHourlyCheck()
Orchestrator: src/watchdog.js (BullMQ queue)
Cron schedule: 0 * * * * (top of every hour)
Database tables: watchdog_reports
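
One way to keep hourly runs from piling up is to give each job a deterministic ID, so a second enqueue for the same hour is ignored rather than stacked. The sketch below only builds the job arguments. `buildHourlyCheckJob`, the job name, and the ID format are hypothetical and are not taken from `src/watchdog.js`; it assumes the result is passed to a BullMQ-style `queue.add(name, data, opts)` on a queue that skips jobs whose `jobId` already exists.

```javascript
// Hypothetical helper, not the project's actual code: builds the arguments
// for a BullMQ-style queue.add(name, data, opts) call.
function buildHourlyCheckJob(now = new Date()) {
  // Truncate the UTC timestamp to the hour, e.g. "2024-01-15T13".
  const hourBucket = now.toISOString().slice(0, 13);
  return {
    name: 'hourly-check', // assumed job name
    data: { scheduledFor: `${hourBucket}:00:00.000Z` },
    // Same hour => same jobId, so a repeat enqueue for that hour is skipped
    // (assuming the queue deduplicates on jobId).
    opts: { jobId: `watchdog-hourly-${hourBucket}` },
  };
}

const a = buildHourlyCheckJob(new Date('2024-01-15T13:00:02Z'));
const b = buildHourlyCheckJob(new Date('2024-01-15T13:59:58Z'));
console.log(a.opts.jobId === b.opts.jobId); // true: both fall in the 13:00 UTC hour
```

The ID avoids colons on purpose, since some queue libraries reserve that character in job IDs.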
Step 2: Watchdog → Infrastructure

Infrastructure checks

The watchdog probes each layer of the stack: PostgreSQL connectivity, Redis availability, SMTP reachability, disk space usage, memory headroom, and Docker container health. Any failure is flagged for remediation or alerting.

Technical details
Function: src/watchdog_checks.js → checkInfrastructure() (line 18)
Checks: PostgreSQL, Redis, SMTP, disk space, memory, Docker health
Database tables: watchdog_reports
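
The probes can be reduced to a common result shape so that reporting and alerting treat them uniformly. The sketch below shows that shape for the two threshold-based checks (disk and memory). The function names, the result fields, and the 90% / 10% thresholds are assumptions for illustration, not values read from `checkInfrastructure()`.

```javascript
// Assumed thresholds, for illustration only.
const LIMITS = { maxDiskUsedPct: 90, minMemFreePct: 10 };

// Each probe returns { name, ok, detail } so results can be stored uniformly.
function checkDisk(usedBytes, totalBytes) {
  const usedPct = (usedBytes * 100) / totalBytes;
  return { name: 'disk', ok: usedPct <= LIMITS.maxDiskUsedPct, detail: `${usedPct.toFixed(1)}% used` };
}

function checkMemory(freeBytes, totalBytes) {
  const freePct = (freeBytes * 100) / totalBytes;
  return { name: 'memory', ok: freePct >= LIMITS.minMemFreePct, detail: `${freePct.toFixed(1)}% free` };
}

// Collapse individual results into one report row.
function summarize(results) {
  const failed = results.filter((r) => !r.ok).map((r) => r.name);
  return { status: failed.length === 0 ? 'healthy' : 'degraded', failed };
}

const report = summarize([
  checkDisk(95e9, 100e9), // 95% used: over the assumed limit
  checkMemory(4e9, 16e9), // 25% free: within the assumed limit
]);
console.log(report.status, report.failed); // degraded [ 'disk' ]
```

Connectivity probes (PostgreSQL, Redis, SMTP, Docker) would return the same `{ name, ok, detail }` shape, with `ok` set by whether the connection attempt succeeded.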
Step 3: Watchdog → Tenant Systems

Tenant system checks

Beyond infrastructure, the watchdog checks Oshun-specific systems: Chatwoot API health, freshness of Google Calendar and Drive OAuth tokens, third-party API key validity, BullMQ queue depth, and memory system integrity.

Technical details
Tenant checks: src/watchdog_checks.js → checkTenantSystems() (line 217)
Token refresh probes: tryRefreshCalendarToken() (line 454), tryRefreshDriveToken() (line 491)
Memory check: src/watchdog_checks.js → checkMemorySystems() (line 91)
Database tables: watchdog_reports, memory_sessions
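
A token freshness check usually comes down to comparing the stored expiry against the current time plus a safety margin, so that a token due to lapse before the next hourly run is refreshed now. The sketch below is an assumption about how such a check could look. `tokenNeedsRefresh`, `queueIsBackedUp`, the 65-minute margin, and the depth limit of 100 are hypothetical and are not read from `checkTenantSystems()`.

```javascript
const MINUTE_MS = 60 * 1000;

// True when the token has expired or will expire within marginMs.
// The default margin slightly exceeds the hourly check interval (an assumption).
function tokenNeedsRefresh(expiresAtMs, nowMs = Date.now(), marginMs = 65 * MINUTE_MS) {
  return expiresAtMs - nowMs <= marginMs;
}

// Flags a queue whose waiting-job count exceeds an assumed limit.
function queueIsBackedUp(waitingCount, maxWaiting = 100) {
  return waitingCount > maxWaiting;
}

const now = Date.parse('2024-01-15T13:00:00Z');
console.log(tokenNeedsRefresh(now + 30 * MINUTE_MS, now)); // true: expires before the next run
console.log(tokenNeedsRefresh(now + 180 * MINUTE_MS, now)); // false: three hours left
console.log(queueIsBackedUp(250)); // true
```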
Step 4: Watchdog → Fix

Auto-remediation

For a defined set of known-safe failures, the watchdog fixes the problem itself without waking anyone up. Stuck BullMQ jobs are retried, Listmonk configuration drift is corrected, and expired OAuth tokens for Calendar and Drive are refreshed automatically.

Technical details
BullMQ remediation: src/watchdog_remediation.js → autoRemediateBullMQ()
Listmonk remediation: src/watchdog_remediation.js → autoRemediateListmonk()
Token refresh: src/watchdog_checks.js → tryRefreshCalendarToken(), tryRefreshDriveToken()
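
A "defined set of known-safe failures" suggests an allowlist: each failure type either maps to a remediation or is escalated. A minimal sketch of that split follows. The failure keys (`bullmq_stuck_jobs` and so on) and the `planRemediation` helper are hypothetical; only the remediation function names come from the technical details above.

```javascript
// Allowlist of failure types considered safe to fix automatically.
// Keys are hypothetical; values name the functions listed in the technical details.
const REMEDIATIONS = new Map([
  ['bullmq_stuck_jobs', 'autoRemediateBullMQ'],
  ['listmonk_config_drift', 'autoRemediateListmonk'],
  ['calendar_token_expired', 'tryRefreshCalendarToken'],
  ['drive_token_expired', 'tryRefreshDriveToken'],
]);

// Split failures into those with a known fix and those that need a human.
function planRemediation(failureTypes) {
  const remediate = [];
  const escalate = [];
  for (const type of failureTypes) {
    if (REMEDIATIONS.has(type)) {
      remediate.push({ type, handler: REMEDIATIONS.get(type) });
    } else {
      escalate.push(type);
    }
  }
  return { remediate, escalate };
}

const plan = planRemediation(['bullmq_stuck_jobs', 'postgres_down']);
console.log(plan.remediate.map((r) => r.handler)); // [ 'autoRemediateBullMQ' ]
console.log(plan.escalate); // [ 'postgres_down' ]
```

Defaulting unknown failure types to escalation keeps the automatic path conservative: nothing is "fixed" unless it was explicitly listed.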
Step 5: Watchdog → Telegram

Urgent alerts

When a check fails and auto-remediation is not possible, the watchdog immediately sends a Telegram alert to Matthew. The alert includes what failed, the severity level, and a suggested fix so no time is lost diagnosing the problem.

Technical details
Alert formatter: src/watchdog_formatters.js → formatUrgentAlert()
Delivery: src/telegram.js → sendNotification()
Database tables: watchdog_reports
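
An alert carrying the three fields described above (what failed, severity, suggested fix) can be built as a short plain-text message. The sketch below is hypothetical: `formatUrgentAlertSketch`, its input fields, and the message layout are assumptions and do not reflect the real output of `formatUrgentAlert()`.

```javascript
// Hypothetical input shape: { check, severity, detail, suggestedFix }.
function formatUrgentAlertSketch(failure) {
  const lines = [
    `[${failure.severity.toUpperCase()}] ${failure.check} check failed`,
    `What failed: ${failure.detail}`,
    // Fall back to a generic hint when no fix is known.
    `Suggested fix: ${failure.suggestedFix || 'none recorded, investigate manually'}`,
  ];
  return lines.join('\n');
}

console.log(formatUrgentAlertSketch({
  check: 'Redis',
  severity: 'critical',
  detail: 'connection refused',
  suggestedFix: 'restart the redis container',
}));
// [CRITICAL] Redis check failed
// What failed: connection refused
// Suggested fix: restart the redis container
```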
Step 6: Watchdog → Telegram

Daily summary

Every morning at 8 AM EST, a digest lands in Telegram covering the past 24 hours: checks passed vs. failed, uptime percentage, queue health, auto-remediations performed, and any outstanding issues that still need attention.

Technical details
Runner: src/watchdog_runners.js → runDailySummary()
Formatter: src/watchdog_formatters.js → formatDailySummary()
Cron schedule: 0 13 * * * (8 AM EST = 13:00 UTC)
Database tables: watchdog_reports
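
The headline numbers in the digest can be derived from the hourly reports alone. The sketch below assumes each report row has an `ok` boolean and a `remediations` count, and treats uptime as the share of passing hourly checks. Those field names and that definition of uptime are assumptions, not the actual `watchdog_reports` schema.

```javascript
// Hypothetical aggregation over the last 24 hourly report rows.
function summarizeLast24h(reports) {
  const passed = reports.filter((r) => r.ok).length;
  const failed = reports.length - passed;
  const remediations = reports.reduce((sum, r) => sum + (r.remediations || 0), 0);
  // Guard against an empty window so the percentage is never NaN.
  const uptimePct = reports.length === 0
    ? null
    : Number(((passed * 100) / reports.length).toFixed(1));
  return { passed, failed, remediations, uptimePct };
}

// 24 hourly reports, two of which failed; one failure was auto-remediated.
const reports = Array.from({ length: 24 }, (_, hour) => ({
  ok: hour !== 3 && hour !== 17,
  remediations: hour === 3 ? 1 : 0,
}));
const digest = summarizeLast24h(reports);
console.log(digest.passed, digest.failed, digest.remediations, digest.uptimePct); // 22 2 1 91.7
```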
Step 7: Feedback Loop

Weekly feedback analysis

Every Monday morning, Karen's conversation performance is analyzed in depth: response quality scores, failure patterns, and rules learned from corrections. The insights are stored as playbooks so that Karen improves automatically from week to week.

Technical details
Weekly analysis: src/actions/feedback_loop/weekly.js
Rule learning: src/actions/feedback_loop/rules.js
Scoring: src/actions/feedback_loop/scoring.js
Cron schedule: 0 14 * * 1 (Mondays, 9 AM EST = 14:00 UTC)
Database tables: conversation_scores, learned_rules, conversation_playbooks
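
One simple way to surface failure patterns is to count how often each failure tag appears among the week's low-scoring conversations and keep the tags that recur. The sketch below is illustrative only: the row shape (`score`, `failureTags`), the 0.6 score cutoff, and the minimum count of 2 are assumptions, not the logic in `weekly.js` or `scoring.js`.

```javascript
// Hypothetical pattern finder over one week of scored conversations.
function findRecurringFailures(scoredConversations, { lowScore = 0.6, minCount = 2 } = {}) {
  const counts = new Map();
  for (const convo of scoredConversations) {
    if (convo.score >= lowScore) continue; // only low-scoring conversations count
    for (const tag of convo.failureTags) {
      counts.set(tag, (counts.get(tag) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= minCount)
    .sort((x, y) => y[1] - x[1]) // most frequent first
    .map(([tag, count]) => ({ tag, count }));
}

const week = [
  { score: 0.4, failureTags: ['wrong_timezone'] },
  { score: 0.5, failureTags: ['wrong_timezone', 'missed_followup'] },
  { score: 0.9, failureTags: [] },
  { score: 0.3, failureTags: ['wrong_timezone'] },
];
console.log(findRecurringFailures(week).map((p) => `${p.tag}:${p.count}`)); // [ 'wrong_timezone:3' ]
```

Tags that appear only once are dropped, which keeps one-off mistakes from turning into learned rules.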
Step 8: Watchdog → Voice

Voice bridge monitoring

Once a day, the watchdog probes the Twilio SIP trunk and the ElevenLabs API to confirm that the voice bridge is operational. It also syncs transcripts of completed calls into the database so Karen can reference past voice interactions in future conversations.

Technical details
Bridge monitor: src/watchdog_voice.js → runVoiceBridgeMonitor()
Transcript sync: src/watchdog_voice.js → runCallTranscriptSync()
Cron schedule: 0 14 * * * (daily, 9 AM EST = 14:00 UTC)
Database tables: voice_call_log, voice_callbacks
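
Transcript sync is safest when it is idempotent, so that a rerun never duplicates rows. A minimal sketch of the selection step follows, assuming each call record has an `id`, a `status`, and a `transcript` field. Those fields and the `selectCallsToSync` helper are hypothetical and are not taken from `runCallTranscriptSync()`.

```javascript
// Returns completed calls that have a transcript and are not yet stored.
function selectCallsToSync(calls, alreadySyncedIds) {
  const synced = new Set(alreadySyncedIds);
  return calls.filter(
    (call) => call.status === 'completed' && Boolean(call.transcript) && !synced.has(call.id),
  );
}

const calls = [
  { id: 'c1', status: 'completed', transcript: 'Hello, I would like to book...' },
  { id: 'c2', status: 'in-progress', transcript: null },
  { id: 'c3', status: 'completed', transcript: 'Calling about my appointment...' },
];
console.log(selectCallsToSync(calls, ['c1']).map((c) => c.id)); // [ 'c3' ]
```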