Hourly health checks, daily summaries, alerts, and auto-remediation
Every hour, the watchdog fires a comprehensive health check across all infrastructure and tenant systems. The check is queued via BullMQ so it runs reliably even if the previous check is still in progress.
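As a minimal sketch of how an hourly trigger like this could be registered, assuming BullMQ's repeatable-job option (`repeat.pattern`, which older releases called `repeat.cron`). The job name, payload, and `scheduleHourlyCheck` helper are hypothetical, and `queue` stands for any object with a BullMQ-style `add(name, data, opts)` method.

```javascript
// Cron pattern documented for the hourly check.
const HOURLY_PATTERN = "0 * * * *";

// Hypothetical helper: registers a repeatable job on a BullMQ-style queue.
// With the real library, `queue` would come from `new Queue(name, { connection })`.
function scheduleHourlyCheck(queue) {
  return queue.add(
    "hourly-health-check",                  // illustrative job name
    { trigger: "cron" },                    // illustrative payload
    { repeat: { pattern: HOURLY_PATTERN } } // assumption: BullMQ repeat option
  );
}
```

Because the scheduler only enqueues a job, a slow check does not prevent the next one from being queued. How overlapping runs are processed would depend on the worker's concurrency settings, which the source does not describe.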
runHourlyCheck()
0 * * * * (top of every hour)

The watchdog probes each layer of the stack: PostgreSQL connectivity, Redis availability, SMTP reachability, disk space usage, memory headroom, and Docker container health. Any failure is flagged for remediation or alerting.
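One way to probe several layers and flag failures is to run every probe concurrently and collect the outcomes. This is a sketch only: `runProbes` and the probe names are hypothetical, and the real `checkInfrastructure()` is not shown in the source. Each probe is assumed to be a function that throws (or rejects) when its layer is unhealthy.

```javascript
// Run all probes concurrently; one failing probe never hides the others.
// `probes` maps a layer name to a function that throws or rejects on failure.
async function runProbes(probes) {
  const names = Object.keys(probes);
  // The async wrapper turns a synchronous throw into a rejection too.
  const settled = await Promise.allSettled(
    names.map(async (name) => probes[name]())
  );
  return names.map((name, i) => {
    const outcome = settled[i];
    if (outcome.status === "fulfilled") {
      return { name, ok: true, error: null };
    }
    const reason = outcome.reason;
    return {
      name,
      ok: false,
      error: reason instanceof Error ? reason.message : String(reason),
    };
  });
}
```

The failed entries in the returned array are what a later step would hand to remediation or alerting.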
checkInfrastructure() (line 18)

Beyond infrastructure, the watchdog checks Oshun-specific systems: Chatwoot API health, freshness of Google Calendar and Drive OAuth tokens, third-party API key validity, BullMQ queue depth, and memory system integrity.
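Token freshness can be illustrated by comparing an expiry timestamp with the current time. The 10-minute refresh window, the status labels, and the `tokenStatus` helper below are assumptions for illustration. The source does not say what threshold the watchdog uses.

```javascript
// Hypothetical threshold: a token with under 10 minutes left counts as "expiring".
const REFRESH_WINDOW_MS = 10 * 60 * 1000;

// Classify a token by how much lifetime remains (all times in epoch milliseconds).
function tokenStatus(expiresAtMs, nowMs = Date.now(), windowMs = REFRESH_WINDOW_MS) {
  const remainingMs = expiresAtMs - nowMs;
  if (remainingMs <= 0) return "expired";
  if (remainingMs <= windowMs) return "expiring";
  return "fresh";
}
```

A check built this way could attempt a refresh for "expiring" and "expired" tokens and report a failure only if the refresh itself does not succeed.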
checkTenantSystems() (line 217)
tryRefreshCalendarToken() (line 454), tryRefreshDriveToken() (line 491)
checkMemorySystems() (line 91)

For a defined set of known-safe failures, the watchdog fixes the problem itself without waking anyone up. Stuck BullMQ jobs are retried, Listmonk configuration drift is corrected, and expired OAuth tokens for Calendar and Drive are refreshed automatically.
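The "known-safe" idea can be sketched as an allowlist: only failure codes with a registered handler are fixed automatically, and everything else falls through to alerting. The failure codes, the `buildRemediator` helper, and the result shape are hypothetical and do not reflect the actual logic of `autoRemediateBullMQ()` or `autoRemediateListmonk()`.

```javascript
// `handlers` is a Map from failure code to a fix function.
// Any code not in the Map is treated as unsafe to fix automatically.
function buildRemediator(handlers) {
  return function remediate(failure) {
    const handler = handlers.get(failure.code);
    if (!handler) {
      return { code: failure.code, action: "alert", reason: "no safe remediation" };
    }
    try {
      handler(failure);
      return { code: failure.code, action: "remediated" };
    } catch (err) {
      // A failed fix is escalated instead of being silently dropped.
      return { code: failure.code, action: "alert", reason: String(err.message) };
    }
  };
}
```

Keeping the allowlist explicit means a new failure type is escalated by default until someone decides it is safe to automate.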
autoRemediateBullMQ()
autoRemediateListmonk()
tryRefreshCalendarToken(), tryRefreshDriveToken() in src/watchdog_checks.js

When a check fails and auto-remediation is not possible, the watchdog immediately sends a Telegram alert to Matthew. The alert includes what failed, the severity level, and a suggested fix so no time is lost diagnosing the problem.
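An alert carrying those three pieces of information might be assembled as below. This is an illustrative sketch, not the real `formatUrgentAlert()`: the `buildAlertText` name, the field names, and the message layout are all assumptions.

```javascript
// Build a plain-text alert containing what failed, the severity, and a suggested fix.
function buildAlertText({ check, severity, detail, suggestedFix }) {
  return [
    `[${severity.toUpperCase()}] ${check} failed`,
    `What failed: ${detail}`,
    `Suggested fix: ${suggestedFix}`,
  ].join("\n");
}
```

For example, `buildAlertText({ check: "Redis", severity: "critical", detail: "connection refused", suggestedFix: "restart the redis container" })` yields a three-line message whose first line is `[CRITICAL] Redis failed`.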
formatUrgentAlert()
sendNotification()

Every morning at 8 AM EST, a digest lands in Telegram covering the past 24 hours: checks passed vs. failed, uptime percentage, queue health, auto-remediations performed, and any outstanding issues that still need attention.
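The pass/fail counts in such a digest can be derived from the stored check results. The sketch below approximates the percentage as the share of passing checks. That is an assumption, since the source does not define how uptime is calculated, and `summarizeChecks` with its result shape is hypothetical.

```javascript
// Aggregate check results from the last 24 hours into digest figures.
// Each result is assumed to look like { ok: boolean }.
function summarizeChecks(results) {
  const total = results.length;
  const passed = results.filter((r) => r.ok).length;
  const failed = total - passed;
  // Pass rate rounded to one decimal place; null when there is nothing to report.
  const passRatePct = total === 0 ? null : Math.round((passed / total) * 1000) / 10;
  return { total, passed, failed, passRatePct };
}
```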
runDailySummary()
formatDailySummary()
0 13 * * * (8 AM EST = 13:00 UTC)

Every Monday morning, Karen's conversation performance is analyzed in depth: response quality scores, failure patterns, and rules learned from corrections. Insights are stored as playbooks so Karen improves automatically week over week.
0 14 * * 1 (Monday 9 AM EST)

Daily, the watchdog probes the Twilio SIP trunk and ElevenLabs API to confirm the voice bridge is operational. It also syncs call transcripts from completed calls into the database so Karen can reference past voice interactions in future conversations.
runVoiceBridgeMonitor()
runCallTranscriptSync()
0 14 * * * (9 AM EST daily)
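The cron expressions in this section are annotated with EST times, which suggests the scheduler evaluates them in UTC. Under that assumption, the conversion can be sketched as a fixed five-hour shift (EST is UTC-5). The `estToUtcCron` helper is hypothetical.

```javascript
const EST_OFFSET_HOURS = 5; // EST is UTC-5, so UTC hour = EST hour + 5

// Convert an on-the-hour EST time into a five-field UTC cron expression.
function estToUtcCron(hourEst, dayOfWeek = "*") {
  if (!Number.isInteger(hourEst) || hourEst < 0 || hourEst > 23) {
    throw new RangeError("hourEst must be an integer from 0 to 23");
  }
  const utcHour = hourEst + EST_OFFSET_HOURS;
  if (utcHour > 23) {
    // Crossing midnight UTC would also shift the day-of-week field.
    throw new RangeError("time crosses midnight UTC; adjust the day manually");
  }
  return `0 ${utcHour} * * ${dayOfWeek}`;
}
```

Here `estToUtcCron(8)` gives `0 13 * * *` and `estToUtcCron(9, 1)` gives `0 14 * * 1`, matching the schedules listed above. With a fixed UTC schedule, the jobs would run one hour later by local clock time while daylight saving time (EDT, UTC-4) is in effect.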