Journey 10: Watchdog & Monitoring

Step 1: Cron → Watchdog

Hourly health checks

Every hour, the watchdog fires a comprehensive health check across all infrastructure and tenant systems. The check is queued via BullMQ so it runs reliably even if the previous check is still in progress.

Technical details

Entry point

src/watchdog_runners.js → runHourlyCheck()

Orchestrator

src/watchdog.js — BullMQ queue

Cron schedule

0 * * * * (top of every hour)

Database tables

watchdog_reports
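
The hourly schedule can be sketched in a few lines. Both helpers below are hypothetical and not taken from src/watchdog.js: `nextTopOfHour` computes the next instant matching `0 * * * *` in UTC, and `hourlySlotId` derives one stable ID per hour. The ID only helps under the assumption that the queue ignores a second job carrying an ID it already holds, in which case a repeated enqueue within the same hour is harmless.

```javascript
// Hypothetical helpers for illustration; not taken from src/watchdog.js.

// Next instant matching the cron expression "0 * * * *", evaluated in UTC.
function nextTopOfHour(now) {
  const next = new Date(now.getTime());
  next.setUTCMinutes(0, 0, 0); // truncate to the start of the current hour
  next.setUTCHours(next.getUTCHours() + 1); // advance one hour (rolls over at midnight)
  return next;
}

// Stable ID for the hour a check belongs to, e.g. "hourly-check-2024-03-10T23".
function hourlySlotId(date) {
  return `hourly-check-${date.toISOString().slice(0, 13)}`;
}

const now = new Date("2024-03-10T23:25:30.000Z");
console.log(nextTopOfHour(now).toISOString()); // 2024-03-11T00:00:00.000Z
console.log(hourlySlotId(now)); // hourly-check-2024-03-10T23
```

The slot ID avoids the `:` character on purpose, since some queue libraries reserve it for their own key layout.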

Step 2: Watchdog → Infrastructure

Infrastructure checks

The watchdog probes each layer of the stack: PostgreSQL connectivity, Redis availability, SMTP reachability, disk space usage, memory headroom, and Docker container health. Any failure is flagged for remediation or alerting.

Technical details

Function

src/watchdog_checks.js → checkInfrastructure() (line 18)

Checks

PostgreSQL, Redis, SMTP, disk space, memory, Docker health

Database tables

watchdog_reports
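
One way to probe several layers without letting one failure hide the others is `Promise.allSettled`. The sketch below is not the body of checkInfrastructure(); the `runProbes` helper and the three stand-in probes are hypothetical and only illustrate the pattern.

```javascript
// Hypothetical probe runner; the probes are stand-ins, not the real checks.
async function runProbes(probes) {
  const names = Object.keys(probes);
  // Wrapping each call in a promise chain also captures synchronous throws.
  const settled = await Promise.allSettled(
    names.map((name) => Promise.resolve().then(() => probes[name]()))
  );
  return names.map((name, i) => {
    const outcome = settled[i];
    if (outcome.status === "fulfilled") {
      return { name, ok: true, detail: outcome.value };
    }
    const reason = outcome.reason;
    return { name, ok: false, detail: reason instanceof Error ? reason.message : String(reason) };
  });
}

// Stand-in probes with assumed return values and thresholds.
const exampleProbes = {
  postgres: async () => "connected",
  redis: async () => "PONG",
  disk: async () => {
    throw new Error("92% used, limit is 90%");
  },
};

runProbes(exampleProbes).then((results) => {
  for (const r of results) {
    console.log(`${r.ok ? "OK  " : "FAIL"} ${r.name}: ${r.detail}`);
  }
});
```

Every probe runs to completion, so a single report row can list all failing layers at once.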

Step 3: Watchdog → Tenant Systems

Tenant system checks

Beyond infrastructure, the watchdog checks Oshun-specific systems: Chatwoot API health, freshness of Google Calendar and Drive OAuth tokens, third-party API key validity, BullMQ queue depth, and the integrity of the memory system backed by the memory_sessions table (not to be confused with the RAM headroom check in Step 2).

Technical details

Tenant checks

src/watchdog_checks.js → checkTenantSystems() (line 217)

Token refresh probes

src/watchdog_checks.js → tryRefreshCalendarToken() (line 454), tryRefreshDriveToken() (line 491)

Memory check

src/watchdog_checks.js → checkMemorySystems() (line 91)

Database tables

watchdog_reports, memory_sessions
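
Token freshness can be treated as a three-way classification on the time left before expiry. The sketch below is hypothetical: `classifyToken` is not part of src/watchdog_checks.js, and the 10-minute refresh margin is an assumed value chosen only for illustration.

```javascript
// Hypothetical classifier; the 10-minute margin is an assumed value.
const REFRESH_MARGIN_MS = 10 * 60 * 1000;

function classifyToken(expiresAt, now = new Date(), marginMs = REFRESH_MARGIN_MS) {
  const remainingMs = expiresAt.getTime() - now.getTime();
  if (remainingMs <= 0) return "expired"; // already unusable
  if (remainingMs <= marginMs) return "expiring"; // still valid, refresh soon
  return "fresh";
}

const checkedAt = new Date("2024-03-10T12:00:00Z");
console.log(classifyToken(new Date("2024-03-10T11:59:00Z"), checkedAt)); // expired
console.log(classifyToken(new Date("2024-03-10T12:05:00Z"), checkedAt)); // expiring
console.log(classifyToken(new Date("2024-03-10T13:00:00Z"), checkedAt)); // fresh
```

Separating "expiring" from "expired" lets a refresh probe run before requests start failing rather than after.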

Step 4: Watchdog → Fix

Auto-remediation

For a defined set of known-safe failures, the watchdog fixes the problem itself without waking anyone up. Stuck BullMQ jobs are retried, Listmonk configuration drift is corrected, and expired OAuth tokens for Calendar and Drive are refreshed automatically.

Technical details

BullMQ remediation

src/watchdog_remediation.js → autoRemediateBullMQ()

Listmonk remediation

src/watchdog_remediation.js → autoRemediateListmonk()

Token refresh

src/watchdog_checks.js → tryRefreshCalendarToken(), tryRefreshDriveToken()
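
The "known-safe failures only" rule can be pictured as an allowlist that maps a failure kind to a handler, with everything else escalated. The sketch below is hypothetical: the failure kinds, the `remediate` helper, and the handler bodies are invented for illustration, and the handlers are synchronous for brevity where real remediation would likely be asynchronous.

```javascript
// Hypothetical dispatch; failure kinds and handler bodies are placeholders.
function remediate(failures, handlers) {
  const fixed = [];
  const escalated = [];
  for (const failure of failures) {
    const handler = handlers.get(failure.kind);
    if (!handler) {
      escalated.push(failure); // not on the allowlist: a human decides
      continue;
    }
    try {
      handler(failure);
      fixed.push(failure);
    } catch (err) {
      // A failed fix is escalated together with the reason it failed.
      escalated.push({ ...failure, remediationError: err.message });
    }
  }
  return { fixed, escalated };
}

const handlers = new Map([
  ["bullmq_stuck_jobs", () => {}], // stand-in for retrying stuck jobs
  ["listmonk_config_drift", () => {}], // stand-in for correcting drift
  ["oauth_token_expired", () => { throw new Error("refresh token revoked"); }],
]);

const outcome = remediate(
  [
    { kind: "bullmq_stuck_jobs" },
    { kind: "smtp_unreachable" },
    { kind: "oauth_token_expired" },
  ],
  handlers
);
console.log(outcome.fixed.map((f) => f.kind)); // [ 'bullmq_stuck_jobs' ]
console.log(outcome.escalated.map((f) => f.kind)); // [ 'smtp_unreachable', 'oauth_token_expired' ]
```

Anything that lands in `escalated` is what the urgent-alert step would pick up.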

Step 5: Watchdog → Telegram

Urgent alerts

When a check fails and auto-remediation is not possible, the watchdog immediately sends a Telegram alert to Matthew. The alert includes what failed, the severity level, and a suggested fix so no time is lost diagnosing the problem.

Technical details

Alert formatter

src/watchdog_formatters.js → formatUrgentAlert()

Delivery

src/telegram.js → sendNotification()

Database tables

watchdog_reports
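
An alert that carries what failed, the severity, and a suggested fix could be assembled as below. This is a hypothetical sketch: the field names and the message layout are assumptions and do not reproduce the output of formatUrgentAlert().

```javascript
// Hypothetical formatter; field names and layout are assumptions.
function formatAlert({ check, severity, error, suggestedFix }) {
  return [
    `[${severity.toUpperCase()}] ${check} check failed`,
    `What failed: ${error}`,
    `Suggested fix: ${suggestedFix}`,
  ].join("\n");
}

console.log(
  formatAlert({
    check: "SMTP",
    severity: "critical",
    error: "connection timed out after 10s",
    suggestedFix: "verify the mail relay is reachable from the app host",
  })
);
```

Putting the severity first keeps it visible in a notification preview, which typically shows only the first line.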

Step 6: Watchdog → Telegram

Daily summary

Every morning at 8 AM EST, a digest lands in Telegram covering the past 24 hours: checks passed vs. failed, uptime percentage, queue health, auto-remediations performed, and any outstanding issues that still need attention.
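
The digest figures can be derived from the past day's report rows with a small aggregation. The sketch below is hypothetical: the row shape `{ passed, remediated }` is assumed rather than taken from the watchdog_reports schema, and "uptime percentage" is interpreted here as the share of checks that passed.

```javascript
// Hypothetical aggregation; the row shape is an assumption.
function summarize(reports) {
  const total = reports.length;
  const passed = reports.filter((r) => r.passed).length;
  const failed = total - passed;
  const remediated = reports.filter((r) => !r.passed && r.remediated).length;
  const outstanding = failed - remediated; // failures nobody has fixed yet
  // One decimal place; null when there is nothing to divide by.
  const uptimePct = total === 0 ? null : Math.round((passed / total) * 1000) / 10;
  return { total, passed, failed, remediated, outstanding, uptimePct };
}

// 24 hourly checks: 22 passed, 1 failed and was auto-fixed, 1 is still open.
const day = [
  ...Array.from({ length: 22 }, () => ({ passed: true, remediated: false })),
  { passed: false, remediated: true },
  { passed: false, remediated: false },
];
console.log(summarize(day));
// { total: 24, passed: 22, failed: 2, remediated: 1, outstanding: 1, uptimePct: 91.7 }
```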

Technical details

Runner

src/watchdog_runners.js → runDailySummary()

Formatter

src/watchdog_formatters.js → formatDailySummary()

Cron schedule

0 13 * * * (13:00 UTC = 8 AM EST; 9 AM local time while daylight saving is in effect)

Database tables

watchdog_reports
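
Because the cron expression is pinned to 13:00 UTC, the local delivery time shifts by an hour when New York switches between EST and EDT. The sketch below checks that arithmetic with the built-in `Intl.DateTimeFormat`; the `localHourAndZone` helper is hypothetical and the two dates are arbitrary winter and summer examples.

```javascript
// Hypothetical helper: converts a UTC instant to the local hour and zone label.
function localHourAndZone(utcDate, timeZone) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "numeric",
    hourCycle: "h23",
    timeZoneName: "short",
  }).formatToParts(utcDate);
  const get = (type) => parts.find((p) => p.type === type).value;
  return { hour: Number(get("hour")), zone: get("timeZoneName") };
}

const winterRun = new Date("2024-01-15T13:00:00Z");
const summerRun = new Date("2024-07-15T13:00:00Z");
console.log(localHourAndZone(winterRun, "America/New_York")); // { hour: 8, zone: 'EST' }
console.log(localHourAndZone(summerRun, "America/New_York")); // { hour: 9, zone: 'EDT' }
```

If the digest should arrive at 8 AM local time all year, the schedule would need to be evaluated in the America/New_York zone rather than in UTC.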

Step 7: Feedback Loop

Weekly feedback analysis

Every Monday morning, Karen's conversation performance is analyzed in depth: response quality scores, failure patterns, and rules learned from corrections. Insights are stored as playbooks so Karen improves automatically week over week.

Technical details

Weekly analysis

src/actions/feedback_loop/weekly.js

Rule learning

src/actions/feedback_loop/rules.js

Scoring

src/actions/feedback_loop/scoring.js

Cron schedule

0 14 * * 1 (Monday 14:00 UTC = 9 AM EST)

Database tables

conversation_scores, learned_rules, conversation_playbooks
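
As a rough picture of what a weekly pass over scored conversations might compute, the sketch below averages the scores and ranks recurring failure tags. Everything in it is hypothetical: the `analyzeWeek` helper, the row shape `{ score, failureTag }`, and the tag names are invented for illustration and are not taken from the feedback_loop modules or the conversation_scores schema.

```javascript
// Hypothetical weekly aggregation; row shape and tag names are assumptions.
function analyzeWeek(rows) {
  const averageScore =
    rows.length === 0 ? null : rows.reduce((sum, r) => sum + r.score, 0) / rows.length;

  // Count how often each failure tag occurred during the week.
  const counts = new Map();
  for (const row of rows) {
    if (row.failureTag) {
      counts.set(row.failureTag, (counts.get(row.failureTag) || 0) + 1);
    }
  }
  // Most frequent pattern first.
  const patterns = [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([tag, count]) => ({ tag, count }));

  return { averageScore, patterns };
}

const week = [
  { score: 4, failureTag: null },
  { score: 2, failureTag: "missed_booking_intent" },
  { score: 3, failureTag: "wrong_tone" },
  { score: 3, failureTag: "missed_booking_intent" },
];
console.log(analyzeWeek(week));
// averageScore is 3; "missed_booking_intent" (2) ranks ahead of "wrong_tone" (1)
```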

Step 8: Watchdog → Voice

Voice bridge monitoring

Daily, the watchdog probes the Twilio SIP trunk and ElevenLabs API to confirm the voice bridge is operational. It also syncs call transcripts from completed calls into the database so Karen can reference past voice interactions in future conversations.

Technical details

Bridge monitor

src/watchdog_voice.js → runVoiceBridgeMonitor()

Transcript sync

src/watchdog_voice.js → runCallTranscriptSync()

Cron schedule

0 14 * * * (daily at 14:00 UTC = 9 AM EST)

Database tables

voice_call_log, voice_callbacks
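
A daily transcript sync stays safe to re-run if it only picks up completed calls that are not yet stored. The sketch below shows that selection step under stated assumptions: the `selectTranscriptsToSync` helper and the call fields (`id`, `status`, `transcript`) are hypothetical and do not reflect the voice_call_log schema or the body of runCallTranscriptSync().

```javascript
// Hypothetical selection step; the call fields are assumptions.
function selectTranscriptsToSync(calls, alreadySyncedIds) {
  return calls.filter(
    (call) =>
      call.status === "completed" && // skip calls still in progress or failed
      Boolean(call.transcript) && // nothing to store without a transcript
      !alreadySyncedIds.has(call.id) // skip rows written by an earlier run
  );
}

const calls = [
  { id: "c1", status: "completed", transcript: "Hello..." },
  { id: "c2", status: "in-progress", transcript: null },
  { id: "c3", status: "completed", transcript: "Hi, I'd like to..." },
];
const synced = new Set(["c1"]);
console.log(selectTranscriptsToSync(calls, synced).map((c) => c.id)); // [ 'c3' ]
```

Running the same selection twice with an updated set of synced IDs returns nothing new, which is what makes the job idempotent.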