Journey 8: Voice Calls — Oshun System Docs

STEP 1 Phone → Twilio

Patient calls Oshun's number

A patient calls (876) 676-6297, Oshun's published phone number. The call arrives on a Twilio SIP trunk, which routes it to the antigravity-core voice system. Twilio then sends a webhook request to the server and uses the TwiML it gets back to start the voice pipeline.

Twilio SIP config and TwiML routing

Source files

Oshun/create_oshun_trunk.mjs — SIP trunk provisioning script
Oshun/oshun_twiml.xml — TwiML response that connects call to ElevenLabs

What happens

Twilio receives the inbound PSTN call and looks up the TwiML URL configured on the number. The TwiML instructs Twilio to connect the caller to the ElevenLabs conversational AI endpoint via the voice webhook. From the caller's side the handoff is brief: a short connection tone, then Karen's voice.
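
As a rough illustration of the kind of TwiML response involved, the sketch below builds a document that bridges call audio to a WebSocket endpoint. The contents of Oshun/oshun_twiml.xml are not reproduced in this document, so the use of <Connect><Stream>, the helper names escapeXml and buildConnectTwiml, and the example URL are all assumptions.

```javascript
// Hypothetical sketch: the real oshun_twiml.xml is not shown in these docs.

// Escape characters that are special inside XML attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a TwiML document that asks Twilio to bridge the call audio to a
// WebSocket endpoint. <Connect><Stream> is standard TwiML; whether Oshun's
// TwiML uses it is an assumption.
function buildConnectTwiml(streamUrl) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(streamUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// Placeholder URL for illustration only.
console.log(buildConnectTwiml("wss://voice.example.com/stream?agent=demo&lang=en"));
```

Escaping the URL matters because query strings usually contain "&", which would otherwise make the XML invalid.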

Env vars

TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN

STEP 2 Twilio → ElevenLabs

ElevenLabs voice agent picks up

A real-time AI voice agent answers the call. The voice sounds natural — not robotic. The patient hears: "Hello, thank you for calling Oshun Beauty and Wellness. How can I help you today?" ElevenLabs handles both speech-to-text (what the patient says) and text-to-speech (Karen's replies) in real time.

ElevenLabs conversation endpoint

Source files

src/webhooks/voice.js — ElevenLabs conversation endpoint (line ~350+)
This file registers the route that ElevenLabs calls back on for real-time events.

What happens

ElevenLabs opens a WebSocket connection to the server for the duration of the call. As the patient speaks, ElevenLabs transcribes in real time and sends the text to Karen for processing. Karen's text responses are streamed back and synthesized into speech by ElevenLabs as they arrive, which keeps latency low enough for natural conversation.
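
One way to picture the streaming step is a small buffer that releases complete sentences as text fragments arrive, so each sentence can be handed to speech synthesis without waiting for the whole reply. This is a minimal sketch, not the code in src/webhooks/voice.js; createSentenceChunker is a hypothetical helper, and the sentence-boundary rule is a simplification.

```javascript
// Hypothetical helper: buffers streamed text and releases whole sentences,
// so each sentence can be sent for synthesis as soon as it is complete.
function createSentenceChunker() {
  let buffer = "";
  return {
    // Add a streamed text fragment; returns any sentences it completed.
    push(fragment) {
      buffer += fragment;
      const sentences = [];
      // Simplified rule: a sentence ends at ., ! or ? followed by whitespace.
      const boundary = /[.!?]+\s+/g;
      let start = 0;
      let match;
      while ((match = boundary.exec(buffer)) !== null) {
        const end = match.index + match[0].length;
        sentences.push(buffer.slice(start, end).trim());
        start = end;
      }
      // Keep the unfinished tail for the next fragment.
      buffer = buffer.slice(start);
      return sentences;
    },
    // Return whatever text remains once the stream has ended.
    flush() {
      const rest = buffer.trim();
      buffer = "";
      return rest ? [rest] : [];
    },
  };
}

const chunker = createSentenceChunker();
console.log(chunker.push("Hello, thank you for calling Oshun "));
console.log(chunker.push("Beauty and Wellness. How can I "));
console.log(chunker.push("help you today?"));
console.log(chunker.flush());
```

The boundary rule will split after abbreviations such as "Dr. ", so a production version would need a more careful tokenizer.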

Env vars

ELEVENLABS_API_KEY, ELEVENLABS_AGENT_ID

STEP 3 ElevenLabs → Karen AI

Karen processes the conversation

Each patient utterance is transcribed and routed through the Karen AI brain — the same core logic that handles WhatsApp and Instagram, but with a voice-optimized prompt that keeps responses concise and conversational. No long paragraphs; short, spoken sentences only.

Voice tool handler and AI dispatch

Source files

src/webhooks/voice.js → handleElevenLabsTool()
src/ai.js → chat() — same pipeline as chat channels
src/gemini.js — used for cheaper/faster voice sub-tasks

What happens

handleElevenLabsTool() receives the transcribed speech and calls chat() with a voice-mode flag. The prompt instructs Karen to respond in 1–2 short sentences max — mimicking natural speech rhythm. Gemini handles lightweight classification tasks (e.g., "is this a booking request?") to reduce OpenRouter spend on the voice pipeline. Each exchange is logged to voice_call_log.
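
The flow described above can be sketched as a handler that classifies the utterance, calls the chat pipeline in voice mode, and logs the exchange. The real signatures of handleElevenLabsTool() and chat() are not shown in this document, so createVoiceToolHandler, the injected function names, and every argument shape below are assumptions made for illustration.

```javascript
// Sketch only: chat, classifyIntent and logExchange are injected stand-ins
// for chat() in src/ai.js, a Gemini-backed classifier, and a writer for
// voice_call_log. Their real signatures are not documented here.
function createVoiceToolHandler({ chat, classifyIntent, logExchange }) {
  return async function handleUtterance({ callId, from, transcript }) {
    // Cheap classification first (e.g. "is this a booking request?").
    const intent = await classifyIntent(transcript);

    // Same chat pipeline as the text channels, flagged as voice so the
    // prompt can ask for 1-2 short spoken sentences.
    const reply = await chat({
      contact: from,
      message: transcript,
      voiceMode: true,
      intent,
    });

    // Record the exchange for the call log.
    await logExchange({ callId, from, transcript, reply, intent });
    return { reply, intent };
  };
}
```

Injecting the three dependencies keeps the sketch testable with stubs and avoids guessing at the actual module exports.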

DB tables

voice_call_log

STEP 4 Karen AI → Action

Action taken based on patient need

Karen identifies what the patient needs and acts. Simple pricing questions are answered directly. Appointment requests trigger real-time booking via Cal.com. Complex medical questions — anything Karen can't confidently answer — fire an immediate Telegram alert to Oshun staff so a human can follow up.

Booking and escalation handlers

Source files

src/webhooks/voice_helpers.js → handleScheduleCallback(), handleBookAppointment()

What happens

handleBookAppointment() calls the Cal.com API to check availability and create a booking in real time while the patient is still on the call. Karen reads back the confirmed time slot. handleScheduleCallback() handles cases where no slots are immediately available — it schedules a follow-up call and sends confirmation via WhatsApp. Escalations trigger sendNotification() to the Oshun Telegram group with the caller's name, number, and what they asked.
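
The branching described above (book if a slot exists, fall back to a callback, escalate when needed) might look roughly like the sketch below. It does not call the Cal.com API or Telegram directly: findSlot, createBooking, scheduleCallback and sendNotification are injected, and their shapes, the handleVoiceAction name, and the outcome labels are all hypothetical.

```javascript
// Sketch only: every dependency is injected and hypothetical. findSlot and
// createBooking stand in for Cal.com calls; sendNotification stands in for
// the Telegram alert.
async function handleVoiceAction(need, deps) {
  const { findSlot, createBooking, scheduleCallback, sendNotification } = deps;

  if (need.type === "escalate") {
    // Pass staff the caller's name, number, and what they asked.
    await sendNotification(
      `Voice escalation: ${need.name} (${need.phone}) asked: ${need.question}`
    );
    return { outcome: "escalated" };
  }

  if (need.type === "booking") {
    const slot = await findSlot(need.service, need.preferredDate);
    if (slot) {
      const booking = await createBooking({ slot, name: need.name, phone: need.phone });
      return { outcome: "booked", slot: booking.slot };
    }
    // No slot available: fall back to a scheduled callback.
    const callback = await scheduleCallback({ name: need.name, phone: need.phone });
    return { outcome: "callback_scheduled", callbackAt: callback.at };
  }

  // Anything else (e.g. a pricing question) is answered directly.
  return { outcome: "info_only" };
}
```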

DB tables

voice_call_log, bookings, voice_callbacks

STEP 5 Voice → Memory

Call ends, transcript synced

When the call ends, the full transcript is stored and analyzed. What the caller asked, whether a booking was made, any follow-up actions required — all of it feeds the contact's profile and memory system, the same as a WhatsApp conversation would.

Transcript sync and memory consolidation

Source files

src/watchdog_voice.js → runCallTranscriptSync()

What happens

runCallTranscriptSync() runs on a scheduled interval (via watchdog). It pulls completed call transcripts from ElevenLabs, matches them to contacts by phone number, and stores the full exchange in memory. Key facts extracted ("asked about lip fillers", "booked for Friday") are written to memory_knowledge with vector embeddings. conversation_metrics tracks call outcome (booked / info only / escalated / no-answer).
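
Matching transcripts to contacts by phone number depends on both sides being normalized the same way. The sketch below shows one possible approach; it is not the code in src/watchdog_voice.js. The function names, the record fields (callerNumber, callId, outcome), and the choice to compare the last 10 digits are assumptions, and the numbers used are fictional.

```javascript
// Reduce a phone number to its digits and keep the last 10, so that
// "+1 (876) 555-0100" and "8765550100" compare equal. Comparing the last
// 10 digits is an assumption that suits NANP-style numbers only.
function normalizePhone(raw) {
  return String(raw).replace(/\D/g, "").slice(-10);
}

// Pair each completed call transcript with a known contact, if any.
function matchTranscriptsToContacts(transcripts, contacts) {
  const byPhone = new Map(contacts.map((c) => [normalizePhone(c.phone), c]));
  const matched = [];
  const unmatched = [];
  for (const t of transcripts) {
    const contact = byPhone.get(normalizePhone(t.callerNumber));
    if (contact) {
      matched.push({ contactId: contact.id, callId: t.callId, outcome: t.outcome });
    } else {
      unmatched.push(t.callId);
    }
  }
  return { matched, unmatched };
}

const result = matchTranscriptsToContacts(
  [
    { callId: "a", callerNumber: "8765550100", outcome: "booked" },
    { callId: "b", callerNumber: "+18765550199", outcome: "info only" },
  ],
  [{ id: 1, phone: "+1 (876) 555-0100" }]
);
console.log(result);
```

Calls that match no contact are returned separately so a caller of this function can decide whether to create a new contact or skip them.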

DB tables

voice_call_log, conversation_metrics, memory_conversations, memory_knowledge

Gotcha

Outbound calls cost approximately $0.30 each, and cost optimization for outbound voice campaigns is still pending. Inbound calls are covered by the Twilio number rental fee.
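
For budgeting, the per-call figure scales linearly with campaign size. A minimal sketch, assuming a flat rate of about $0.30 per outbound call regardless of call length (the function and constant names are illustrative):

```javascript
// Approximate cost per outbound call in cents, taken from the note above.
// Working in cents avoids floating-point rounding on the multiplication.
const OUTBOUND_COST_CENTS = 30;

// Estimate the cost in dollars of an outbound campaign of callCount calls.
function estimateOutboundCost(callCount) {
  if (!Number.isInteger(callCount) || callCount < 0) {
    throw new RangeError("callCount must be a non-negative integer");
  }
  return (callCount * OUTBOUND_COST_CENTS) / 100;
}

console.log(estimateOutboundCost(500)); // 150
```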