Journey 8: Voice Calls

Inbound calls handled by ElevenLabs voice agent with Karen AI brain

Patient calls → Twilio receives → ElevenLabs voice → Karen brain → Conversation → Action → Transcript synced
STEP 1 Phone → Twilio

Patient calls Oshun's number

A patient calls (876) 676-6297, Oshun's published phone number. The call hits a Twilio SIP trunk, which routes it to the antigravity-core voice system. Twilio then requests the TwiML URL configured on the number, and the TwiML it gets back starts the voice pipeline.

▶ Twilio SIP config and TwiML routing
Source files
Oshun/create_oshun_trunk.mjs — SIP trunk provisioning script
Oshun/oshun_twiml.xml — TwiML response that connects call to ElevenLabs
What happens
Twilio receives the inbound PSTN call and looks up the TwiML URL configured on the number. The TwiML instructs Twilio to connect the caller through to the ElevenLabs conversational AI endpoint via the voice webhook. The transition is seamless — the caller hears a brief connection tone, then Karen's voice.
Env vars
TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN
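As a rough illustration of the TwiML hand-off described above, the sketch below builds a response that bridges the caller to a WebSocket media stream. The contents of oshun_twiml.xml are not shown in this document, so the choice of the Connect/Stream verbs, the function name buildVoiceTwiml, and the example URL are all assumptions.

```javascript
// Hypothetical sketch: the real oshun_twiml.xml may use a different verb or URL.

// Escape characters that are not safe inside an XML attribute value.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a TwiML document that connects the caller to a WebSocket stream.
function buildVoiceTwiml(streamUrl) {
  if (!/^wss:\/\//.test(streamUrl)) {
    throw new Error("streamUrl must be a wss:// URL");
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXmlAttr(streamUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// Placeholder URL for illustration only.
const twiml = buildVoiceTwiml("wss://voice.example.com/stream?agent=demo&mode=inbound");
```

Escaping the URL matters because query strings usually contain an ampersand, which would otherwise produce invalid XML.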
STEP 2 Twilio → ElevenLabs

ElevenLabs voice agent picks up

A real-time AI voice agent answers the call. The voice sounds natural — not robotic. The patient hears: "Hello, thank you for calling Oshun Beauty and Wellness. How can I help you today?" ElevenLabs handles both speech-to-text (what the patient says) and text-to-speech (Karen's replies) in real time.

▶ ElevenLabs conversation endpoint
Source files
src/webhooks/voice.js — ElevenLabs conversation endpoint (line ~350+); registers the route that ElevenLabs calls back on for real-time events
What happens
ElevenLabs opens a WebSocket connection to the server for the duration of the call. As the patient speaks, ElevenLabs transcribes in real time and sends the text to Karen for processing. Karen's text responses are streamed back and synthesized into speech by ElevenLabs as they arrive, which keeps latency low enough for natural conversation.
Env vars
ELEVENLABS_API_KEY, ELEVENLABS_AGENT_ID
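One common way to keep latency low with streamed replies is to hand text to the speech synthesizer one sentence at a time instead of waiting for the whole response. The sketch below shows that idea only; sentencesFrom is a hypothetical helper and is not taken from src/webhooks/voice.js, whose streaming details are not shown in this document.

```javascript
// Hypothetical helper: not taken from src/webhooks/voice.js.
// Groups streamed text fragments into complete sentences so each one can be
// sent for speech synthesis as soon as it is finished.
async function* sentencesFrom(fragments) {
  let buffer = "";
  for await (const fragment of fragments) {
    buffer += fragment;
    // Emit every complete sentence currently in the buffer. A sentence counts
    // as complete once its end punctuation is followed by whitespace.
    let match;
    while ((match = buffer.match(/^(.*?[.!?])\s+/s)) !== null) {
      yield match[1].trim();
      buffer = buffer.slice(match[0].length);
    }
  }
  if (buffer.trim() !== "") {
    yield buffer.trim(); // flush whatever is left when the stream ends
  }
}
```

Because the generator accepts any iterable or async iterable of strings, it can be exercised with a plain array in tests and fed from a real token stream in production.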
STEP 3 ElevenLabs → Karen AI

Karen processes the conversation

Each patient utterance is transcribed and routed through the Karen AI brain — the same core logic that handles WhatsApp and Instagram, but with a voice-optimized prompt that keeps responses concise and conversational. No long paragraphs; short, spoken sentences only.

▶ Voice tool handler and AI dispatch
Source files
src/webhooks/voice.js: handleElevenLabsTool()
src/ai.js: chat() — same pipeline as chat channels
src/gemini.js — used for cheaper/faster voice sub-tasks
What happens
handleElevenLabsTool() receives the transcribed speech and calls chat() with a voice-mode flag. The prompt instructs Karen to respond in 1–2 short sentences max — mimicking natural speech rhythm. Gemini handles lightweight classification tasks (e.g., "is this a booking request?") to reduce OpenRouter spend on the voice pipeline. Each exchange is logged to voice_call_log.
DB tables
voice_call_log
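The dispatch flow described in this step can be sketched roughly as follows. This is an illustration under assumptions: handleVoiceTurn, limitSentences, the injected classifyIntent and logExchange functions, and the mode: "voice" flag name are all hypothetical, since the real signatures of handleElevenLabsTool() and chat() are not shown in this document.

```javascript
// Hypothetical sketch of the dispatch flow; the real handleElevenLabsTool()
// signature and the chat()/classifier interfaces are not shown in this document.
async function handleVoiceTurn(turn, deps) {
  const { classifyIntent, chat, logExchange } = deps;

  // Cheap classification first (the role the document assigns to Gemini).
  const intent = await classifyIntent(turn.text);

  // Main reply from the shared chat pipeline, flagged as voice mode.
  const reply = await chat({
    contactId: turn.contactId,
    message: turn.text,
    mode: "voice", // assumed flag name
    intent,
  });

  // Guard: keep at most two sentences even if the model overshoots the prompt.
  const spoken = limitSentences(reply, 2);

  // Stand-in for the write to voice_call_log.
  await logExchange({ callId: turn.callId, heard: turn.text, said: spoken, intent });
  return spoken;
}

// Keep only the first `max` sentences of a reply. Splitting on whitespace that
// follows end punctuation leaves decimals such as "0.30" intact.
function limitSentences(text, max) {
  return text.trim().split(/(?<=[.!?])\s+/).slice(0, max).join(" ");
}
```

Passing the collaborators in as a deps object keeps the sketch testable with stubs and avoids guessing at how the real modules are imported.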
STEP 4 Karen AI → Action

Action taken based on patient need

Karen identifies what the patient needs and acts. Simple pricing questions are answered directly. Appointment requests trigger real-time booking via Cal.com. Complex medical questions — anything Karen can't confidently answer — fire an immediate Telegram alert to Oshun staff so a human can follow up.

▶ Booking and escalation handlers
Source files
src/webhooks/voice_helpers.js: handleScheduleCallback(), handleBookAppointment()
What happens
handleBookAppointment() calls the Cal.com API to check availability and create a booking in real time while the patient is still on the call. Karen reads back the confirmed time slot. handleScheduleCallback() handles cases where no slots are immediately available — it schedules a follow-up call and sends confirmation via WhatsApp. Escalations trigger sendNotification() to the Oshun Telegram group with the caller's name, number, and what they asked.
DB tables
voice_call_log, bookings, voice_callbacks
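The branching between a direct answer, a booking, a callback, and an escalation can be sketched as below. This is a simplified illustration, not the code in voice_helpers.js: routeVoiceAction, the intent labels, and the injected findSlots, createBooking, and scheduleCallback functions are hypothetical stand-ins for the Cal.com and callback logic, and the sendNotification signature is assumed.

```javascript
// Hypothetical sketch; the real handleBookAppointment(), handleScheduleCallback()
// and sendNotification() signatures are not shown in this document.
async function routeVoiceAction(request, deps) {
  const { findSlots, createBooking, scheduleCallback, sendNotification } = deps;

  if (request.intent === "medical_question") {
    // Escalate to staff with the caller's name, number, and question.
    await sendNotification(
      `Voice escalation: ${request.name} (${request.phone}) asked: ${request.text}`
    );
    return { outcome: "escalated" };
  }

  if (request.intent === "booking") {
    const slots = await findSlots(request.service);
    if (slots.length === 0) {
      // No availability: fall back to a scheduled follow-up call.
      const callback = await scheduleCallback(request.phone);
      return { outcome: "callback", callbackId: callback.id };
    }
    // Take the earliest offered slot and confirm it while the caller waits.
    const booking = await createBooking(slots[0], request);
    return { outcome: "booked", start: booking.start };
  }

  // Pricing and other simple questions are answered directly by Karen.
  return { outcome: "info_only" };
}
```

Returning a small outcome object gives the caller something concrete to read back to the patient and to record against the call.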
STEP 5 Voice → Memory

Call ends, transcript synced

When the call ends, the full transcript is stored and analyzed. What the caller asked, whether a booking was made, any follow-up actions required — all of it feeds the contact's profile and memory system, the same as a WhatsApp conversation would.

▶ Transcript sync and memory consolidation
Source files
src/watchdog_voice.js: runCallTranscriptSync()
What happens
runCallTranscriptSync() runs on a scheduled interval (via watchdog). It pulls completed call transcripts from ElevenLabs, matches them to contacts by phone number, and stores the full exchange in memory. Key facts extracted ("asked about lip fillers", "booked for Friday") are written to memory_knowledge with vector embeddings. conversation_metrics tracks call outcome (booked / info only / escalated / no-answer).
DB tables
voice_call_log, conversation_metrics, memory_conversations, memory_knowledge
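Matching transcripts to contacts by phone number usually needs some normalization, because the same number can arrive in several formats. The sketch below shows one way to do that and to derive the call outcome. It is not the actual runCallTranscriptSync() implementation; phoneKey, matchTranscripts, classifyOutcome, and the transcript field names (callerNumber, answered, booked, escalated) are assumptions.

```javascript
// Hypothetical sketch; not the actual runCallTranscriptSync() implementation.

// Reduce a phone number to its last 10 digits so "(876) 676-6297" and
// "+1 876 676 6297" compare equal. Assumes NANP-style 10-digit numbers.
function phoneKey(raw) {
  const digits = String(raw).replace(/\D/g, "");
  return digits.slice(-10);
}

// Map call flags to the outcomes tracked in conversation_metrics.
function classifyOutcome(t) {
  if (!t.answered) return "no-answer";
  if (t.booked) return "booked";
  if (t.escalated) return "escalated";
  return "info only";
}

// Attach each transcript to the contact with the same phone key, if any.
function matchTranscripts(transcripts, contacts) {
  const byPhone = new Map(contacts.map((c) => [phoneKey(c.phone), c]));
  return transcripts.map((t) => ({
    callId: t.callId,
    contactId: byPhone.get(phoneKey(t.callerNumber))?.id ?? null,
    outcome: classifyOutcome(t),
  }));
}
```

Transcripts from unknown numbers come back with a null contactId, which leaves room for the caller to decide whether to create a new contact or skip the record.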
Gotcha
Outbound calls are expensive, at approximately $0.30 per call; cost optimization for outbound voice campaigns is pending. Inbound calls are covered by the Twilio number rental fee.