Voice AI is fundamentally a latency and state-management problem. A 500ms delay feels like an eternity on a phone call. LLMs are stateless text generators, but sales calls require rigid adherence to a script and the ability to instantly halt audio playback when a human interrupts mid-sentence (barge-in).
Architected a dual-stack system (Next.js + Python FastAPI) that holds persistent WebSocket connections with Twilio. Audio streams continuously into Deepgram for STT, while a custom ONNX model detects utterance ends. The orchestrator uses BAML to force the LLM through a strict state machine, streaming the response directly into Cartesia TTS.
"We secured over $20,000 in cloud infrastructure credits across AWS, Azure, and Groq to scale this. But the consumer Voice AI market commoditized in a matter of months. Shelving it taught me that shipping perfect code into a hyper-saturated red ocean is still a failing business strategy. Pragmatism over the sunk-cost fallacy."
Twilio connects to the FastAPI backend via WebSockets. PCM audio streams continuously and is piped directly to Deepgram for real-time Speech-to-Text.
A custom ONNX Voice Activity Detection model analyzes audio embeddings to detect end-of-utterance. If the user interrupts, a 'clear' signal is sent immediately to halt the ongoing TTS audio stream (barge-in).
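The barge-in path hinges on Twilio's `clear` event, which tells Twilio to discard any audio it has buffered for playback, cutting the agent off mid-sentence. A minimal sketch of the gating logic (the `BargeIn` class and its wiring are illustrative, not the production code; the VAD inference itself is elided):

```python
import json


def make_clear_message(stream_sid: str) -> str:
    """Twilio Media Streams 'clear' event: drops all buffered playback audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})


class BargeIn:
    """Tracks whether the agent is mid-utterance so a 'clear' is only
    sent when there is actually TTS audio to interrupt."""

    def __init__(self) -> None:
        self.agent_speaking = False

    def on_user_speech(self, send, stream_sid: str) -> None:
        # Called when the VAD model flags incoming user speech.
        if self.agent_speaking:
            send(make_clear_message(stream_sid))
            self.agent_speaking = False
```

In practice `send` would be the WebSocket's send callable, and the same event would also cancel the in-flight TTS generation task.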
The orchestrator routes the transcript through RouteLLM. BoundaryML (BAML) enforces strict schema outputs, navigating the sales script state-machine and querying Qdrant for document RAG context.
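The sales-script state machine can be sketched as a transition table. The stage names below are illustrative, not the real script; in the actual system BAML's schema enforcement guarantees the LLM's proposed stage parses to a valid enum value, while the legality of the transition is still checked in code:

```python
from enum import Enum


class Stage(Enum):
    GREETING = "greeting"
    DISCOVERY = "discovery"
    PITCH = "pitch"
    OBJECTION = "objection"
    CLOSE = "close"


# Hypothetical transition table: which stages the script allows next.
TRANSITIONS = {
    Stage.GREETING: {Stage.DISCOVERY},
    Stage.DISCOVERY: {Stage.PITCH, Stage.OBJECTION},
    Stage.PITCH: {Stage.OBJECTION, Stage.CLOSE},
    Stage.OBJECTION: {Stage.PITCH, Stage.CLOSE},
    Stage.CLOSE: set(),
}


def advance(current: Stage, proposed: Stage) -> Stage:
    """Accept the LLM's proposed next stage only if the script permits it;
    otherwise the conversation stays in the current stage."""
    return proposed if proposed in TRANSITIONS[current] else current
```

This double-checking matters because schema validation proves the output is *well-formed*, not that the jump is *allowed* by the script.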
As the LLM generates the response text, it is streamed token-by-token into Cartesia or Deepgram Aura TTS, generating audio bytes that are piped directly back to Twilio.
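One way to keep token-by-token streaming from sounding choppy is to batch tokens into sentence-sized chunks before handing them to the TTS engine: audio starts before the reply finishes, but each TTS request gets enough text for natural prosody. The helper below is an illustrative sketch, not the production implementation:

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?", "\n")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for TTS.

    Yields a chunk whenever a token ends with sentence-final punctuation,
    then flushes any trailing partial sentence at the end of the stream.
    """
    buf: list[str] = []
    for tok in tokens:
        buf.append(tok)
        if tok.rstrip().endswith(SENTENCE_END):
            yield "".join(buf)
            buf = []
    if buf:
        yield "".join(buf)
```

Each yielded chunk would be posted to the Cartesia (or Deepgram Aura) streaming endpoint, and the returned audio bytes re-encoded to mu-law and written back onto the Twilio WebSocket.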
Node.js was a poor fit for our hot path: its ML ecosystem is thin, and juggling heavy raw audio buffers alongside ONNX inference on a single-threaded event loop proved fragile. Moving the core engine to Python (FastAPI) gave us native ML tooling and eliminated the memory leaks we had been chasing.
We had to maintain a dual-stack setup (Next.js frontend, Python backend), duplicating type definitions across boundaries and complicating CI/CD deployments.