Role: Solo Founder
Type: Startup
Timeline: 2024–2025
Cloud Credits: $20k+

THE PROBLEM

Latency & State

Voice AI is fundamentally a latency and state-management problem. A 500ms delay feels like an eternity on a phone call. LLMs are stateless text generators, but sales calls require rigid adherence to a script and the ability to halt audio playback instantly when a human interrupts mid-sentence (barge-in). Every existing solution at the time had either unacceptable latency or poor script control, so we built our own.

THE PROCESS

How I approached it

I isolated the three failure points independently: turn detection (when does the human stop talking?), LLM response time (how fast can we generate a reply?), and audio streaming (how quickly can we get audio back to Twilio?). Each was optimized in isolation before the full pipeline was connected. The custom ONNX VAD model and the BAML state machine were the two bets that paid off most.
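Measuring each stage separately can be sketched as below. This is a minimal illustration, not the project's instrumentation: the `timed` helper, the stage names, and the `time.sleep` placeholders are all assumptions standing in for real turn detection, LLM, and audio streaming calls.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Wall-clock duration of each stage, in milliseconds.
timings_ms: Dict[str, float] = {}


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0


# Placeholder workloads (hypothetical); swap in the real calls to profile them.
with timed("turn_detection"):
    time.sleep(0.005)
with timed("llm_response"):
    time.sleep(0.020)
with timed("audio_streaming"):
    time.sleep(0.010)

# The stage with the largest share of the turn budget is the next one to optimize.
slowest_stage = max(timings_ms, key=lambda name: timings_ms[name])
```

Timing the stages one at a time makes it clear which of the three dominates the turn before the full pipeline is wired together.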

Engineering Perspective

"We secured over $20,000 in cloud infrastructure credits across AWS, Azure, and Groq to scale this. But the consumer Voice AI market commoditized in a matter of months. Shelving it taught me that shipping perfect code into a hyper-saturated red ocean is still a failing business strategy. Pragmatism over the sunk-cost fallacy."

REAL-TIME PIPELINE

1. Stream Ingestion

Twilio connects to the FastAPI backend via WebSockets. PCM audio streams continuously and is piped directly to Deepgram for real-time Speech-to-Text.
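As a rough sketch of the ingestion step: Twilio Media Streams send JSON text frames over the WebSocket, and frames whose `event` is `media` carry base64-encoded audio in `media.payload`. The helper below pulls the raw bytes out of one frame. The name `extract_audio` and the sample frame are illustrative assumptions. In a FastAPI handler this would presumably run on each `await websocket.receive_text()` before the bytes are forwarded to the STT provider.

```python
import base64
import json
from typing import Optional


def extract_audio(raw_message: str) -> Optional[bytes]:
    """Return decoded audio bytes for a 'media' frame, or None for other events."""
    message = json.loads(raw_message)
    if message.get("event") != "media":
        # 'connected', 'start', 'stop' and similar frames carry no audio.
        return None
    return base64.b64decode(message["media"]["payload"])


# Simulated frame (hypothetical payload, not captured from a real call).
frame = json.dumps({
    "event": "media",
    "streamSid": "MZ-example",
    "media": {"payload": base64.b64encode(b"\xff\x7f\x00").decode("ascii")},
})
audio = extract_audio(frame)
```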

2. VAD & Turn Detection

A custom ONNX Voice Activity Detection model analyzes audio embeddings to detect end-of-utterance. If the user interrupts, a 'clear' signal is sent immediately to halt the ongoing TTS audio stream (barge-in).
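The turn-detection and barge-in logic might look roughly like the sketch below. It does not reproduce the custom ONNX model: it assumes some model already yields a per-frame speech probability, and applies a simple "N silent frames ends the turn" rule. The threshold, the frame count, and the class name are assumptions. The `clear` frame is the Twilio Media Streams message that drops audio already buffered for playback.

```python
import json
from dataclasses import dataclass


@dataclass
class TurnDetector:
    """Declares end-of-utterance after `hangover_frames` consecutive silent frames."""

    speech_threshold: float = 0.5  # assumed cut-off on the model's speech probability
    hangover_frames: int = 3       # assumed number of silent frames that ends a turn
    _in_speech: bool = False
    _silent_run: int = 0

    def update(self, speech_prob: float) -> bool:
        """Feed one frame's speech probability; return True when the turn ends."""
        if speech_prob >= self.speech_threshold:
            self._in_speech = True
            self._silent_run = 0
            return False
        if not self._in_speech:
            return False  # leading silence, no turn in progress
        self._silent_run += 1
        if self._silent_run >= self.hangover_frames:
            self._in_speech = False
            self._silent_run = 0
            return True
        return False


def clear_message(stream_sid: str) -> str:
    """Build the Twilio 'clear' frame sent when the caller barges in."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})


detector = TurnDetector()
probabilities = [0.1, 0.9, 0.8, 0.2, 0.1, 0.1, 0.9]  # made-up model outputs
turn_ends = [detector.update(p) for p in probabilities]
```

In this sketch, barge-in would be the case where a frame crosses `speech_threshold` while TTS audio is still playing, at which point `clear_message(stream_sid)` is sent over the same WebSocket.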

3. Orchestration & RAG

The orchestrator routes the transcript through RouteLLM. BoundaryML (BAML) enforces strict schema outputs so the LLM can only move through the sales-script state machine, and Qdrant is queried for document RAG context.
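The idea of forcing the model through a script state machine can be illustrated without BAML. The sketch below is a plain-Python stand-in, not BAML and not the project's actual script. The stage names, the transition table, and the `next_stage`/`reply` JSON fields are hypothetical. It shows the principle only: the model's output is parsed against a fixed schema, and any transition the script does not allow is rejected.

```python
import json
from enum import Enum
from typing import Dict, Set, Tuple


class Stage(str, Enum):
    GREETING = "greeting"
    QUALIFY = "qualify"
    PITCH = "pitch"
    CLOSE = "close"


# Hypothetical script graph: which stages may follow each stage.
ALLOWED: Dict[Stage, Set[Stage]] = {
    Stage.GREETING: {Stage.QUALIFY},
    Stage.QUALIFY: {Stage.QUALIFY, Stage.PITCH},
    Stage.PITCH: {Stage.PITCH, Stage.CLOSE},
    Stage.CLOSE: set(),
}


def next_turn(current: Stage, llm_output: str) -> Tuple[Stage, str]:
    """Validate the model's JSON reply; stay in place if it is off-script."""
    try:
        data = json.loads(llm_output)
        proposed = Stage(data["next_stage"])
        reply = str(data["reply"])
    except (ValueError, KeyError, TypeError):
        return current, ""  # malformed output: no transition, nothing spoken
    if proposed not in ALLOWED[current]:
        return current, ""  # illegal jump in the script
    return proposed, reply
```

For example, `next_turn(Stage.GREETING, '{"next_stage": "close", "reply": "Buy now"}')` leaves the call in `Stage.GREETING`, because the table does not allow jumping from greeting straight to close.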

4. Token-by-Token Synthesis

As the LLM generates the response text, it is streamed token-by-token into Cartesia or Deepgram Aura TTS, generating audio bytes that are piped directly back to Twilio.
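The streaming hand-off can be sketched with chained async generators. Everything here is a stand-in: `fake_llm_tokens` and `fake_tts` are hypothetical placeholders for the LLM and TTS clients, and "sending" just appends to a log. The sketch shows the ordering only: audio for the first token goes out before the last token has been generated.

```python
import asyncio
from typing import AsyncIterator, List

events: List[str] = []  # ordered log of what happened, for inspection


async def fake_llm_tokens() -> AsyncIterator[str]:
    """Stand-in for a streaming LLM client (hypothetical)."""
    for token in ["Thanks", " for", " calling."]:
        await asyncio.sleep(0)  # yield control, as a network read would
        events.append(f"llm:{token}")
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for a streaming TTS client: one audio chunk per token."""
    async for token in tokens:
        yield token.encode("utf-8")  # placeholder for synthesized audio bytes


async def pipe_to_caller() -> None:
    """Forward each audio chunk as soon as it exists."""
    async for audio in fake_tts(fake_llm_tokens()):
        events.append(f"send:{len(audio)}")  # a real handler would write to the socket


asyncio.run(pipe_to_caller())
```

After the run, `events` alternates between `llm:` and `send:` entries, which is the property that keeps perceived latency close to time-to-first-token rather than time-to-full-response.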

KEY DECISIONS

Why I built it this way

What we gained: Tensor & LLM Ecosystem

Node.js was a poor fit for processing large raw byte arrays (audio) and running ONNX models. Moving the core engine to Python (FastAPI) gave us access to native ML tooling and drastically reduced memory leaks.

What we sacrificed: Monolithic Simplicity

We had to maintain a dual-stack setup (Next.js frontend, Python backend), duplicating type definitions across boundaries and complicating CI/CD deployments.

THE SOLUTION

Streaming Orchestration

A dual-stack system (Next.js + Python FastAPI) holds open WebSocket connections with Twilio. Audio streams continuously into Deepgram for STT, while a custom ONNX model detects utterance ends. The orchestrator uses BAML to force the LLM to navigate a strict state machine, streaming the response directly into Cartesia TTS. Sub-200ms end-to-end turn latency on real calls.

TECHNICAL ARCHITECTURE

TELEPHONY LAYER: Twilio Media Streams, bidirectional WebSocket (WSS)
VOICE ORCHESTRATION ENGINE (PYTHON): FastAPI + Async Uvicorn, with ONNX Turn Detector, BAML Type-Safe LLM Router, and APScheduler
STT / TTS: Deepgram (Streaming Nova-3)
VECTOR DB: Qdrant (RAG Context)
DATA: PostgreSQL (Supabase + Redis)

RESULTS

Turn Latency: <200ms
Cloud Credits: $20k+
Built By: Solo
Archived: 2025

The system worked. AWS, Azure, Groq, DigitalOcean, and Modal.com all granted infrastructure credits, which we took as validation that the pitch was credible. We shelved it when the consumer Voice AI market commoditized faster than we could find a defensible position.

LEARNINGS

What Worked

  • Technical execution was sound: custom ONNX VAD, BAML state machine navigation, sub-200ms latency on real calls.
  • Raising $20k+ in cloud credits proved the idea had perceived value and gave us real infrastructure to build on.

What I'd Do Differently

  • Validate market timing before the first line of code. Voice AI commoditized in months. We needed a defensible niche, not a better general-purpose platform.
  • The dual-stack (Next.js + Python) was technically justified but doubled CI/CD complexity. A Python-only backend would have been faster to iterate on.

Python 3.11 · FastAPI · BAML · WebSockets · Qdrant