Archived 2024

SLOT AI

Real-time Voice AI orchestration platform. A technical deep-dive into sub-200ms conversational agents, WebSocket streaming, and the pragmatic reality of market timing.

<200ms Turn Latency
$20k+ Cloud Grants Raised
ONNX Turn Detection
WSS Streaming Audio

THE PROBLEM

Latency & State

Voice AI is fundamentally a latency and state-management problem. A 500ms delay feels like an eternity on a phone call. LLMs are stateless text generators, but sales calls require rigid adherence to a script and the ability to instantly halt audio playback when a human interrupts mid-sentence (barge-in).
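To make the latency constraint concrete, here is an illustrative per-turn budget. The component numbers are assumptions for the sketch, not measured figures from the project; only the <200ms total matches the stated target.

```python
# Illustrative latency budget for one conversational turn.
# Each stage overlaps via streaming, so only time-to-first-X counts.
BUDGET_MS = 200

budget = {
    "stt_final_transcript": 80,  # streaming STT flushes on end-of-utterance
    "llm_first_token":      60,  # time-to-first-token, not full completion
    "tts_first_audio":      50,  # TTS time-to-first-byte back to the caller
}

total = sum(budget.values())  # must stay under BUDGET_MS
```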

THE SOLUTION

Streaming Orchestration

Architected a dual-stack system (Next.js + Python FastAPI) that holds open WebSocket connections with Twilio. Audio streams continuously into Deepgram for STT, while a custom ONNX model detects utterance ends. The orchestrator uses BAML to force the LLM to navigate a strict state machine, streaming the response directly into Cartesia TTS.
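The shape of that orchestration loop can be sketched with stubs: inbound audio frames are pumped continuously, and a VAD-flagged end-of-utterance closes out the turn. The frame format and the `<eou>` sentinel are illustrative stand-ins for the real Deepgram/ONNX signals.

```python
# Duplex orchestration loop, sketched with stub transports.
# In production the buffer append would be an await on the Deepgram
# WebSocket, and "<eou>" would come from the ONNX turn detector.
import asyncio


async def caller_audio(frames):
    """Stand-in for the Twilio WebSocket read loop."""
    for f in frames:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield f


async def orchestrate(frames, transcripts_out):
    """Pump inbound audio until the VAD flags end-of-utterance."""
    buffer = []
    async for frame in caller_audio(frames):
        buffer.append(frame)
        if frame == b"<eou>":  # utterance end flagged by the VAD
            transcripts_out.append(b"".join(buffer[:-1]))
            buffer.clear()


out = []
asyncio.run(orchestrate([b"hi ", b"there", b"<eou>"], out))
```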

Engineering Perspective

"We secured over $20,000 in cloud infrastructure credits across AWS, Azure, and Groq to scale this. But the consumer Voice AI market commoditized in a matter of months. Shelving it taught me that shipping perfect code into a hyper-saturated red ocean is still a failing business strategy. Pragmatism over the sunk-cost fallacy."

REAL-TIME PIPELINE

1. Stream Ingestion

Twilio connects to the FastAPI backend via WebSockets. PCM audio streams continuously and is piped directly to Deepgram for real-time Speech-to-Text.

2. VAD & Turn Detection

A custom ONNX Voice Activity Detection model analyzes audio embeddings to detect end-of-utterance. If the user interrupts, a 'clear' signal is sent immediately to halt the ongoing TTS audio stream (barge-in).
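Both halves of this step can be sketched without the ONNX model itself: a trailing-silence counter turns per-frame VAD decisions into an end-of-utterance signal, and Twilio's `clear` event wipes any buffered TTS audio the instant the caller talks over the bot. The hangover threshold is an assumed value, not the project's tuned one.

```python
# Turn detection + barge-in sketch. Per-frame speech/no-speech decisions
# would come from the ONNX VAD; here they arrive as plain booleans.
import json

HANGOVER_FRAMES = 12  # ~240 ms of silence at 20 ms frames (assumed)


class TurnDetector:
    def __init__(self):
        self.silence = 0
        self.speaking = False

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True on end-of-utterance."""
        if is_speech:
            self.speaking = True
            self.silence = 0
            return False
        if self.speaking:
            self.silence += 1
            if self.silence >= HANGOVER_FRAMES:
                self.speaking = False
                self.silence = 0
                return True
        return False


def clear_frame(stream_sid: str) -> str:
    """Twilio Media Streams 'clear' event: drops queued outbound audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```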

3. Orchestration & RAG

The orchestrator routes the transcript through RouteLLM. BoundaryML (BAML) enforces strict schema outputs, navigating the sales script state-machine and querying Qdrant for document RAG context.
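The guardrail idea is that the LLM proposes a move but the script decides whether it is legal. BAML's typed schemas are project-specific, so this sketch stands in a plain transition table with made-up state names; an out-of-script move is rejected and the model re-prompted.

```python
# Sales-script state machine sketch. The LLM's structured output
# (schema-enforced via BAML in the real system) proposes a next state;
# only transitions in SCRIPT are accepted. State names are illustrative.
SCRIPT = {
    "greeting":  {"qualify", "objection"},
    "qualify":   {"pitch", "objection"},
    "objection": {"qualify", "close"},
    "pitch":     {"close", "objection"},
    "close":     set(),  # terminal state
}


def next_state(current: str, proposed: str) -> str:
    """Accept the LLM's proposed move only if the script allows it."""
    if proposed in SCRIPT.get(current, set()):
        return proposed
    return current  # illegal move: stay put and re-prompt the model
```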

4. Token-by-Token Synthesis

As the LLM generates the response text, it is streamed token-by-token into Cartesia or Deepgram Aura TTS, generating audio bytes that are piped directly back to Twilio.
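A common way to implement this (and a plausible reading of the pipeline, though the exact chunking policy is an assumption) is to buffer tokens and flush to the TTS engine at clause boundaries, so synthesis starts before the LLM finishes:

```python
# Token-to-TTS chunking sketch: flush at clause boundaries so audio
# generation can begin mid-completion. In production each yielded chunk
# would be sent to the Cartesia/Aura streaming endpoint.
FLUSH_ON = {".", "!", "?", ","}


def chunk_tokens(tokens):
    """Yield speakable chunks as soon as a clause boundary arrives."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        if tok and tok[-1] in FLUSH_ON:
            yield "".join(buf)
            buf = []
    if buf:  # trailing partial clause
        yield "".join(buf)


chunks = list(chunk_tokens(["Hi", ",", " this", " is", " Slot", "."]))
```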

TECHNICAL ARCHITECTURE

TELEPHONY LAYER
Twilio Media Streams
Bidirectional WebSocket (WSS)
VOICE ORCHESTRATION ENGINE (PYTHON)
FastAPI + Async Uvicorn
ONNX Turn Detector
BAML Type-Safe LLM Router
APScheduler
STT / TTS
Deepgram
Streaming Nova-3
VECTOR DB
Qdrant
RAG Context
DATA
PostgreSQL
Supabase + Redis

TRADE-OFFS & STACK

Splitting the Stack

What we gained: Tensor & LLM Ecosystem

Node.js is notoriously bad at processing heavy raw byte arrays (audio) and ONNX models. Moving the core engine to Python (FastAPI) gave us access to native ML tooling and drastically reduced memory leaks.
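The byte-level work in question is exactly the kind of thing Python handles cleanly: Twilio's mu-law frames must become linear PCM before any ML model sees them. A pure-Python G.711 mu-law decoder (standard algorithm, shown per-sample for clarity):

```python
# G.711 mu-law -> 16-bit linear PCM, one sample at a time.
# Segment base values: (0x84 << exp) - 0x84 for each exponent.
EXP_LUT = (0, 132, 396, 924, 1980, 4092, 8316, 16764)


def ulaw_decode(byte: int) -> int:
    """Decode one mu-law byte to a linear PCM sample in [-32124, 32124]."""
    byte = ~byte & 0xFF          # mu-law bytes are stored complemented
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = EXP_LUT[exponent] + (mantissa << (exponent + 3))
    return -sample if sign else sample
```

In practice the whole frame would be decoded in one vectorized pass (e.g. via a NumPy lookup table) rather than per sample; the loop form here is just for readability.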

What we sacrificed: Monolithic Simplicity

We had to maintain a dual-stack setup (Next.js frontend, Python backend), duplicating type definitions across boundaries and complicating CI/CD deployments.

Python 3.11 · FastAPI · BAML · WebSockets · Qdrant