Deploying Voice Agents to Production
These are my notes from the Voice Agents Course session on deploying voice agents to production.
TLDR:
- Use a voice AI provider for simple, scalable production deployment.
- Use a single VM or your homelab for demos.
Differences between voice agents and traditional apps are mostly in the transport:
- persistent connection (minutes)
- bidirectional streaming
- stateful sessions
Voice agents in production need:
- An HTTP service for
  - API endpoints,
  - a website,
  - webhooks, and
  - spawning bots.
- A media transport layer or service:
  - WebRTC-based for client-to-server (UDP), or
  - WebSocket-based for server-to-server (TCP).
- Bots (UDP or TCP, connected to the media transport layer).
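The "spawning bots" part of the HTTP service can be sketched with nothing but the Python standard library. This is a rough illustration, not how any particular provider does it: the `/start` path, the `SESSIONS` dict, and `spawn_bot` are all hypothetical names, and `spawn_bot` only records the session instead of starting a real bot process or container.

```python
import json
import uuid
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-memory registry: session_id -> status.
# A real service would track bot processes or containers here.
SESSIONS = {}


def spawn_bot(session_id):
    """Placeholder: a real implementation would launch a bot for this session."""
    SESSIONS[session_id] = "starting"


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one (assumed) endpoint: POST /start creates a session.
        if self.path != "/start":
            self.send_error(404)
            return
        session_id = uuid.uuid4().hex
        spawn_bot(session_id)
        body = json.dumps({"session_id": session_id}).encode()
        self.send_response(201)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the sketch quiet; remove to get request logs on stderr.
        pass


def make_server(port=0):
    """Create the server bound to localhost; port 0 picks a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

To try it, call `make_server(8080).serve_forever()` and send a `POST` to `/start`. The client would then use the returned session ID to join the media transport, which this sketch does not cover.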
Bots:
- Are instances of the agents.
- Can be written in Python with Pipecat.
- Use STT, LLM, and TTS providers, which are also the main cost and latency drivers.
- Usually come packaged with small models, e.g. for voice activity detection (VAD).
- Each spawned bot serves one session and needs resources allocated for the whole session:
  - 0.5 vCPU
  - 1 GB RAM
  - 40 kbps for WebRTC audio (typical range: 30-60 kbps)
  - video requires more CPU (e.g. 1 vCPU) and more bandwidth
- Need to be available quickly. Target time-to-first-word:
  - 2-3 s (web),
  - 3-5 s (phone)
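The per-bot figures above make a back-of-the-envelope capacity estimate possible. The sketch below assumes audio-only bots with the numbers from these notes. The `headroom` parameter (capacity reserved for the OS and the HTTP service) and the host sizes in the usage note are my own assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotFootprint:
    """Per-session resources, using the audio-only figures from the notes."""
    vcpu: float = 0.5
    ram_gb: float = 1.0
    kbps: float = 40.0


def max_concurrent_bots(host_vcpu, host_ram_gb, host_mbps,
                        bot=BotFootprint(), headroom=0.2):
    """Return how many bots fit on one host after reserving `headroom`.

    The scarcest resource (CPU, RAM, or bandwidth) sets the limit.
    """
    usable = 1.0 - headroom
    by_cpu = int(host_vcpu * usable / bot.vcpu)
    by_ram = int(host_ram_gb * usable / bot.ram_gb)
    by_net = int(host_mbps * 1000 * usable / bot.kbps)
    return min(by_cpu, by_ram, by_net)
```

For example, `max_concurrent_bots(8, 16, 100)` gives 12 for a hypothetical 8 vCPU / 16 GB / 100 Mbps host: CPU and RAM run out long before audio bandwidth does.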
Ways to solve the “fast start challenge”:
- percentage-based warm pool
- fast startup times (caching, pre-loading)
- proactive/predictive scheduling
- fallbacks from the reactive world (e.g. UX-based solutions, not just silent failures)
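A percentage-based warm pool can be sketched as a small sizing function: keep a share of the current load ready as idle bots, clamped between a floor and a ceiling. The 20% default, the bounds, and the function names are assumptions for illustration, not values from the session.

```python
def warm_pool_target(active_sessions, warm_percent=20, min_warm=2, max_warm=50):
    """Idle bots to keep ready: a percentage of active sessions, clamped.

    Integer arithmetic rounds up, so 11 sessions at 20% need 3 warm bots.
    """
    wanted = (active_sessions * warm_percent + 99) // 100
    return max(min_warm, min(max_warm, wanted))


def bots_to_start(active_sessions, idle_bots, **kwargs):
    """How many extra bots to launch now to reach the warm pool target."""
    return max(0, warm_pool_target(active_sessions, **kwargs) - idle_bots)
```

With 100 active sessions and 15 idle bots, `bots_to_start(100, 15)` returns 5. The floor keeps a couple of bots warm even when nothing is running, which is what covers the first caller after a quiet period.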
Infra providers:
- need to support TCP and UDP
- voice AI providers are easiest (Pipecat Cloud, Daily, Vapi, Layercode)
- Fly.io (and potentially other container platforms) is a good fit as long as the platform supports UDP (Fly does)
- ML-focused providers (GPU clouds) are good for converged bots that include larger models
- hyperscalers are flexible but complex
- BTW: Cloud Run does not support UDP
- demos can run on a single VM or even be served from a homelab
- serving everything converged can get time-to-first-word down to 500 ms
- otherwise, 800-1000 ms is good enough and achievable
- proximity to users matters (Daily plans global regions for Pipecat Cloud; currently only us-west)
- connections between servers can be implemented with WebSockets
After the session I looked at the Pipecat examples and realized it should be easy enough to run a basic voice agent with Pipecat’s SmallWebRTCTransport on a virtual server hosted in Europe, then switch the transport and deploy it to production on Pipecat Cloud.
I will probably try that to see whether the US-based Pipecat Cloud is fast enough for users in Europe and how much difference the longer network round-trip time makes.
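A first rough comparison of round-trip times could time TCP handshakes from a European machine against servers in different regions. This is only a proxy: WebRTC media runs over UDP, the measurement includes DNS lookup when a hostname is given, and the function name and defaults below are my own.

```python
import socket
import statistics
import time


def tcp_connect_rtt_ms(host, port, samples=5, timeout=3.0):
    """Median time in milliseconds to open a TCP connection to host:port.

    A TCP handshake takes roughly one network round trip, so the result
    approximates the RTT (plus DNS resolution if `host` is a name).
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)
```

Calling `tcp_connect_rtt_ms("example.com", 443)` once against a European host and once against a US West host would show how much extra latency each network round trip adds to the time-to-first-word.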
This post is licensed under CC BY 4.0 by the author.