Deploying Voice Agents to Production

These are my notes from the Voice Agents Course session on deploying voice agents to production.

TLDR:

  • Use a voice AI provider for a simple, scalable production deployment.
  • Use a single VM or your homelab for demos.

Differences between voice agents and traditional apps are mostly in the transport:

  • persistent connections (lasting minutes)
  • bidirectional streaming
  • stateful sessions

Voice agents in production need:

  • An HTTP service for
    • API endpoints,
    • a website,
    • webhooks, and
    • spawning bots (see the sketch after this list).
  • A media transport layer or service:
    • WebRTC-based for client-to-server (UDP), or
    • WebSocket-based for server-to-server (TCP).
  • Bots (connecting to the media transport layer over UDP or TCP).
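
Here is a rough sketch of the HTTP side, assuming FastAPI; bot.py is a hypothetical bot entry point, and in production the spawn would be a container or platform API call rather than a local process:

```python
# Minimal sketch of the HTTP service that spawns one bot per session.
# Assumes FastAPI; "bot.py" is a hypothetical bot entry point.
import subprocess
import uuid

from fastapi import FastAPI

app = FastAPI()


@app.post("/start")
async def start_session():
    # One bot instance per session; in production this would be a
    # container launch or a platform API call, not a local process.
    session_id = str(uuid.uuid4())
    subprocess.Popen(["python", "bot.py", "--session", session_id])
    return {"session_id": session_id}
```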

Bots:

  • Are instances of the agent.
  • Can be written in Python with Pipecat (see the sketch after this list).
  • Use STT, LLM, and TTS providers, which are also the main cost and latency drivers.
  • Usually come packaged with small models, e.g. for voice activity detection (VAD).
  • Each spawned bot serves one session and needs resources allocated for the whole session:
    • 0.5 vCPU
    • 1 GB RAM
    • ~40 kbps for WebRTC audio (typically in the 30-60 kbps range)
    • video requires more CPU (e.g. 1 vCPU) and more bandwidth
  • Need to be available quickly. Target time-to-first-word:
    • 2-3 seconds (web),
    • 3-5 seconds (phone)
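
As a rough sketch of what such a bot looks like in Pipecat (import paths and constructor arguments vary between versions, so treat this as the shape rather than the exact API):

```python
# Sketch of a Pipecat bot: one STT -> LLM -> TTS pipeline per session.
# Import paths and constructor arguments vary between Pipecat versions;
# this follows the shape of the official foundational examples.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService


async def run_bot(transport):
    # STT, LLM, and TTS providers: the main cost and latency drivers.
    stt = DeepgramSTTService(api_key="...")
    llm = OpenAILLMService(api_key="...", model="gpt-4o")
    tts = CartesiaTTSService(api_key="...", voice_id="...")

    # Conversation context so the LLM sees the transcribed user turns.
    context = OpenAILLMContext(
        [{"role": "system", "content": "You are a helpful voice assistant."}]
    )
    agg = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),   # audio in from the media transport layer
        stt,                 # speech-to-text
        agg.user(),          # collect user turns into the context
        llm,                 # generate a response
        tts,                 # text-to-speech
        transport.output(),  # audio out to the media transport layer
        agg.assistant(),     # collect assistant turns into the context
    ])
    await PipelineRunner().run(PipelineTask(pipeline))
```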

Ways to solve the “fast start challenge”:

  • percentage-based warm pool (see the sketch after this list)
  • fast startup times (caching, pre-loading)
  • proactive/predictive scheduling
  • fallbacks for the reactive path (e.g. UX-based mitigations rather than silent failures)
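
Of these, the percentage-based warm pool is simple enough to sketch in plain Python: keep a number of pre-started bots proportional to the active sessions, so new sessions grab a warm bot instead of waiting for a cold start. The ratio and the start_bot() helper are illustrative:

```python
# Sketch of a percentage-based warm pool: keep ceil(active * ratio)
# bots pre-started so new sessions skip the cold-start latency.
import asyncio
import math

WARM_RATIO = 0.10  # illustrative: 10% of active sessions
MIN_WARM = 2       # illustrative floor

warm_pool: asyncio.Queue = asyncio.Queue()
active_sessions = 0


async def start_bot():
    """Hypothetical: start a bot process and return a handle to it."""
    ...


async def maintain_pool():
    # Background task: top the pool up to the current target size.
    while True:
        target = max(MIN_WARM, math.ceil(active_sessions * WARM_RATIO))
        while warm_pool.qsize() < target:
            warm_pool.put_nowait(await start_bot())
        await asyncio.sleep(1)


async def get_bot():
    # Warm path: take a pre-started bot; cold path: start one reactively.
    try:
        return warm_pool.get_nowait()
    except asyncio.QueueEmpty:
        return await start_bot()
```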

Infra providers:

  • need to support TCP and UDP
  • voice AI providers are easiest (Pipecat Cloud, Daily, Vapi, Layercode)
  • Fly.io (and potentially other container platforms) is a good option if UDP is supported (Fly supports it)
  • ML-focused providers (GPU clouds) are good for converged bots that include larger models
  • hyperscalers are flexible but complex
  • BTW: Cloud Run does not support UDP
  • demos can run on single VMs or even be served from a homelab
  • by serving everything converged, time-to-first-word can get down to 500 ms
  • otherwise, 800-1000 ms is good enough and achievable
  • proximity to users matters (Daily plans global regions for Pipecat Cloud; currently it is only us-west)
  • connections between servers can be implemented with WebSockets (see the sketch below)
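
For that server-to-server leg, a minimal sketch with the websockets library (the endpoint URL and the idea of sending raw audio chunks as binary messages are my assumptions):

```python
# Sketch: forwarding raw audio between servers over a WebSocket (TCP).
# Uses the "websockets" library; endpoint URL and frame format assumed.
import websockets


async def forward_audio(frames):
    # frames: an async iterator of raw audio chunks (bytes)
    async with websockets.connect("wss://bots.example.com/session/123") as ws:
        async for frame in frames:
            await ws.send(frame)  # bytes are sent as a binary message
```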

After the session, I looked at the Pipecat examples and realized it should be easy enough to run a basic voice agent with Pipecat's SmallWebRTCTransport on a virtual server hosted in Europe, and then switch the transport to deploy it to production on Pipecat Cloud.
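
If that works, switching should mostly mean constructing a different transport object while the pipeline stays the same. A hedged sketch (the constructor arguments and environment variable names are my assumptions, not the documented API):

```python
# Sketch: pick the transport via an environment variable and keep the
# bot pipeline identical. Constructor arguments are illustrative only;
# check the Pipecat docs for the exact signatures in your version.
import os


def make_transport(webrtc_connection=None):
    if os.environ.get("ENV") == "production":
        # Daily's WebRTC transport, as used on Pipecat Cloud
        from pipecat.transports.services.daily import DailyTransport

        return DailyTransport(
            os.environ["DAILY_ROOM_URL"],  # assumed env variable names
            os.environ["DAILY_TOKEN"],
            "bot",
        )
    # Local demo: Pipecat's built-in SmallWebRTCTransport
    from pipecat.transports.network.small_webrtc import SmallWebRTCTransport

    return SmallWebRTCTransport(webrtc_connection=webrtc_connection)
```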

I will probably try that to see whether the US-based Pipecat Cloud is fast enough for users in Europe and how much of a difference the longer network round-trip time makes.
