Voice agents

Build real-time voice agents with speech-to-text, text-to-speech, and conversation persistence. Audio streams over WebSocket, so no SFU or meeting infrastructure is required. This feature is in beta.

Overview

@cloudflare/voice provides two server-side mixins and matching client libraries:

| Export | Import | Purpose |
| --- | --- | --- |
| withVoice | @cloudflare/voice | Full voice agent: STT, LLM, TTS, persistence |
| withVoiceInput | @cloudflare/voice | STT-only: transcription without response |
| useVoiceAgent | @cloudflare/voice/react | React hook for withVoice agents |
| useVoiceInput | @cloudflare/voice/react | React hook for withVoiceInput agents |
| VoiceClient | @cloudflare/voice/client | Framework-agnostic client |

Built on Cloudflare Durable Objects, you get:

  • Real-time audio — mic audio streams as binary WebSocket frames, TTS audio streams back
  • Automatic conversation persistence — messages stored in SQLite, survive restarts
  • Streaming TTS — LLM tokens are sentence-chunked and synthesized concurrently
  • Interruption handling — user speech during playback cancels the current response
  • Continuous STT — per-call transcriber session, model handles turn detection
  • Pipeline hooks — intercept and transform text at every stage

Quick start

Install

Terminal window
npm install @cloudflare/voice agents

Server

JavaScript
import { Agent } from "agents";
import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript, context) {
    return "Hello! I heard you say: " + transcript;
  }
}

Client (React)

import { useVoiceAgent } from "@cloudflare/voice/react";

function VoiceUI() {
  const {
    status,
    transcript,
    interimTranscript,
    audioLevel,
    isMuted,
    startCall,
    endCall,
    toggleMute,
  } = useVoiceAgent({ agent: "MyAgent" });

  return (
    <div>
      <p>Status: {status}</p>
      <button onClick={status === "idle" ? startCall : endCall}>
        {status === "idle" ? "Start Call" : "End Call"}
      </button>
      <button onClick={toggleMute}>{isMuted ? "Unmute" : "Mute"}</button>
      {interimTranscript && (
        <p>
          <em>{interimTranscript}</em>
        </p>
      )}
      {transcript.map((msg, i) => (
        <p key={i}>
          <strong>{msg.role}:</strong> {msg.text}
        </p>
      ))}
    </div>
  );
}

Wrangler configuration

JSONC
{
  "ai": {
    "binding": "AI"
  },
  "durable_objects": {
    "bindings": [
      {
        "name": "MyAgent",
        "class_name": "MyAgent"
      }
    ]
  },
  "migrations": [
    {
      "tag": "v1",
      "new_sqlite_classes": ["MyAgent"]
    }
  ]
}

How it works

  Browser                             Durable Object (withVoice)
┌──────────┐                         ┌──────────────────────────┐
│ Mic      │   binary PCM (16kHz)    │ Transcriber session      │
│          │ ──────────────────────► │ (per-call, continuous)   │
│          │                         │ ↓ model detects turn     │
│          │   JSON: transcript      │ onTurn() → your LLM code │
│          │ ◄────────────────────── │ ↓ (sentence chunking)    │
│          │   binary: audio         │ TTS                      │
│ Speaker  │ ◄────────────────────── │                          │
└──────────┘                         └──────────────────────────┘

  1. The client captures mic audio and sends it as binary WebSocket frames (16kHz mono 16-bit PCM).
  2. Audio streams continuously to the transcriber session (created at start_call, lives for the entire call).
  3. The STT model detects when the user finishes an utterance and fires onUtterance. All providers use model-driven turn detection — the client does not need to signal end-of-speech for STT.
  4. Your onTurn() method runs — typically an LLM call.
  5. The response is sentence-chunked and synthesized via TTS.
  6. Audio streams back to the client for playback.

The client receives transcript_interim messages with partial results as the user speaks, so you can show real-time feedback in the UI.
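
Step 1 mentions 16-bit PCM, while the Web Audio API produces 32-bit float samples between -1 and 1. The built-in client handles capture for you; the sketch below only illustrates the kind of conversion involved and may be useful if you produce frames yourself. `floatTo16BitPCM` is a hypothetical helper name, not part of @cloudflare/voice.

```javascript
// Hypothetical helper: convert Web Audio float samples (-1..1) to 16-bit PCM.
function floatTo16BitPCM(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // The Int16 range is asymmetric: -32768..32767.
    pcm[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
  }
  return pcm;
}

// 20 ms of audio at 16 kHz mono is 320 samples, or 640 bytes per frame.
const frame = floatTo16BitPCM(new Float32Array(320));
console.log(frame.byteLength); // 640
```

The 20 ms frame size is an illustrative assumption; this page does not state the frame size the client actually uses.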

Server API: withVoice

withVoice(Agent) adds the full voice pipeline to an Agent class.

Providers

Set providers as class properties. Class field initializers run after super(), so this.env is available.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| transcriber | Transcriber | Yes | Continuous per-call STT provider |
| tts | TTSProvider | Yes | Text-to-speech |

JavaScript
import { Agent } from "agents";
import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);
}

For runtime model switching (for example, a Flux vs Nova 3 dropdown), override createTranscriber:

JavaScript
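export class MyAgent extends VoiceAgent {
  tts = new WorkersAITTS(this.env.AI);

  createTranscriber(connection) {
    // Return the transcriber for this call, for example based on a user setting.
    // This example always returns Flux.
    return new WorkersAIFluxSTT(this.env.AI);
  }
}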

onTurn(transcript, context)

Required. Called when the user finishes speaking and the transcript is ready.

Return a string, AsyncIterable<string>, or ReadableStream for streaming responses.

Simple response:

JavaScript
export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript, context) {
    return "You said: " + transcript;
  }
}

Streaming response (recommended for LLM):

JavaScript
import { streamText } from "ai";
import { createWorkersAI } from "workers-ai-provider";

export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript, context) {
    const workersai = createWorkersAI({ binding: this.env.AI });
    const result = streamText({
      model: workersai("@cf/moonshotai/kimi-k2.5"),
      system: "You are a helpful voice assistant. Keep responses concise.",
      messages: [
        // Prior turns loaded from SQLite
        ...context.messages.map((m) => ({
          role: m.role,
          content: m.content,
        })),
        { role: "user", content: transcript },
      ],
      // Cancels the LLM call on interrupt or disconnect
      abortSignal: context.signal,
    });
    return result.textStream;
  }
}
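
When onTurn returns a stream, the pipeline sentence-chunks the text before TTS. The library does this for you and its exact rules are not documented here; the sketch below is only one plausible way to split streamed tokens into sentences. `SentenceChunker` is a hypothetical name, not an export of @cloudflare/voice.

```javascript
// Hypothetical sketch of sentence chunking; not the library's implementation.
class SentenceChunker {
  constructor() {
    this.buffer = "";
  }

  // Add a streamed token and return any sentences that are now complete.
  push(token) {
    this.buffer += token;
    const sentences = [];
    // Assumption: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    // Keep the unfinished tail for the next token.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Return whatever is left once the stream ends.
  flush() {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}

const chunker = new SentenceChunker();
const spoken = [];
for (const token of ["Hi", " there.", " How", " are", " you?", " Fine"]) {
  spoken.push(...chunker.push(token));
}
spoken.push(...chunker.flush());
console.log(spoken); // ["Hi there.", "How are you?", "Fine"]
```

A rule this simple also splits after abbreviations such as "Dr.", which is one reason short, plainly punctuated responses tend to work well for voice.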

The context object provides:

| Field | Type | Description |
| --- | --- | --- |
| connection | Connection | The WebSocket connection |
| messages | Array<{ role: string; content: string }> | Conversation history from SQLite |
| signal | AbortSignal | Aborted on interrupt or disconnect |
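
If you generate text yourself rather than passing the signal to an LLM client, checking signal.aborted between chunks lets a turn stop promptly after an interrupt. The sketch below uses a synchronous generator and a local AbortController so it runs on its own; in a real onTurn the same check would sit inside an async generator that reads context.signal. `replyUntilAborted` is a hypothetical name.

```javascript
// Hypothetical generator: yields reply sentences until the signal aborts.
function* replyUntilAborted(sentences, signal) {
  for (const sentence of sentences) {
    // Stop producing text as soon as the turn has been cancelled.
    if (signal.aborted) return;
    yield sentence;
  }
}

const controller = new AbortController();
const delivered = [];
for (const s of replyUntilAborted(["One.", "Two.", "Three."], controller.signal)) {
  delivered.push(s);
  // Simulate the user interrupting after the first sentence.
  controller.abort();
}
console.log(delivered); // ["One."]
```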

Lifecycle hooks

| Method | Description |
| --- | --- |
| beforeCallStart(connection) | Return false to reject the call |
| onCallStart(connection) | Called after a call is accepted |
| onCallEnd(connection) | Called when a call ends |
| onInterrupt(connection) | Called when user interrupts during playback |

Pipeline hooks

Intercept and transform data at each pipeline stage. Return null to skip the current utterance.

| Method | Receives | Can skip? |
| --- | --- | --- |
| afterTranscribe(transcript, connection) | STT text | Yes |
| beforeSynthesize(text, connection) | Text before TTS | Yes |
| afterSynthesize(audio, text, connection) | Audio after TTS | Yes |

JavaScript
export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  afterTranscribe(transcript, connection) {
    // Skip very short transcripts such as stray noises.
    if (transcript.length < 3) return null;
    return transcript;
  }

  beforeSynthesize(text, connection) {
    // Spell out the acronym so the TTS voice reads it letter by letter.
    return text.replace(/\bAI\b/g, "A.I.");
  }

  async onTurn(transcript, context) {
    return transcript;
  }
}
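
Because the text hooks are plain functions from a string to a string or null, one option is to keep the logic in standalone helpers and call them from afterTranscribe and beforeSynthesize, which makes the rules easy to test without a call. The helper names below are hypothetical.

```javascript
// Hypothetical helpers that a hook could delegate to.

// Returns null for transcripts too short to be meaningful, which skips the utterance.
function dropShortTranscripts(transcript, minLength = 3) {
  const trimmed = transcript.trim();
  return trimmed.length < minLength ? null : trimmed;
}

// Rewrites tokens that a TTS voice may mispronounce.
function expandForSpeech(text) {
  return text.replace(/\bAI\b/g, "A.I.");
}

console.log(dropShortTranscripts("uh")); // null
console.log(dropShortTranscripts("  turn on the lights ")); // "turn on the lights"
console.log(expandForSpeech("AI beats the AIR show")); // "A.I. beats the AIR show"
```

The word boundaries in the pattern keep longer words such as "AIR" untouched.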

Convenience methods

| Method | Description |
| --- | --- |
| speak(connection, text) | Synthesize and send audio to one connection |
| speakAll(text) | Synthesize and send audio to all connections |
| forceEndCall(connection) | Programmatically end a call |
| saveMessage(role, text) | Persist a message to conversation history |
| getConversationHistory() | Retrieve conversation history from SQLite |

Configuration options

Pass options to withVoice() as the second argument:

JavaScript
const VoiceAgent = withVoice(Agent, {
  historyLimit: 20,
  audioFormat: "mp3",
  maxMessageCount: 1000,
});

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| historyLimit | number | 20 | Max messages loaded for context |
| audioFormat | string | "mp3" | Audio format sent to client |
| maxMessageCount | number | 1000 | Max messages stored in SQLite |
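
The two numeric limits do different jobs: maxMessageCount caps what is stored, while historyLimit caps what is loaded into context.messages. The in-memory model below is a sketch of one plausible reading, assuming the oldest messages are dropped first; the real store is SQLite and its pruning behavior is not described on this page. Both function names are hypothetical.

```javascript
// Hypothetical in-memory model of the two limits; the real store is SQLite.
function appendMessage(store, message, maxMessageCount = 1000) {
  const next = [...store, message];
  // Assumption: once the cap is exceeded, the oldest messages are dropped.
  return next.length > maxMessageCount
    ? next.slice(next.length - maxMessageCount)
    : next;
}

function contextWindow(store, historyLimit = 20) {
  // Only the most recent messages are handed to onTurn as context.
  return historyLimit > 0 ? store.slice(-historyLimit) : [];
}

let store = [];
for (let i = 1; i <= 5; i++) {
  store = appendMessage(store, { role: "user", content: `message ${i}` }, 4);
}
console.log(store.map((m) => m.content)); // messages 2 through 5
console.log(contextWindow(store, 2).map((m) => m.content)); // ["message 4", "message 5"]
```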

Server API: withVoiceInput

withVoiceInput(Agent) adds STT-only voice input — no TTS, no LLM, no response generation. Use this for dictation, search-by-voice, or any UI where you need speech-to-text without a conversational agent.

JavaScript
import { Agent } from "agents";
import { withVoiceInput, WorkersAINova3STT } from "@cloudflare/voice";

const InputAgent = withVoiceInput(Agent);

export class DictationAgent extends InputAgent {
  transcriber = new WorkersAINova3STT(this.env.AI);

  onTranscript(text, connection) {
    console.log("User said:", text);
  }
}

onTranscript(text, connection)

Called after each utterance is transcribed. Override this to process the transcript.

Hooks

withVoiceInput supports the same lifecycle hooks as withVoice:

  • beforeCallStart(connection) — return false to reject
  • onCallStart(connection), onCallEnd(connection), onInterrupt(connection)
  • createTranscriber(connection) — override for runtime model switching
  • afterTranscribe(transcript, connection) — filter or transform transcripts

It does not have TTS hooks (beforeSynthesize, afterSynthesize) or onTurn.

Client API: React hooks

useVoiceAgent

Wraps VoiceClient for withVoice agents. Manages connection, mic capture, playback, silence detection, and interrupt detection.

import { useVoiceAgent } from "@cloudflare/voice/react";

const {
  status, // "idle" | "listening" | "thinking" | "speaking"
  transcript, // TranscriptMessage[]: conversation history
  interimTranscript, // string | null: real-time partial transcript
  metrics, // VoicePipelineMetrics | null
  audioLevel, // number (0–1): current mic RMS level
  isMuted, // boolean
  connected, // boolean: WebSocket connected
  error, // string | null
  startCall, // () => Promise<void>
  endCall, // () => void
  toggleMute, // () => void
  sendText, // (text: string) => void, bypasses STT
  sendJSON, // (data: Record<string, unknown>) => void
  lastCustomMessage, // unknown: last non-voice message from server
} = useVoiceAgent({
  agent: "MyAgent",
  name: "default",
  host: window.location.host,
});

Tuning options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| silenceThreshold | number | 0.04 | RMS below this is silence |
| silenceDurationMs | number | 500 | Silence duration before end_of_speech (ms) |
| interruptThreshold | number | 0.05 | RMS to detect speech during playback |
| interruptChunks | number | 2 | Consecutive high-RMS chunks to trigger interrupt |

Changing tuning options triggers a client reconnect (the connection key includes them).
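
To build intuition for these options, the sketch below shows how RMS thresholds of this kind can drive silence and interrupt detection. It uses the documented defaults, but it is not the client's actual code, and the function names are hypothetical.

```javascript
// Hypothetical sketch of RMS-based detection; not the client's actual code.

// Root-mean-square level of one chunk of float samples (-1..1).
function rms(samples) {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Reports true once the level has stayed below the threshold long enough.
function createSilenceDetector({ silenceThreshold = 0.04, silenceDurationMs = 500 } = {}) {
  let silentSince = null;
  return (level, nowMs) => {
    if (level >= silenceThreshold) {
      silentSince = null; // speech resets the timer
      return false;
    }
    if (silentSince === null) silentSince = nowMs;
    return nowMs - silentSince >= silenceDurationMs;
  };
}

// Reports true after enough consecutive loud chunks during playback.
function createInterruptDetector({ interruptThreshold = 0.05, interruptChunks = 2 } = {}) {
  let loudChunks = 0;
  return (level) => {
    loudChunks = level >= interruptThreshold ? loudChunks + 1 : 0;
    return loudChunks >= interruptChunks;
  };
}

const isSilent = createSilenceDetector();
console.log(isSilent(0.2, 0)); // false: speech
console.log(isSilent(0.01, 100)); // false: silence just started
console.log(isSilent(0.01, 600)); // true: 500 ms of silence

const isInterrupt = createInterruptDetector();
console.log(isInterrupt(0.3)); // false: first loud chunk
console.log(isInterrupt(0.3)); // true: second consecutive loud chunk
```

Requiring several consecutive loud chunks makes a single click or cough less likely to cut off playback, at the cost of a slightly slower reaction to real speech.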

useVoiceInput

Lightweight hook for dictation and voice-to-text. Accumulates user transcripts into a single string.

import { useVoiceInput } from "@cloudflare/voice/react";

function Dictation() {
  const {
    transcript, // string: accumulated text from all utterances
    interimTranscript, // string | null: current partial transcript
    isListening, // boolean
    audioLevel, // number (0–1)
    isMuted, // boolean
    error, // string | null
    start, // () => Promise<void>
    stop, // () => void
    toggleMute, // () => void
    clear, // () => void, clears the accumulated transcript
  } = useVoiceInput({ agent: "DictationAgent" });

  return (
    <div>
      <textarea
        value={
          transcript + (interimTranscript ? " " + interimTranscript : "")
        }
        readOnly
      />
      <button onClick={isListening ? stop : start}>
        {isListening ? "Stop" : "Dictate"}
      </button>
    </div>
  );
}

Client API: VoiceClient

Framework-agnostic client for environments without React.

JavaScript
import { VoiceClient } from "@cloudflare/voice/client";

const client = new VoiceClient({ agent: "MyAgent" });

client.addEventListener("statuschange", (status) => {
  console.log("Status:", status);
});
client.addEventListener("transcriptchange", (messages) => {
  console.log("Transcript:", messages);
});
client.addEventListener("error", (err) => {
  console.error("Error:", err);
});

client.connect();
await client.startCall();

// Later:
client.endCall();
client.disconnect();

Events

| Event | Data type | Description |
| --- | --- | --- |
| statuschange | VoiceStatus | Pipeline state changed |
| transcriptchange | TranscriptMessage[] | Transcript updated |
| interimtranscript | string \| null | Interim transcript from streaming STT |
| metricschange | VoicePipelineMetrics | Pipeline timing metrics |
| audiolevelchange | number | Mic audio level (0–1) |
| connectionchange | boolean | WebSocket connected/disconnected |
| mutechange | boolean | Mute state changed |
| error | string \| null | Error occurred |
| custommessage | unknown | Non-voice message from server |

Advanced options

| Option | Type | Description |
| --- | --- | --- |
| transport | VoiceTransport | Custom transport (default: WebSocket via PartySocket) |
| audioInput | VoiceAudioInput | Custom mic capture (default: built-in AudioWorklet) |
| preferredFormat | VoiceAudioFormat | Hint for server audio format (advisory only) |

Providers

Built-in (Workers AI)

No API keys required — use your Workers AI binding:

| Class | Type | Default model | Recommended for |
| --- | --- | --- | --- |
| WorkersAIFluxSTT | Continuous STT | @cf/deepgram/flux | withVoice |
| WorkersAINova3STT | Continuous STT | @cf/deepgram/nova-3 | withVoiceInput |
| WorkersAITTS | TTS | @cf/deepgram/aura-1 | Both |

JavaScript
import { Agent } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAINova3STT,
  WorkersAITTS,
} from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

// Default usage
export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);
}

// Custom options
export class CustomAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI, {
    eotThreshold: 0.8,
    keyterms: ["Cloudflare", "Workers"],
  });
  tts = new WorkersAITTS(this.env.AI, {
    model: "@cf/deepgram/aura-1",
    speaker: "asteria",
  });
}

Third-party providers

| Package | Class | Description |
| --- | --- | --- |
| @cloudflare/voice-deepgram | DeepgramSTT | Continuous STT |
| @cloudflare/voice-elevenlabs | ElevenLabsTTS | High-quality TTS |
| @cloudflare/voice-twilio | TwilioAdapter | Telephony (phone calls) |

ElevenLabs TTS:

JavaScript
import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";

export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new ElevenLabsTTS({
    apiKey: this.env.ELEVENLABS_API_KEY,
    voiceId: "21m00Tcm4TlvDq8ikWAM",
  });
}

Deepgram STT:

JavaScript
import { DeepgramSTT } from "@cloudflare/voice-deepgram";

export class MyAgent extends VoiceAgent {
  transcriber = new DeepgramSTT({
    apiKey: this.env.DEEPGRAM_API_KEY,
  });
  tts = new WorkersAITTS(this.env.AI);
}

Telephony (Twilio)

Connect phone calls to your voice agent using the Twilio adapter:

Terminal window
npm install @cloudflare/voice-twilio

The adapter bridges Twilio Media Streams to your VoiceAgent:

Phone → Twilio → WebSocket → TwilioAdapter → WebSocket → VoiceAgent

WorkersAITTS returns MP3, which cannot be decoded to PCM in the Workers runtime. When using the Twilio adapter, use a TTS provider that outputs raw PCM (for example, ElevenLabs with outputFormat: "pcm_16000").

Text messages

withVoice agents can also receive text messages, bypassing STT entirely. This is useful for chat-style input alongside voice.

const { sendText } = useVoiceAgent({ agent: "MyAgent" });
// Send text — goes straight to onTurn() without STT
sendText("What is the weather like today?");

Text messages work both during and outside of active calls. During a call, the response is spoken aloud via TTS. Outside a call, the response is sent as text-only transcript messages.

Custom messages

Send and receive application-level JSON messages alongside voice protocol messages. Non-voice messages pass through to your onMessage handler on the server and emit custommessage events on the client.

Server:

JavaScript
export class MyAgent extends VoiceAgent {
  onMessage(connection, message) {
    const data = JSON.parse(message);
    if (data.type === "kick_speaker") {
      this.forceEndCall(connection);
    }
  }
}

Client:

const { sendJSON, lastCustomMessage } = useVoiceAgent({ agent: "MyAgent" });

sendJSON({ type: "kick_speaker" });

useEffect(() => {
  if (lastCustomMessage) {
    console.log("Custom message:", lastCustomMessage);
  }
}, [lastCustomMessage]);
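
On the server, JSON.parse throws on malformed input, so an onMessage handler may want to validate a message before acting on it. The helper below is a defensive sketch under the assumption that application messages are JSON objects with a string type field, as in the kick_speaker example; `parseCustomMessage` is a hypothetical name.

```javascript
// Hypothetical helper: returns a parsed object with a string `type`, or null.
function parseCustomMessage(message) {
  // Binary frames and other non-string payloads are not application JSON.
  if (typeof message !== "string") return null;
  try {
    const data = JSON.parse(message);
    const isObject = data !== null && typeof data === "object" && !Array.isArray(data);
    return isObject && typeof data.type === "string" ? data : null;
  } catch {
    // Malformed JSON should not crash the handler.
    return null;
  }
}

console.log(parseCustomMessage('{"type":"kick_speaker"}')); // { type: "kick_speaker" }
console.log(parseCustomMessage("not json")); // null
console.log(parseCustomMessage(new ArrayBuffer(4))); // null
```

A handler could then return early when the result is null and switch on data.type otherwise.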

Single-speaker enforcement

Use beforeCallStart to restrict who can start a call. This example enforces single-speaker — only one connection can be the active speaker at a time:

JavaScript
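export class MyAgent extends VoiceAgent {
  // ID of the connection that currently holds the call, or null if none
  #speakerId = null;

  beforeCallStart(connection) {
    // Reject the call while another connection is the active speaker.
    if (this.#speakerId !== null) {
      return false;
    }
    this.#speakerId = connection.id;
    return true;
  }

  onCallEnd(connection) {
    // Release the slot only if this connection held it.
    if (this.#speakerId === connection.id) {
      this.#speakerId = null;
    }
  }
}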

Pipeline metrics

withVoice agents emit timing metrics after each turn:

const { metrics } = useVoiceAgent({ agent: "MyAgent" });
// metrics: {
//   llm_ms: 850,
//   tts_ms: 200,
//   first_audio_ms: 950,
//   total_ms: 1200,
// }
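
One way to use these numbers is to compare each turn against a latency budget and log the stages that exceed it. The helper below is a sketch that relies only on the field names shown above; `slowStages` and the budget values are hypothetical.

```javascript
// Hypothetical helper: lists the stages that exceeded a latency budget.
function slowStages(metrics, budget) {
  // metrics is null until the first turn completes.
  if (!metrics) return [];
  return Object.keys(budget).filter(
    (stage) => typeof metrics[stage] === "number" && metrics[stage] > budget[stage],
  );
}

const metrics = { llm_ms: 850, tts_ms: 200, first_audio_ms: 950, total_ms: 1200 };
// The budget values are illustrative assumptions, not recommendations.
const budget = { llm_ms: 700, first_audio_ms: 1000 };
console.log(slowStages(metrics, budget)); // ["llm_ms"]
```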

Conversation history

withVoice automatically persists conversation messages to SQLite. Access history in your onTurn via context.messages, or directly:

JavaScript
const history = this.getConversationHistory(20);
this.saveMessage("assistant", "Welcome! How can I help?");

History survives Durable Object restarts and client reconnections. Voice agents use keepAlive to prevent eviction during active calls.