
Voice agents

Build real-time voice agents with speech-to-text (STT), text-to-speech (TTS), and conversation persistence. Audio streams over WebSocket; no SFU or conferencing infrastructure required. (Beta)

Overview

@cloudflare/voice provides two server-side mixins and matching client libraries:

  • withVoice (from @cloudflare/voice): full voice agent with STT, LLM, TTS, and persistence
  • withVoiceInput (from @cloudflare/voice): STT only; transcribes without responding
  • useVoiceAgent (from @cloudflare/voice/react): React hook for withVoice agents
  • useVoiceInput (from @cloudflare/voice/react): React hook for withVoiceInput agents
  • VoiceClient (from @cloudflare/voice/client): framework-agnostic client

Built on Cloudflare Durable Objects, you get:

  • Real-time audio: microphone audio streams as binary WebSocket frames, and TTS audio streams back
  • Automatic conversation persistence: messages are stored in SQLite and survive restarts
  • Streaming TTS: LLM tokens are chunked into sentences and synthesized concurrently
  • Interruption handling: when the user starts speaking during playback, the current response is cancelled
  • Continuous STT: a separate transcriber session per call; the model detects speaking turns
  • Pipeline hooks: intercept and transform text at every stage
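The sentence chunking used for streaming TTS can be sketched as a pure function. This is a hypothetical illustration of the idea, not the library's implementation: buffer incoming LLM tokens and emit a chunk as soon as a sentence boundary appears, so synthesis can start before the LLM finishes.

```typescript
// Hypothetical sketch: buffer streamed tokens and yield complete sentences.
// A terminator (. ! ?) followed by whitespace marks a sentence boundary,
// which avoids splitting on things like "3.14".
function* chunkSentences(tokens: Iterable<string>): Generator<string> {
  let buffer = "";
  for (const token of tokens) {
    buffer += token;
    let match: RegExpMatchArray | null;
    // Emit every complete sentence currently in the buffer.
    while ((match = buffer.match(/^(.*?[.!?])\s+(.*)$/s)) !== null) {
      yield match[1];
      buffer = match[2];
    }
  }
  // Flush any trailing partial sentence once the stream ends.
  if (buffer.trim().length > 0) yield buffer.trim();
}
```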

Quick start

Install

Terminal window


npm install @cloudflare/voice agents


Server

JavaScript


import { Agent } from "agents";

import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";


const VoiceAgent = withVoice(Agent);


export class MyAgent extends VoiceAgent {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);


  async onTurn(transcript, context) {

    return "Hello! I heard you say: " + transcript;

  }

}



TypeScript


import { Agent } from "agents";

import {

  withVoice,

  WorkersAIFluxSTT,

  WorkersAITTS,

  type VoiceTurnContext,

} from "@cloudflare/voice";


const VoiceAgent = withVoice(Agent);


export class MyAgent extends VoiceAgent<Env> {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);


  async onTurn(transcript: string, context: VoiceTurnContext) {

    return "Hello! I heard you say: " + transcript;

  }

}



Client (React)


import { useVoiceAgent } from "@cloudflare/voice/react";


function VoiceUI() {

  const {

    status,

    transcript,

    interimTranscript,

    audioLevel,

    isMuted,

    startCall,

    endCall,

    toggleMute,

  } = useVoiceAgent({ agent: "MyAgent" });


  return (

    <div>

      <p>Status: {status}</p>


      <button onClick={status === "idle" ? startCall : endCall}>

        {status === "idle" ? "Start Call" : "End Call"}

      </button>


      <button onClick={toggleMute}>{isMuted ? "Unmute" : "Mute"}</button>


      {interimTranscript && (

        <p>

          <em>{interimTranscript}</em>

        </p>

      )}


      {transcript.map((msg, i) => (

        <p key={i}>

          <strong>{msg.role}:</strong> {msg.text}

        </p>

      ))}

    </div>

  );

}



Wrangler configuration

JSONC


{

  "ai": {

    "binding": "AI"

  },

  "durable_objects": {

    "bindings": [

      {

        "name": "MyAgent",

        "class_name": "MyAgent"

      }

    ]

  },

  "migrations": [

    {

      "tag": "v1",

      "new_sqlite_classes": ["MyAgent"]

    }

  ]

}



TOML


[ai]

binding = "AI"


[[durable_objects.bindings]]

name = "MyAgent"

class_name = "MyAgent"


[[migrations]]

tag = "v1"

new_sqlite_classes = [ "MyAgent" ]



How it works


Browser                              Durable Object (withVoice)

┌──────────┐                         ┌──────────────────────────┐

│ Mic      │   binary PCM (16kHz)    │ Transcriber session      │

│          │ ──────────────────────► │ (per-call, continuous)   │

│          │                         │   ↓ model detects turn   │

│          │   JSON: transcript      │ onTurn() → your LLM code │

│          │ ◄────────────────────── │   ↓ (sentence chunking)  │

│          │   binary: audio         │ TTS                      │

│ Speaker  │ ◄────────────────────── │                          │

└──────────┘                         └──────────────────────────┘



  1. The client captures microphone audio and sends it as binary WebSocket frames (16 kHz mono 16-bit PCM).
  2. Audio streams continuously to a transcriber session (created on start_call and lasting the whole call).
  3. The STT model detects when the user has finished an utterance and fires onUtterance. All providers use model-driven turn detection; the client never sends an end-of-speech signal for STT.
  4. Your onTurn() method runs, typically a single LLM call.
  5. The response is chunked into sentences and synthesized through TTS.
  6. Audio streams back to the client for playback.

While the user is speaking, the client receives transcript_interim messages containing partial results, so you can show live feedback in the UI.
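The 16-bit PCM framing in step 1 is handled by the client library internally, but the conversion from Web Audio's Float32 samples amounts to something like this sketch (illustrative only; names are hypothetical):

```typescript
// Sketch: convert Float32 samples in [-1, 1] (as produced by the Web Audio
// API) into 16-bit signed little-endian PCM, the shape of the binary
// WebSocket frames described above.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  return buffer;
}
```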

Server API: withVoice

withVoice(Agent) adds the full voice pipeline to an Agent class.

Providers

Set providers as class properties. Class field initializers run after super(), so this.env is available.

  • transcriber (Transcriber): per-call continuous STT provider
  • tts (TTSProvider): text-to-speech provider

JavaScript


import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";


const VoiceAgent = withVoice(Agent);


export class MyAgent extends VoiceAgent {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);

}


TypeScript


import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";


const VoiceAgent = withVoice(Agent);


export class MyAgent extends VoiceAgent<Env> {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);

}


To switch models at runtime (for example, a dropdown toggling between Flux and Nova 3), override createTranscriber:

JavaScript


export class MyAgent extends VoiceAgent {

  tts = new WorkersAITTS(this.env.AI);


  createTranscriber(connection) {

    return new WorkersAIFluxSTT(this.env.AI);

  }

}


TypeScript


export class MyAgent extends VoiceAgent<Env> {

  tts = new WorkersAITTS(this.env.AI);


  createTranscriber(connection: Connection): Transcriber {

    return new WorkersAIFluxSTT(this.env.AI);

  }

}


onTurn(transcript, context)

Required. Called when the user finishes an utterance and the transcription is complete.

Return a string, an AsyncIterable<string>, or a ReadableStream to stream the response.

A simple response:

JavaScript


export class MyAgent extends VoiceAgent {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);


  async onTurn(transcript, context) {

    return "You said: " + transcript;

  }

}


TypeScript


export class MyAgent extends VoiceAgent<Env> {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);


  async onTurn(transcript: string, context: VoiceTurnContext) {

    return "You said: " + transcript;

  }

}


A streaming response (the recommended approach for LLMs):

JavaScript


import { streamText } from "ai";

import { createWorkersAI } from "workers-ai-provider";


export class MyAgent extends VoiceAgent {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);


  async onTurn(transcript, context) {

    const workersai = createWorkersAI({ binding: this.env.AI });


    const result = streamText({

      model: workersai("@cf/moonshotai/kimi-k2.5"),

      system: "You are a helpful voice assistant. Keep responses concise.",

      messages: [

        ...context.messages.map((m) => ({

          role: m.role,

          content: m.content,

        })),

        { role: "user", content: transcript },

      ],

      abortSignal: context.signal,

    });


    return result.textStream;

  }

}



TypeScript


import { streamText } from "ai";

import { createWorkersAI } from "workers-ai-provider";


export class MyAgent extends VoiceAgent<Env> {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);


  async onTurn(transcript: string, context: VoiceTurnContext) {

    const workersai = createWorkersAI({ binding: this.env.AI });


    const result = streamText({

      model: workersai("@cf/moonshotai/kimi-k2.5"),

      system: "You are a helpful voice assistant. Keep responses concise.",

      messages: [

        ...context.messages.map(m => ({

          role: m.role as "user" | "assistant",

          content: m.content,

        })),

        { role: "user", content: transcript },

      ],

      abortSignal: context.signal,

    });


    return result.textStream;

  }

}



The context object provides:

  • connection (Connection): the WebSocket connection
  • messages (Array<{ role: string; content: string }>): conversation history from SQLite
  • signal (AbortSignal): aborted on interruption or disconnect

Lifecycle hooks

  • beforeCallStart(connection): return false to reject the call
  • onCallStart(connection): called after the call is accepted
  • onCallEnd(connection): called when the call ends
  • onInterrupt(connection): called when the user interrupts during playback

Pipeline hooks

Intercept and transform data at each pipeline stage. Return null to skip the current utterance.

  • afterTranscribe(transcript, connection): receives the STT text
  • beforeSynthesize(text, connection): receives the text before TTS
  • afterSynthesize(audio, text, connection): receives the audio after TTS

JavaScript




export class MyAgent extends VoiceAgent {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);


  afterTranscribe(transcript, connection) {

    if (transcript.length < 3) return null;

    return transcript;

  }


  beforeSynthesize(text, connection) {

    return text.replace(/\bAI\b/g, "A.I.");

  }


  async onTurn(transcript, context) {

    return transcript;

  }

}



TypeScript


import { type Connection } from "agents";


export class MyAgent extends VoiceAgent<Env> {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);


  afterTranscribe(transcript: string, connection: Connection) {

    if (transcript.length < 3) return null;

    return transcript;

  }


  beforeSynthesize(text: string, connection: Connection) {

    return text.replace(/\bAI\b/g, "A.I.");

  }


  async onTurn(transcript: string, context: VoiceTurnContext) {

    return transcript;

  }

}



Convenience methods

  • speak(connection, text): synthesize audio and send it to one connection
  • speakAll(text): synthesize audio and send it to all connections
  • forceEndCall(connection): programmatically end a call
  • saveMessage(role, text): persist a message to the conversation history
  • getConversationHistory(): read the conversation history from SQLite

Configuration options

Pass options as the second argument to withVoice():

JavaScript


const VoiceAgent = withVoice(Agent, {

  historyLimit: 20,

  audioFormat: "mp3",

  maxMessageCount: 1000,

});


TypeScript


const VoiceAgent = withVoice(Agent, {

  historyLimit: 20,

  audioFormat: "mp3",

  maxMessageCount: 1000,

});


  • historyLimit (number, default 20): maximum messages loaded into context
  • audioFormat (string, default "mp3"): audio format sent to the client
  • maxMessageCount (number, default 1000): maximum messages stored in SQLite
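The interplay of historyLimit and maxMessageCount can be made concrete with a small sketch (hypothetical helpers, not the library's code): storage keeps at most maxMessageCount rows, and each turn loads only the most recent historyLimit of them.

```typescript
interface StoredMessage {
  role: string;
  content: string;
}

// Sketch: cap stored history on write, mirroring maxMessageCount.
// The oldest messages are dropped once the cap is exceeded.
function appendMessage(
  store: StoredMessage[],
  msg: StoredMessage,
  maxMessageCount: number,
): StoredMessage[] {
  const next = [...store, msg];
  return next.slice(Math.max(0, next.length - maxMessageCount));
}

// Sketch: load only the newest historyLimit messages into the turn context.
function loadContext(store: StoredMessage[], historyLimit: number): StoredMessage[] {
  return store.slice(Math.max(0, store.length - historyLimit));
}
```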

Server API: withVoiceInput

withVoiceInput(Agent) adds STT-only voice input: no TTS, no LLM, no generated responses. Useful for dictation, voice search, or any UI that needs speech-to-text without a conversational agent.

JavaScript


import { Agent } from "agents";

import { withVoiceInput, WorkersAINova3STT } from "@cloudflare/voice";


const InputAgent = withVoiceInput(Agent);


export class DictationAgent extends InputAgent {

  transcriber = new WorkersAINova3STT(this.env.AI);


  onTranscript(text, connection) {

    console.log("User said:", text);

  }

}



TypeScript


import { Agent } from "agents";

import { withVoiceInput, WorkersAINova3STT } from "@cloudflare/voice";


const InputAgent = withVoiceInput(Agent);


export class DictationAgent extends InputAgent<Env> {

  transcriber = new WorkersAINova3STT(this.env.AI);


  onTranscript(text: string, connection: Connection) {

    console.log("User said:", text);

  }

}



onTranscript(text, connection)

Called after each utterance is transcribed. Override it to handle the transcript text.

Hooks

withVoiceInput supports the same lifecycle hooks as withVoice:

  • beforeCallStart(connection): return false to reject the call
  • onCallStart(connection), onCallEnd(connection), onInterrupt(connection)
  • createTranscriber(connection): override to switch models at runtime
  • afterTranscribe(transcript, connection): filter or transform the transcript

There are no TTS hooks (beforeSynthesize, afterSynthesize) and no onTurn.

Client API: React hooks

useVoiceAgent

Wraps VoiceClient for use with withVoice agents. Manages the connection, microphone capture, playback, silence detection, and interruption detection.


import { useVoiceAgent } from "@cloudflare/voice/react";


const {

  status, // "idle" | "listening" | "thinking" | "speaking"

  transcript, // TranscriptMessage[] — conversation history

  interimTranscript, // string | null — real-time partial transcript

  metrics, // VoicePipelineMetrics | null

  audioLevel, // number (0–1) — current mic RMS level

  isMuted, // boolean

  connected, // boolean — WebSocket connected

  error, // string | null

  startCall, // () => Promise<void>

  endCall, // () => void

  toggleMute, // () => void

  sendText, // (text: string) => void — bypass STT

  sendJSON, // (data: Record<string, unknown>) => void

  lastCustomMessage, // unknown — last non-voice message from server

} = useVoiceAgent({

  agent: "MyAgent",

  name: "default",

  host: window.location.host,

});



Tuning options

  • silenceThreshold (number, default 0.04): RMS below this counts as silence
  • silenceDurationMs (number, default 500): milliseconds of silence before end_of_speech fires
  • interruptThreshold (number, default 0.05): RMS threshold for detecting speech during playback
  • interruptChunks (number, default 2): consecutive high-RMS chunks required to trigger an interruption

Changing a tuning option triggers a client reconnect (the connection key includes these parameters).
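These options all operate on the microphone's RMS level. A sketch of the detection logic, for intuition only (hypothetical functions, not the hook's actual code):

```typescript
// Root-mean-square level of one audio chunk.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// End of speech: every chunk in a window covering at least
// silenceDurationMs stays below silenceThreshold.
function isEndOfSpeech(
  chunkLevels: number[],
  chunkMs: number,
  silenceThreshold = 0.04,
  silenceDurationMs = 500,
): boolean {
  const needed = Math.ceil(silenceDurationMs / chunkMs);
  const recent = chunkLevels.slice(-needed);
  return recent.length >= needed && recent.every((l) => l < silenceThreshold);
}

// Interruption: interruptChunks consecutive chunks exceed
// interruptThreshold while the agent is speaking.
function isInterruption(
  chunkLevels: number[],
  interruptThreshold = 0.05,
  interruptChunks = 2,
): boolean {
  const recent = chunkLevels.slice(-interruptChunks);
  return recent.length >= interruptChunks && recent.every((l) => l > interruptThreshold);
}
```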

useVoiceInput

A lightweight hook for dictation and speech-to-text. Accumulates all utterances into a single string.


import { useVoiceInput } from "@cloudflare/voice/react";


function Dictation() {

  const {

    transcript, // string — accumulated text from all utterances

    interimTranscript, // string | null — current partial transcript

    isListening, // boolean

    audioLevel, // number (0–1)

    isMuted, // boolean

    error, // string | null

    start, // () => Promise<void>

    stop, // () => void

    toggleMute, // () => void

    clear, // () => void — clear accumulated transcript

  } = useVoiceInput({ agent: "DictationAgent" });


  return (

    <div>

      <textarea

        value={

          transcript + (interimTranscript ? " " + interimTranscript : "")

        }

        readOnly

      />

      <button onClick={isListening ? stop : start}>

        {isListening ? "Stop" : "Dictate"}

      </button>

    </div>

  );

}



Client API: VoiceClient

A framework-agnostic client for non-React environments.

JavaScript


import { VoiceClient } from "@cloudflare/voice/client";


const client = new VoiceClient({ agent: "MyAgent" });


client.addEventListener("statuschange", (status) => {

  console.log("Status:", status);

});


client.addEventListener("transcriptchange", (messages) => {

  console.log("Transcript:", messages);

});


client.addEventListener("error", (err) => {

  console.error("Error:", err);

});


client.connect();

await client.startCall();


// Later:

client.endCall();

client.disconnect();



TypeScript


import { VoiceClient } from "@cloudflare/voice/client";


const client = new VoiceClient({ agent: "MyAgent" });


client.addEventListener("statuschange", (status) => {

  console.log("Status:", status);

});


client.addEventListener("transcriptchange", (messages) => {

  console.log("Transcript:", messages);

});


client.addEventListener("error", (err) => {

  console.error("Error:", err);

});


client.connect();

await client.startCall();


// Later:

client.endCall();

client.disconnect();



Events

  • statuschange (VoiceStatus): the pipeline status changed
  • transcriptchange (TranscriptMessage[]): the transcript was updated
  • interimtranscript (string | null): interim transcript from streaming STT
  • metricschange (VoicePipelineMetrics): pipeline timing metrics
  • audiolevelchange (number): microphone audio level (0–1)
  • connectionchange (boolean): WebSocket connected or disconnected
  • mutechange (boolean): mute state changed
  • error (string | null): an error occurred
  • custommessage (unknown): a non-voice message from the server

Advanced options

  • transport (VoiceTransport): custom transport (defaults to WebSocket via PartySocket)
  • audioInput (VoiceAudioInput): custom microphone capture (defaults to the built-in AudioWorklet)
  • preferredFormat (VoiceAudioFormat): a hint for the server audio format (advisory only)

Providers

Built-in (Workers AI)

No API key needed; these use your Workers AI binding:

  • WorkersAIFluxSTT: continuous STT, default model @cf/deepgram/flux, recommended for withVoice
  • WorkersAINova3STT: continuous STT, default model @cf/deepgram/nova-3, recommended for withVoiceInput
  • WorkersAITTS: TTS, default model @cf/deepgram/aura-1, works with either

JavaScript


import { Agent } from "agents";

import {

  withVoice,

  WorkersAIFluxSTT,

  WorkersAINova3STT,

  WorkersAITTS,

} from "@cloudflare/voice";


const VoiceAgent = withVoice(Agent);


// Default usage

export class MyAgent extends VoiceAgent {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);

}


// Custom options

export class CustomAgent extends VoiceAgent {

  transcriber = new WorkersAIFluxSTT(this.env.AI, {

    eotThreshold: 0.8,

    keyterms: ["Cloudflare", "Workers"],

  });

  tts = new WorkersAITTS(this.env.AI, {

    model: "@cf/deepgram/aura-1",

    speaker: "asteria",

  });

}



TypeScript


import { Agent } from "agents";

import {

  withVoice,

  WorkersAIFluxSTT,

  WorkersAINova3STT,

  WorkersAITTS,

} from "@cloudflare/voice";


const VoiceAgent = withVoice(Agent);


// Default usage

export class MyAgent extends VoiceAgent<Env> {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new WorkersAITTS(this.env.AI);

}


// Custom options

export class CustomAgent extends VoiceAgent<Env> {

  transcriber = new WorkersAIFluxSTT(this.env.AI, {

    eotThreshold: 0.8,

    keyterms: ["Cloudflare", "Workers"],

  });

  tts = new WorkersAITTS(this.env.AI, {

    model: "@cf/deepgram/aura-1",

    speaker: "asteria",

  });

}



Third-party providers

  • @cloudflare/voice-deepgram: DeepgramSTT, continuous STT
  • @cloudflare/voice-elevenlabs: ElevenLabsTTS, high-quality TTS
  • @cloudflare/voice-twilio: TwilioAdapter, telephony (phone calls)

ElevenLabs TTS:

JavaScript


import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";


export class MyAgent extends VoiceAgent {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new ElevenLabsTTS({

    apiKey: this.env.ELEVENLABS_API_KEY,

    voiceId: "21m00Tcm4TlvDq8ikWAM",

  });

}


TypeScript


import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";


export class MyAgent extends VoiceAgent<Env> {

  transcriber = new WorkersAIFluxSTT(this.env.AI);

  tts = new ElevenLabsTTS({

    apiKey: this.env.ELEVENLABS_API_KEY,

    voiceId: "21m00Tcm4TlvDq8ikWAM",

  });

}


Deepgram STT:

JavaScript


import { DeepgramSTT } from "@cloudflare/voice-deepgram";


export class MyAgent extends VoiceAgent {

  transcriber = new DeepgramSTT({

    apiKey: this.env.DEEPGRAM_API_KEY,

  });

  tts = new WorkersAITTS(this.env.AI);

}


TypeScript


import { DeepgramSTT } from "@cloudflare/voice-deepgram";


export class MyAgent extends VoiceAgent<Env> {

  transcriber = new DeepgramSTT({

    apiKey: this.env.DEEPGRAM_API_KEY,

  });

  tts = new WorkersAITTS(this.env.AI);

}


Telephony (Twilio)

Route phone calls to your voice agent through the Twilio adapter:

Terminal window


npm install @cloudflare/voice-twilio


The adapter bridges Twilio Media Streams to your VoiceAgent:


Phone → Twilio → WebSocket → TwilioAdapter → WebSocket → VoiceAgent


WorkersAITTS returns MP3, which cannot be decoded to PCM in the Workers runtime. When using the Twilio adapter, use a TTS provider that outputs raw PCM (for example ElevenLabs with outputFormat: "pcm_16000").

Text messages

withVoice agents can also receive text messages, skipping STT entirely. This is useful for offering chat-style input alongside voice.


const { sendText } = useVoiceAgent({ agent: "MyAgent" });


// Send text — goes straight to onTurn() without STT

sendText("What is the weather like today?");


Text messages work both during and outside a call. During a call, the response is spoken via TTS. Outside a call, the response is sent as a plain-text transcript message.

Custom messages

Send and receive application-level JSON messages alongside the voice protocol messages. Non-voice messages are passed to the server's onMessage handler and fire the custommessage event on the client.

Server:

JavaScript


export class MyAgent extends VoiceAgent {

  onMessage(connection, message) {

    const data = JSON.parse(message);

    if (data.type === "kick_speaker") {

      this.forceEndCall(connection);

    }

  }

}


TypeScript


export class MyAgent extends VoiceAgent<Env> {

  onMessage(connection: Connection, message: WSMessage) {

    const data = JSON.parse(message as string);

    if (data.type === "kick_speaker") {

      this.forceEndCall(connection);

    }

  }

}


Client:


const { sendJSON, lastCustomMessage } = useVoiceAgent({ agent: "MyAgent" });


sendJSON({ type: "kick_speaker" });


useEffect(() => {

  if (lastCustomMessage) {

    console.log("Custom message:", lastCustomMessage);

  }

}, [lastCustomMessage]);


Single-speaker enforcement

beforeCallStart restricts who can start a call. This example enforces a single speaker: only one connection may be the active speaker at a time:

JavaScript




export class MyAgent extends VoiceAgent {

  #speakerId = null;


  beforeCallStart(connection) {

    if (this.#speakerId !== null) {

      return false;

    }

    this.#speakerId = connection.id;

    return true;

  }


  onCallEnd(connection) {

    if (this.#speakerId === connection.id) {

      this.#speakerId = null;

    }

  }

}



TypeScript


import { type Connection } from "agents";


export class MyAgent extends VoiceAgent<Env> {

  #speakerId: string | null = null;


  beforeCallStart(connection: Connection) {

    if (this.#speakerId !== null) {

      return false;

    }

    this.#speakerId = connection.id;

    return true;

  }


  onCallEnd(connection: Connection) {

    if (this.#speakerId === connection.id) {

      this.#speakerId = null;

    }

  }

}



Pipeline metrics

withVoice agents emit timing metrics after each speaking turn:


const { metrics } = useVoiceAgent({ agent: "MyAgent" });


// metrics: {

//   llm_ms: 850,

//   tts_ms: 200,

//   first_audio_ms: 950,

//   total_ms: 1200,

// }


Conversation history

withVoice automatically persists conversation messages to SQLite. Access the history via context.messages inside onTurn, or call directly:

JavaScript


const history = this.getConversationHistory(20);


this.saveMessage("assistant", "Welcome! How can I help?");


TypeScript


const history = this.getConversationHistory(20);


this.saveMessage("assistant", "Welcome! How can I help?");


History survives Durable Object restarts and client reconnects. Voice agents use keepAlive during calls to prevent eviction.