Build a voice agent
Build a voice agent that listens to the user, thinks with an LLM, and replies with speech, all in real time over WebSocket. This feature is in beta.
By the end of this guide, you will have:
- A server-side voice agent with speech-to-text and text-to-speech
- An LLM-powered onTurn handler that streams its responses
- Tools the agent can call during the conversation
- A React client with a push-to-talk style UI
Prerequisites
- A Cloudflare account with access to Workers AI
- Node.js 18+
1. Create the project
Scaffold a new Workers project with Vite and React, then install the voice package:
```sh
npm create cloudflare@latest voice-agent -- --template cloudflare/agents-starter
cd voice-agent
npm install @cloudflare/voice
```
The starter template gives you a working Vite + React + Cloudflare Workers setup; you will replace its server and client code in the steps that follow.
2. Configure wrangler
Update wrangler.jsonc to add a Workers AI binding and a Durable Object for your voice agent:
```jsonc
{
  "name": "voice-agent",
  // Set this to today's date
  "compatibility_date": "2026-04-29",
  "compatibility_flags": ["nodejs_compat"],
  "main": "src/server.ts",
  "ai": {
    "binding": "AI"
  },
  "durable_objects": {
    "bindings": [
      {
        "name": "MyVoiceAgent",
        "class_name": "MyVoiceAgent"
      }
    ]
  },
  "migrations": [
    {
      "tag": "v1",
      "new_sqlite_classes": ["MyVoiceAgent"]
    }
  ]
}
```
The equivalent wrangler.toml:
```toml
name = "voice-agent"
# Set this to today's date
compatibility_date = "2026-04-29"
compatibility_flags = [ "nodejs_compat" ]
main = "src/server.ts"

[ai]
binding = "AI"

[[durable_objects.bindings]]
name = "MyVoiceAgent"
class_name = "MyVoiceAgent"

[[migrations]]
tag = "v1"
new_sqlite_classes = [ "MyVoiceAgent" ]
```
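If you use the TypeScript examples below, regenerate the Env binding types whenever this config changes. wrangler ships a typegen command for this:

```sh
npx wrangler types
```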
3. Build the server
Replace src/server.ts with the following. The withVoice mixin adds a complete voice pipeline to a standard Agent class: STT, sentence chunking, TTS, and conversation persistence.
```js
import { Agent, routeAgentRequest } from "agents";
import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";
import { streamText, tool, stepCountIs } from "ai";
import { createWorkersAI } from "workers-ai-provider";
import { z } from "zod";

const VoiceAgent = withVoice(Agent);

export class MyVoiceAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript, context) {
    const workersAi = createWorkersAI({ binding: this.env.AI });
    const result = streamText({
      model: workersAi("@cf/moonshotai/kimi-k2.5"),
      system:
        "You are a helpful voice assistant. Keep responses concise — you are being spoken aloud.",
      messages: [
        ...context.messages.map((m) => ({
          role: m.role,
          content: m.content,
        })),
        { role: "user", content: transcript },
      ],
      tools: {
        get_current_time: tool({
          description: "Get the current date and time.",
          inputSchema: z.object({}),
          execute: async () => ({
            time: new Date().toLocaleTimeString("en-US", {
              hour: "2-digit",
              minute: "2-digit",
            }),
          }),
        }),
      },
      stopWhen: stepCountIs(3),
      abortSignal: context.signal,
    });
    return result.textStream;
  }

  async onCallStart(connection) {
    await this.speak(connection, "Hi there! How can I help you today?");
  }
}

export default {
  async fetch(request, env) {
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  },
};
```
The same server in TypeScript:
```ts
import { Agent, routeAgentRequest, type Connection } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAITTS,
  type VoiceTurnContext,
} from "@cloudflare/voice";
import { streamText, tool, stepCountIs } from "ai";
import { createWorkersAI } from "workers-ai-provider";
import { z } from "zod";

const VoiceAgent = withVoice(Agent);

export class MyVoiceAgent extends VoiceAgent<Env> {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    const workersAi = createWorkersAI({ binding: this.env.AI });
    const result = streamText({
      model: workersAi("@cf/moonshotai/kimi-k2.5"),
      system:
        "You are a helpful voice assistant. Keep responses concise — you are being spoken aloud.",
      messages: [
        ...context.messages.map((m) => ({
          role: m.role as "user" | "assistant",
          content: m.content,
        })),
        { role: "user" as const, content: transcript },
      ],
      tools: {
        get_current_time: tool({
          description: "Get the current date and time.",
          inputSchema: z.object({}),
          execute: async () => ({
            time: new Date().toLocaleTimeString("en-US", {
              hour: "2-digit",
              minute: "2-digit",
            }),
          }),
        }),
      },
      stopWhen: stepCountIs(3),
      abortSignal: context.signal,
    });
    return result.textStream;
  }

  async onCallStart(connection: Connection) {
    await this.speak(connection, "Hi there! How can I help you today?");
  }
}

export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  },
} satisfies ExportedHandler<Env>;
```
Key points:
- WorkersAIFluxSTT handles continuous speech-to-text; the model detects when the user has finished speaking.
- WorkersAITTS converts the LLM response to audio sentence by sentence.
- onTurn receives the transcript and returns a stream. The mixin takes care of splitting that stream into sentences and synthesizing each one.
- onCallStart sends a greeting when the user connects.
- context.messages contains the full conversation history from SQLite.
- context.signal is aborted when the user interrupts or disconnects (see the sketch after this list).
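Because onTurn hands context.signal to streamText as abortSignal, the AI SDK also forwards it into each tool's execute options, so long-running tool work stops when the user barges in. A minimal sketch; the weather endpoint is hypothetical:

```ts
import { tool } from "ai";
import { z } from "zod";

// Drop this into the `tools` map inside onTurn. The AI SDK forwards the
// abortSignal passed to streamText into each tool's execute options, so an
// interruption or disconnect cancels the in-flight fetch.
const get_weather = tool({
  description: "Get the current weather for a city.",
  inputSchema: z.object({ city: z.string() }),
  execute: async ({ city }, { abortSignal }) => {
    // Hypothetical endpoint; substitute your own weather API.
    const res = await fetch(
      `https://api.example.com/weather?city=${encodeURIComponent(city)}`,
      { signal: abortSignal },
    );
    return { weather: await res.json() };
  },
});
```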
4. Build the client
Replace src/client.tsx with a React component that uses the useVoiceAgent hook. The hook manages the WebSocket connection, microphone capture, audio playback, and interruption detection.
```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";

function App() {
  const {
    status,
    transcript,
    interimTranscript,
    metrics,
    audioLevel,
    isMuted,
    startCall,
    endCall,
    toggleMute,
  } = useVoiceAgent({ agent: "MyVoiceAgent" });

  return (
    <div>
      <h1>Voice Agent</h1>
      <p>Status: {status}</p>
      <div>
        <button onClick={status === "idle" ? startCall : endCall}>
          {status === "idle" ? "Start Call" : "End Call"}
        </button>
        {status !== "idle" && (
          <button onClick={toggleMute}>{isMuted ? "Unmute" : "Mute"}</button>
        )}
      </div>
      {interimTranscript && (
        <p>
          <em>{interimTranscript}</em>
        </p>
      )}
      {transcript.map((msg, i) => (
        <p key={i}>
          <strong>{msg.role}:</strong> {msg.text}
        </p>
      ))}
      {metrics && (
        <p>
          LLM: {metrics.llm_ms}ms | TTS: {metrics.tts_ms}ms | First audio:{" "}
          {metrics.first_audio_ms}ms
        </p>
      )}
    </div>
  );
}

export default App;
```
The status field cycles through "idle" → "listening" → "thinking" → "speaking" → "listening", which gives you everything you need to build a responsive UI.
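For friendlier labels than the raw status values, a small mapping component is enough. A sketch, assuming only the status values listed above:

```tsx
// Map raw status values to display labels; unknown values fall through as-is.
const STATUS_LABELS: Record<string, string> = {
  idle: "Ready",
  listening: "Listening…",
  thinking: "Thinking…",
  speaking: "Speaking…",
};

function StatusBadge({ status }: { status: string }) {
  return <span>{STATUS_LABELS[status] ?? status}</span>;
}
```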
5. Run it
```sh
npm run dev
```
Open the app in your browser, select Start Call, and start talking. You will see the transcript appear in real time and hear the agent's responses through your speakers.
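Once it works locally, you can deploy the Worker. Assuming the starter template's default scripts:

```sh
npm run deploy
```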
Add pipeline hooks
You can intercept and transform data at every stage of the pipeline. For example, drop transcripts that are too short (likely noise) and adjust pronunciation before TTS:
```js
export class MyVoiceAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  afterTranscribe(transcript, connection) {
    if (transcript.length < 3) return null;
    return transcript;
  }

  beforeSynthesize(text, connection) {
    return text.replace(/\bAI\b/g, "A.I.");
  }

  async onTurn(transcript, context) {
    return "You said: " + transcript;
  }
}
```
The same hooks in TypeScript:
```ts
export class MyVoiceAgent extends VoiceAgent<Env> {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  afterTranscribe(transcript: string, connection: Connection) {
    if (transcript.length < 3) return null;
    return transcript;
  }

  beforeSynthesize(text: string, connection: Connection) {
    return text.replace(/\bAI\b/g, "A.I.");
  }

  async onTurn(transcript: string, context: VoiceTurnContext) {
    return "You said: " + transcript;
  }
}
```
Returning null from afterTranscribe discards the utterance entirely, which is useful for filtering out noise or very short transcripts.
Use third-party providers
You can swap in a third-party STT or TTS provider without changing your agent logic:
```js
import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";
import { DeepgramSTT } from "@cloudflare/voice-deepgram";

export class MyVoiceAgent extends VoiceAgent {
  transcriber = new DeepgramSTT({
    apiKey: this.env.DEEPGRAM_API_KEY,
  });
  tts = new ElevenLabsTTS({
    apiKey: this.env.ELEVENLABS_API_KEY,
    voiceId: "21m00Tcm4TlvDq8ikWAM",
  });

  async onTurn(transcript, context) {
    return "You said: " + transcript;
  }
}
```
The same providers in TypeScript:
```ts
import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";
import { DeepgramSTT } from "@cloudflare/voice-deepgram";

export class MyVoiceAgent extends VoiceAgent<Env> {
  transcriber = new DeepgramSTT({
    apiKey: this.env.DEEPGRAM_API_KEY,
  });
  tts = new ElevenLabsTTS({
    apiKey: this.env.ELEVENLABS_API_KEY,
    voiceId: "21m00Tcm4TlvDq8ikWAM",
  });

  async onTurn(transcript: string, context: VoiceTurnContext) {
    return "You said: " + transcript;
  }
}
```
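Both providers read their API keys from environment bindings. For a deployed Worker you would typically store these as secrets (for local development, put them in a .dev.vars file):

```sh
npx wrangler secret put DEEPGRAM_API_KEY
npx wrangler secret put ELEVENLABS_API_KEY
```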
Next steps
- Voice Agents API reference: the complete reference for withVoice, withVoiceInput, the React hooks, VoiceClient, and all providers.
- Chat Agents: build text-based AI chat with AIChatAgent and useAgentChat.
- Using AI models: use Workers AI, OpenAI, Anthropic, Gemini, or any other provider from your agents.