
Build a Voice Agent

Build a voice agent that listens to the user, thinks with an LLM, and replies with speech, all in real time over WebSockets. (Beta)

By the end of this guide, you will have:

  • A server-side voice agent with speech-to-text and text-to-speech
  • An LLM-powered onTurn handler that streams responses
  • Tools the agent can call during the conversation
  • A React client with a push-to-talk-style UI

Prerequisites

  • A Cloudflare account with Workers AI access
  • Node.js 18+

1. Create the project

Scaffold a new Workers project with Vite and React, then install the voice package:

Terminal window

npm create cloudflare@latest voice-agent -- --template cloudflare/agents-starter
cd voice-agent
npm install @cloudflare/voice


The starter template gives you a working Vite + React + Cloudflare Workers setup. You will replace the server and client code in the steps that follow.

2. Configure wrangler

Update wrangler.jsonc to include a Workers AI binding and a Durable Object for your voice agent:

JSONC

{
  "name": "voice-agent",
  // Set this to today's date
  "compatibility_date": "2026-04-29",
  "compatibility_flags": ["nodejs_compat"],
  "main": "src/server.ts",
  "ai": {
    "binding": "AI"
  },
  "durable_objects": {
    "bindings": [
      {
        "name": "MyVoiceAgent",
        "class_name": "MyVoiceAgent"
      }
    ]
  },
  "migrations": [
    {
      "tag": "v1",
      "new_sqlite_classes": ["MyVoiceAgent"]
    }
  ]
}



TOML

name = "voice-agent"
# Set this to today's date
compatibility_date = "2026-04-29"
compatibility_flags = [ "nodejs_compat" ]
main = "src/server.ts"

[ai]
binding = "AI"

[[durable_objects.bindings]]
name = "MyVoiceAgent"
class_name = "MyVoiceAgent"

[[migrations]]
tag = "v1"
new_sqlite_classes = [ "MyVoiceAgent" ]


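The TypeScript variants below rely on an Env type covering the AI binding and the MyVoiceAgent Durable Object namespace. If your project does not already generate one, Wrangler can write it for you after you edit the config:

Terminal window

npx wrangler types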

3. Build the server

Replace src/server.ts with the following. The withVoice mixin adds the full voice pipeline to a standard Agent class: STT, sentence chunking, TTS, and conversation persistence.

JavaScript

import { Agent, routeAgentRequest } from "agents";
import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";
import { streamText, tool, stepCountIs } from "ai";
import { createWorkersAI } from "workers-ai-provider";
import { z } from "zod";

const VoiceAgent = withVoice(Agent);

export class MyVoiceAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript, context) {
    const workersAi = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: workersAi("@cf/moonshotai/kimi-k2.5"),
      system:
        "You are a helpful voice assistant. Keep responses concise — you are being spoken aloud.",
      messages: [
        ...context.messages.map((m) => ({
          role: m.role,
          content: m.content,
        })),
        { role: "user", content: transcript },
      ],
      tools: {
        get_current_time: tool({
          description: "Get the current date and time.",
          inputSchema: z.object({}),
          execute: async () => ({
            time: new Date().toLocaleTimeString("en-US", {
              hour: "2-digit",
              minute: "2-digit",
            }),
          }),
        }),
      },
      stopWhen: stepCountIs(3),
      abortSignal: context.signal,
    });

    return result.textStream;
  }

  async onCallStart(connection) {
    await this.speak(connection, "Hi there! How can I help you today?");
  }
}

export default {
  async fetch(request, env) {
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  },
};



TypeScript

import { Agent, routeAgentRequest, type Connection } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAITTS,
  type VoiceTurnContext,
} from "@cloudflare/voice";
import { streamText, tool, stepCountIs } from "ai";
import { createWorkersAI } from "workers-ai-provider";
import { z } from "zod";

const VoiceAgent = withVoice(Agent);

export class MyVoiceAgent extends VoiceAgent<Env> {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    const workersAi = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: workersAi("@cf/moonshotai/kimi-k2.5"),
      system:
        "You are a helpful voice assistant. Keep responses concise — you are being spoken aloud.",
      messages: [
        ...context.messages.map((m) => ({
          role: m.role as "user" | "assistant",
          content: m.content,
        })),
        { role: "user" as const, content: transcript },
      ],
      tools: {
        get_current_time: tool({
          description: "Get the current date and time.",
          inputSchema: z.object({}),
          execute: async () => ({
            time: new Date().toLocaleTimeString("en-US", {
              hour: "2-digit",
              minute: "2-digit",
            }),
          }),
        }),
      },
      stopWhen: stepCountIs(3),
      abortSignal: context.signal,
    });

    return result.textStream;
  }

  async onCallStart(connection: Connection) {
    await this.speak(connection, "Hi there! How can I help you today?");
  }
}

export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  },
} satisfies ExportedHandler<Env>;



Key points:

  • WorkersAIFluxSTT handles continuous speech-to-text; the model detects when the user has finished speaking.
  • WorkersAITTS converts LLM responses to audio, sentence by sentence.
  • onTurn receives the transcript and returns a stream. The mixin takes care of splitting that stream into sentences and synthesizing each one.
  • onCallStart sends a greeting when the user connects.
  • context.messages contains the full conversation history from SQLite.
  • context.signal is aborted when the user interrupts or disconnects (see the sketch after this list).
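
Anything long-running inside onTurn can observe that same signal so work stops as soon as the user barges in. As a minimal sketch, here is a hypothetical extra tool for the tools map in step 3; the get_weather name and the wttr.in lookup are illustrative, not part of the starter:

TypeScript

// Hypothetical addition to the `tools` map inside onTurn,
// where `context` is in scope.
get_weather: tool({
  description: "Get the current weather for a city.",
  inputSchema: z.object({ city: z.string() }),
  execute: async ({ city }) => {
    // Forward the turn's abort signal so an interruption cancels
    // the outbound request instead of letting it run to completion.
    const res = await fetch(
      `https://wttr.in/${encodeURIComponent(city)}?format=j1`,
      { signal: context.signal },
    );
    return { weather: await res.json() };
  },
}),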

4. Build the client

Replace src/client.tsx with a React component that uses the useVoiceAgent hook. The hook manages the WebSocket connection, microphone capture, audio playback, and interruption detection.


import { useVoiceAgent } from "@cloudflare/voice/react";

function App() {
  const {
    status,
    transcript,
    interimTranscript,
    metrics,
    audioLevel,
    isMuted,
    startCall,
    endCall,
    toggleMute,
  } = useVoiceAgent({ agent: "MyVoiceAgent" });

  return (
    <div>
      <h1>Voice Agent</h1>
      <p>Status: {status}</p>

      <div>
        <button onClick={status === "idle" ? startCall : endCall}>
          {status === "idle" ? "Start Call" : "End Call"}
        </button>
        {status !== "idle" && (
          <button onClick={toggleMute}>
            {isMuted ? "Unmute" : "Mute"}
          </button>
        )}
      </div>

      {interimTranscript && (
        <p>
          <em>{interimTranscript}</em>
        </p>
      )}

      {transcript.map((msg, i) => (
        <p key={i}>
          <strong>{msg.role}:</strong> {msg.text}
        </p>
      ))}

      {metrics && (
        <p>
          LLM: {metrics.llm_ms}ms | TTS: {metrics.tts_ms}ms | First
          audio: {metrics.first_audio_ms}ms
        </p>
      )}
    </div>
  );
}

// Export so the app's entry point can render the component.
export default App;



The status field cycles through "idle" → "listening" → "thinking" → "speaking" → back to "listening", giving you everything you need to build a responsive UI.
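
One simple way to use it is to map each status to a friendly label (a minimal sketch, assuming only the four status values above):

TypeScript

// Map raw status values to labels for the UI.
const STATUS_LABELS: Record<string, string> = {
  idle: "Tap Start Call to begin",
  listening: "Listening…",
  thinking: "Thinking…",
  speaking: "Speaking…",
};

// Inside App's JSX:
// <p>{STATUS_LABELS[status] ?? status}</p>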

5. Run it

Terminal window

npm run dev


Open the app in your browser, select Start Call, and start talking. You will see the transcript appear in real time, and the agent's responses will play through your speakers.

Add pipeline hooks

You can intercept and transform data at every stage of the pipeline. For example, drop transcripts that are too short (noise) and adjust pronunciation before TTS:

JavaScript

export class MyVoiceAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  afterTranscribe(transcript, connection) {
    if (transcript.length < 3) return null;
    return transcript;
  }

  beforeSynthesize(text, connection) {
    return text.replace(/\bAI\b/g, "A.I.");
  }

  async onTurn(transcript, context) {
    return "You said: " + transcript;
  }
}



TypeScript

export class MyVoiceAgent extends VoiceAgent<Env> {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  afterTranscribe(transcript: string, connection: Connection) {
    if (transcript.length < 3) return null;
    return transcript;
  }

  beforeSynthesize(text: string, connection: Connection) {
    return text.replace(/\bAI\b/g, "A.I.");
  }

  async onTurn(transcript: string, context: VoiceTurnContext) {
    return "You said: " + transcript;
  }
}



Returning null from afterTranscribe drops the utterance entirely, which is useful for filtering out noise or very short transcripts.
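
The same contract supports richer gating. For example, a hypothetical wake-word filter (a sketch built on the drop/transform behavior described above, not part of the library):

TypeScript

afterTranscribe(transcript: string, connection: Connection) {
  // Respond only when the utterance starts with the wake word,
  // and strip the wake word before the text reaches onTurn.
  if (!/^hey agent\b/i.test(transcript)) return null;
  return transcript.replace(/^hey agent\b[,.!?]?\s*/i, "");
}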

Use a third-party provider

Swap in a third-party STT or TTS provider without changing your agent logic:

JavaScript

import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";
import { DeepgramSTT } from "@cloudflare/voice-deepgram";

export class MyVoiceAgent extends VoiceAgent {
  transcriber = new DeepgramSTT({
    apiKey: this.env.DEEPGRAM_API_KEY,
  });

  tts = new ElevenLabsTTS({
    apiKey: this.env.ELEVENLABS_API_KEY,
    voiceId: "21m00Tcm4TlvDq8ikWAM",
  });

  async onTurn(transcript, context) {
    return "You said: " + transcript;
  }
}



TypeScript

import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";
import { DeepgramSTT } from "@cloudflare/voice-deepgram";

export class MyVoiceAgent extends VoiceAgent<Env> {
  transcriber = new DeepgramSTT({
    apiKey: this.env.DEEPGRAM_API_KEY,
  });

  tts = new ElevenLabsTTS({
    apiKey: this.env.ELEVENLABS_API_KEY,
    voiceId: "21m00Tcm4TlvDq8ikWAM",
  });

  async onTurn(transcript: string, context: VoiceTurnContext) {
    return "You said: " + transcript;
  }
}


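Both providers read their API keys from env bindings, so DEEPGRAM_API_KEY and ELEVENLABS_API_KEY (the names assumed above) need to be defined. With Wrangler these are typically stored as secrets; for local development, the same names can go in a .dev.vars file:

Terminal window

npx wrangler secret put DEEPGRAM_API_KEY
npx wrangler secret put ELEVENLABS_API_KEY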

Next steps

Voice agent API reference: the complete reference for withVoice, withVoiceInput, the React hook, VoiceClient, and all providers.

Chat agents: build text-based AI chat with AIChatAgent and useAgentChat.

Using AI models: use Workers AI, OpenAI, Anthropic, Gemini, or any provider in your agents.