Build a Voice AI Agent in 30 Minutes with the ElevenLabs Skill
Step-by-step tutorial for shipping a working voice AI agent using the official ElevenLabs voice-agents skill. Covers persona, tools, streaming, and turn-taking.
A year ago, scaffolding a real-time voice agent was a multi-week project. You needed to wire WebSocket streams, handle turn-detection, manage barge-in, integrate a tool layer, deal with audio buffering on both ends, and somehow make the latency acceptable. Most teams gave up after the prototype.
The official ElevenLabs Voice Agents skill compresses that wiring layer into something your AI coding assistant can scaffold for you. This tutorial walks through building a working voice agent — a customer-support bot for a fictional SaaS product — from zero to first conversation in about half an hour.
A voice agent named "Aria" that:
The full project ships in three files: an agent definition, the tool implementations, and a Node.js entry point.
If you have not set up the API key yet, install the setup-api-key skill first and ask your agent: "Use the setup-api-key skill to configure ElevenLabs." The skill walks through key creation in the dashboard and verifies the configuration before continuing.
Open Claude Code in an empty directory and prompt:
Use the elevenlabs voice-agents skill to scaffold a customer support agent named Aria. The agent should be friendly, professional, and capable of looking up accounts and creating tickets. Use English (US) and the voice "Sarah".
The skill emits a configuration block that looks roughly like this:
# agent.yaml
name: Aria
voice: sarah
language: en-US
system_prompt: |
  You are Aria, a customer support agent for Acme SaaS. Be friendly, concise,
  and professional. When a customer provides an email, use the lookup_account
  tool. If you cannot resolve their issue, use the create_ticket tool.
  Hand off to a human when explicitly requested.
tools:
  - name: lookup_account
    description: Look up a customer account by email address
    parameters:
      email: string
  - name: create_ticket
    description: Create a support ticket for the current customer
    parameters:
      summary: string
      severity: enum [low, medium, high]
Save it. The skill will reference this file when generating the runtime code.
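The parameter declarations in agent.yaml map naturally onto TypeScript types. The following is a minimal sketch, assuming tool arguments reach your handlers as parsed but untyped JSON. The type names and the two guard functions are hypothetical helpers for illustration, not part of the skill or the ElevenLabs SDK.

```typescript
// Hypothetical types mirroring the tool parameters declared in agent.yaml.
type Severity = 'low' | 'medium' | 'high'

interface LookupAccountArgs {
  email: string
}

interface CreateTicketArgs {
  summary: string
  severity: Severity
}

const SEVERITIES: readonly Severity[] = ['low', 'medium', 'high']

// Narrow an unknown value to a plain object we can inspect field by field.
function asRecord(value: unknown): Record<string, unknown> | null {
  return typeof value === 'object' && value !== null
    ? (value as Record<string, unknown>)
    : null
}

// Runtime guard for lookup_account arguments.
function isLookupAccountArgs(value: unknown): value is LookupAccountArgs {
  const v = asRecord(value)
  return v !== null && typeof v.email === 'string'
}

// Runtime guard for create_ticket arguments, including the severity enum.
function isCreateTicketArgs(value: unknown): value is CreateTicketArgs {
  const v = asRecord(value)
  if (v === null) return false
  return (
    typeof v.summary === 'string' &&
    SEVERITIES.some((s) => s === v.severity)
  )
}
```

Checking arguments at the boundary keeps a malformed tool call from reaching your data layer with a severity the ticket system does not understand.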
The agent calls tools through your application code. Prompt your assistant:
Generate the tool implementations for lookup_account and create_ticket. Use a stub data layer for now — we'll wire it to a real database after the agent is working.
You will get something like:
// tools.ts
export async function lookupAccount({ email }: { email: string }) {
  // Stub: replace with real DB query later
  if (email === 'demo@acme.com') {
    return { found: true, plan: 'pro', mrr: 99 }
  }
  return { found: false }
}

export async function createTicket({
  summary,
  severity,
}: {
  summary: string
  severity: 'low' | 'medium' | 'high'
}) {
  // Stub: replace with real ticket-system call later
  const id = `TICK-${Math.floor(Math.random() * 10000)}`
  console.log(`[ticket created] ${id} (${severity}): ${summary}`)
  return { ticketId: id }
}
Stubs are deliberate. You want to validate the conversational flow before wiring real systems. The skill knows this and scaffolds stubs by default.
This is the part that used to be the hardest. Ask the assistant:
Generate the streaming entry point that connects to ElevenLabs over WebSocket and routes tool calls to my tool implementations.
The skill emits something like:
// index.ts
import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js'
import { lookupAccount, createTicket } from './tools'

const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY! })

const session = await client.agents.startSession({
  agentId: process.env.AGENT_ID!,
  toolHandlers: {
    lookup_account: lookupAccount,
    create_ticket: createTicket,
  },
})

session.on('audio', (chunk) => {
  // Stream chunk to the caller's audio output
})

session.on('user_speech', (transcript) => {
  console.log(`[caller] ${transcript}`)
})

session.on('agent_speech', (transcript) => {
  console.log(`[aria] ${transcript}`)
})

session.on('end', (summary) => {
  console.log('[call ended]', summary)
})
Notice what is not in this file: WebSocket framing, audio buffering, turn-detection, barge-in handling, retry logic. The skill wraps all of that inside startSession. You are left with the parts that actually matter to your application — what the tools do and where the audio goes.
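To make the routing idea concrete, here is a stand-alone sketch of what mapping a tool name to a handler can look like. It is not how startSession is implemented; `ToolCall`, `ToolResult`, and `dispatchToolCall` are hypothetical names used only to illustrate the pattern, and the shape of a tool call is an assumption.

```typescript
// A handler takes parsed arguments and resolves to any JSON-serializable result.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>

// Assumed shape of a tool call: a tool name plus its arguments.
interface ToolCall {
  name: string
  args: Record<string, unknown>
}

interface ToolResult {
  ok: boolean
  result?: unknown
  error?: string
}

// Look up the handler by name, run it, and turn failures into a result
// object so one bad tool call does not tear down the whole conversation.
async function dispatchToolCall(
  handlers: Record<string, ToolHandler>,
  call: ToolCall,
): Promise<ToolResult> {
  const known = Object.prototype.hasOwnProperty.call(handlers, call.name)
  const handler = known ? handlers[call.name] : undefined
  if (!handler) {
    return { ok: false, error: `unknown tool: ${call.name}` }
  }
  try {
    return { ok: true, result: await handler(call.args) }
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) }
  }
}
```

Returning an error object instead of throwing gives the agent something it can say back to the caller, such as an apology and an offer to create a ticket.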
For local testing, the easiest path is the dashboard's "test in browser" affordance. The skill prints a deep link to that page when scaffolding, so you can talk to your agent before wiring real audio I/O.
Once you are ready to drive audio from your own app, the skill scaffolds either a browser client (WebRTC) or a server bridge (Twilio, LiveKit, or Agora). Ask your assistant for the variant you want:
Add a Twilio bridge so we can dial Aria from a phone number.
You will get a webhook handler that bridges Twilio Media Streams to the ElevenLabs session. Drop it into a Vercel Function, point a Twilio number at it, and you have a working phone agent.
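For orientation, the inbound half of such a bridge mostly consists of decoding Twilio's WebSocket messages. The sketch below covers only that step, assuming the documented Media Streams message format, where audio arrives as base64-encoded 8 kHz mu-law in `media.payload`. The function name `decodeTwilioAudio` is hypothetical, and forwarding the bytes to the ElevenLabs session is left out because that depends on what the skill generates.

```typescript
// Subset of Twilio Media Streams messages that matter for audio bridging.
type TwilioMessage =
  | { event: 'connected' }
  | { event: 'start'; streamSid: string }
  | { event: 'media'; streamSid: string; media: { payload: string } }
  | { event: 'stop'; streamSid: string }

// Returns raw audio bytes for media frames, or null for control messages.
function decodeTwilioAudio(raw: string): Buffer | null {
  const msg = JSON.parse(raw) as TwilioMessage
  if (msg.event !== 'media') {
    return null
  }
  // The payload is base64 text; decode it to the underlying mu-law bytes.
  return Buffer.from(msg.media.payload, 'base64')
}
```

A real handler would also use the `start` and `stop` events to open and close the agent session for that call.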
Once the conversational flow is right, swap the stub tools for real implementations. This is the cheap part. Most teams find that 80% of the work was the wiring — now removed — and the remaining 20% is straightforward database calls.
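One way to keep that swap cheap is to put the data layer behind a small interface, so the tool function does not change when the storage does. This is a sketch under that assumption: `AccountStore`, `InMemoryAccountStore`, and `makeLookupAccount` are hypothetical names, and the in-memory store stands in for whatever database client you end up using.

```typescript
interface Account {
  plan: string
  mrr: number
}

// Hypothetical storage boundary: the tool depends on this, not on a database.
interface AccountStore {
  findByEmail(email: string): Promise<Account | null>
}

// Stub implementation backed by a Map; a real one would query your database.
class InMemoryAccountStore implements AccountStore {
  private readonly accounts: Map<string, Account>

  constructor(accounts: Map<string, Account>) {
    this.accounts = accounts
  }

  async findByEmail(email: string): Promise<Account | null> {
    // Normalize so ' Demo@Acme.com ' and 'demo@acme.com' match.
    return this.accounts.get(email.trim().toLowerCase()) ?? null
  }
}

// Build the lookup_account tool from any store that satisfies the interface.
function makeLookupAccount(store: AccountStore) {
  return async ({ email }: { email: string }) => {
    const account = await store.findByEmail(email)
    return account
      ? { found: true as const, ...account }
      : { found: false as const }
  }
}

const lookupAccount = makeLookupAccount(
  new InMemoryAccountStore(
    new Map<string, Account>([['demo@acme.com', { plan: 'pro', mrr: 99 }]]),
  ),
)
```

Moving to production then means writing a second class that implements `AccountStore` and passing it to `makeLookupAccount`; the tool's return shape stays the same.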
Latency feels off. The skill defaults to WebSocket streaming, which is fine for development but adds 200-400ms of buffering. For production, switch to WebRTC. The skill knows how — ask it.
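Before switching transports, it helps to measure what "off" means. The sketch below records the gap between the end of a caller turn and the first agent audio chunk. It assumes the `user_speech` event fires when the caller finishes speaking, which is worth verifying against your generated code; `TurnLatencyTracker` is a hypothetical helper, and timestamps are passed in (for example from `performance.now()`) so it is easy to test.

```typescript
// Tracks the delay between the end of caller speech and the first agent audio.
class TurnLatencyTracker {
  private userDoneAt: number | null = null
  private readonly samples: number[] = []

  // Call when the caller's transcript arrives (assumed to mark end of turn).
  markUserDone(nowMs: number): void {
    this.userDoneAt = nowMs
  }

  // Call on every audio chunk; only the first chunk after a caller turn counts.
  markAgentAudio(nowMs: number): void {
    if (this.userDoneAt === null) return
    this.samples.push(nowMs - this.userDoneAt)
    this.userDoneAt = null
  }

  // Nearest-rank percentile over recorded samples, or null if there are none.
  percentile(p: number): number | null {
    if (this.samples.length === 0) return null
    const sorted = [...this.samples].sort((a, b) => a - b)
    const rank = Math.min(
      sorted.length,
      Math.max(1, Math.ceil((p / 100) * sorted.length)),
    )
    return sorted[rank - 1]
  }
}
```

Comparing the median and the 95th percentile before and after a transport change gives a firmer basis than how the call feels.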
The agent ignores tools. Usually means the system prompt doesn't tell the agent when to call them. Be explicit: "Whenever a user provides an email address, call lookup_account." Skill-scaffolded prompts include this guidance, but it is worth checking.
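A quick way to do that check is to confirm that every tool name appears in the system prompt, and to generate the explicit rules from one list. Both helpers below, `findUnmentionedTools` and `buildToolGuidance`, are hypothetical and do plain string work; they are not part of the skill.

```typescript
interface ToolRule {
  tool: string
  when: string
}

// Returns the tool names that never appear in the system prompt.
function findUnmentionedTools(prompt: string, toolNames: string[]): string[] {
  return toolNames.filter((name) => !prompt.includes(name))
}

// Builds one explicit instruction per tool, in the style suggested above.
function buildToolGuidance(rules: ToolRule[]): string {
  return rules.map((r) => `Whenever ${r.when}, call ${r.tool}.`).join('\n')
}

const guidance = buildToolGuidance([
  { tool: 'lookup_account', when: 'a user provides an email address' },
  { tool: 'create_ticket', when: 'you cannot resolve the issue yourself' },
])
```

Appending `guidance` to the system prompt and then running `findUnmentionedTools` over the result should return an empty list.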
Barge-in feels awkward. ElevenLabs handles barge-in by default, but some voices and pacing settings tune it down. The skill exposes interruptible: true on the session — make sure it is set.
Pipe session.on('end', ...) summaries into your analytics layer.
The point of skills is that adding any of these is a one-line prompt to your assistant, not a one-week project.