Build a Voice AI Agent in 30 Minutes with the ElevenLabs Skill
Step-by-step tutorial for shipping a working voice AI agent using the official ElevenLabs voice-agents skill. Covers persona, tools, streaming, and turn-taking.
A year ago, scaffolding a real-time voice agent was a multi-week project. You needed to wire WebSocket streams, handle turn-detection, manage barge-in, integrate a tool layer, deal with audio buffering on both ends, and somehow make the latency acceptable. Most teams gave up after the prototype.
The official ElevenLabs Voice Agents skill compresses that wiring layer into something your AI coding assistant can scaffold for you. This tutorial walks through building a working voice agent — a customer-support bot for a fictional SaaS product — from zero to first conversation in about half an hour.
Key Takeaways
- The voice-agents skill scaffolds a full conversational agent, including persona, tool calls, and streaming setup.
- You configure the agent in the ElevenLabs dashboard and the skill emits the wiring code your app needs.
- Tool calling is first-class — the agent can look up customers, create tickets, or trigger workflows mid-conversation.
- Turn-taking and barge-in are handled by the skill, not your application code.
- Local development uses the streaming WebSocket; production deployments can stream over WebRTC for sub-200ms latency.
What you'll build
A voice agent named "Aria" that:
- Greets a caller and asks how it can help.
- Looks up account information when given an email address (tool call).
- Creates a support ticket when the issue cannot be resolved (tool call).
- Hands off to a human when the caller asks for one.
The full project ships in three files: an agent definition, a tool layer, and a Node.js entry point.
Prerequisites
- An ElevenLabs account with the voice-agents skill installed
- An ELEVENLABS_API_KEY configured
- Node.js 20 or newer
- A Claude Code or Cursor workspace
If you have not set up the API key yet, install the setup-api-key skill first and ask your agent: "Use the setup-api-key skill to configure ElevenLabs." The skill walks through key creation in the dashboard and verifies the configuration before continuing.
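If you prefer to verify the key programmatically, a two-line guard at the top of your entry point fails fast. This sketch assumes only the standard environment variable:

// preflight: fail fast when the key is missing
if (!process.env.ELEVENLABS_API_KEY) {
  throw new Error('ELEVENLABS_API_KEY is not set. Run the setup-api-key skill first.')
}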
Step 1 — Define the agent
Open Claude Code in an empty directory and prompt:
Use the elevenlabs voice-agents skill to scaffold a customer support agent named Aria. The agent should be friendly, professional, and capable of looking up accounts and creating tickets. Use English (US) and the voice "Sarah".
The skill emits a configuration block that looks roughly like this:
# agent.yaml
name: Aria
voice: sarah
language: en-US
system_prompt: |
  You are Aria, a customer support agent for Acme SaaS. Be friendly, concise,
  and professional. When a customer provides an email, use the lookup_account
  tool. If you cannot resolve their issue, use the create_ticket tool.
  Hand off to a human when explicitly requested.
tools:
  - name: lookup_account
    description: Look up a customer account by email address
    parameters:
      email: string
  - name: create_ticket
    description: Create a support ticket for the current customer
    parameters:
      summary: string
      severity: enum [low, medium, high]
Save it. The skill will reference this file when generating the runtime code.
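Before moving on, it is worth confirming the file parses cleanly. A minimal sanity check, assuming the yaml package from npm:

// check-agent.ts: quick check that agent.yaml is valid YAML
import { readFileSync } from 'node:fs'
import { parse } from 'yaml'

const agent = parse(readFileSync('agent.yaml', 'utf8'))
console.log(`Loaded agent "${agent.name}" with ${agent.tools.length} tool(s)`)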
Step 2 — Implement the tool layer
The agent calls tools through your application code. Prompt your assistant:
Generate the tool implementations for lookup_account and create_ticket. Use a stub data layer for now — we'll wire it to a real database after the agent is working.
You will get something like:
// tools.ts
export async function lookupAccount({ email }: { email: string }) {
// Stub — replace with real DB query later
if (email === 'demo@acme.com') {
return { found: true, plan: 'pro', mrr: 99 }
}
return { found: false }
}
export async function createTicket({
summary,
severity,
}: {
summary: string
severity: 'low' | 'medium' | 'high'
}) {
// Stub — replace with real ticket-system call later
const id = `TICK-${Math.floor(Math.random() * 10000)}`
console.log(`[ticket created] ${id} (${severity}): ${summary}`)
return { ticketId: id }
}
Stubs are deliberate. You want to validate the conversational flow before wiring real systems. The skill knows this and scaffolds stubs by default.
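Because the stubs are plain async functions, you can smoke-test them before the agent is even involved. A minimal check (a hypothetical smoke-test.ts, run with tsx or ts-node):

// smoke-test.ts: exercise the stub tools without starting a session
import { lookupAccount, createTicket } from './tools'

console.log(await lookupAccount({ email: 'demo@acme.com' })) // { found: true, plan: 'pro', mrr: 99 }
console.log(await lookupAccount({ email: 'nobody@acme.com' })) // { found: false }
console.log(await createTicket({ summary: 'Cannot export CSV', severity: 'medium' }))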
Step 3 — Wire the streaming entry point
This is the part that used to be the hardest. Ask the assistant:
Generate the streaming entry point that connects to ElevenLabs over WebSocket and routes tool calls to my tool implementations.
The skill emits something like:
// index.ts
import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js'
import { lookupAccount, createTicket } from './tools'
const client = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY! })
const session = await client.agents.startSession({
agentId: process.env.AGENT_ID!,
toolHandlers: {
lookup_account: lookupAccount,
create_ticket: createTicket,
},
})
session.on('audio', (chunk) => {
// Stream chunk to the caller's audio output
})
session.on('user_speech', (transcript) => {
console.log(`[caller] ${transcript}`)
})
session.on('agent_speech', (transcript) => {
console.log(`[aria] ${transcript}`)
})
session.on('end', (summary) => {
console.log('[call ended]', summary)
})
Notice what is not in this file: WebSocket framing, audio buffering, turn-detection, barge-in handling, retry logic. The skill wraps all of that inside startSession. You are left with the parts that actually matter to your application — what the tools do and where the audio goes.
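To make the audio handler concrete, one way to fill it in during development is to dump the chunks to a file you can inspect. This extends the session from the scaffold above and assumes chunks arrive as Node Buffers; the encoding depends on your session's output format:

import { createWriteStream } from 'node:fs'

// Development-only sink: capture the agent's audio for inspection
const sink = createWriteStream('aria-output.raw')
session.on('audio', (chunk) => {
  sink.write(chunk)
})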
Step 4 — Run a local conversation
For local testing, the easiest path is the dashboard's "test in browser" page. The skill prints a deep link to it when scaffolding, so you can talk to your agent before wiring up real audio I/O.
Once you are ready to drive audio from your own app, the skill scaffolds either a browser client (WebRTC) or a server bridge (Twilio, LiveKit, or Agora). Ask your assistant for the variant you want:
Add a Twilio bridge so we can dial Aria from a phone number.
You will get a webhook handler that bridges Twilio Media Streams to the ElevenLabs session. Drop it into a Vercel Function, point a Twilio number at it, and you have a working phone agent.
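The bridge itself comes from the skill, but the Twilio-facing half is small. Here is a sketch of the webhook's TwiML response, assuming a Vercel function and a hypothetical bridge URL:

// api/voice.ts: answer the call and open a Media Stream to the bridge
import type { VercelRequest, VercelResponse } from '@vercel/node'

export default function handler(req: VercelRequest, res: VercelResponse) {
  res.setHeader('Content-Type', 'text/xml')
  res.send(`<Response>
  <Connect>
    <Stream url="wss://your-bridge.example.com/media" />
  </Connect>
</Response>`)
}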
Step 5 — Replace stubs with real systems
Once the conversational flow is right, swap the stub tools for real implementations. This is the cheap part. Most teams find that 80% of the work was the wiring — now removed — and the remaining 20% is straightforward database calls.
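The swap preserves the tool signatures, so nothing else changes. A sketch of a production lookupAccount, where db stands in for whatever client you already use (Prisma, pg, and so on):

// tools.ts (production): same shape, real data layer
import { db } from './db' // hypothetical database module

export async function lookupAccount({ email }: { email: string }) {
  const account = await db.accounts.findByEmail(email)
  return account
    ? { found: true, plan: account.plan, mrr: account.mrr }
    : { found: false }
}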
Common pitfalls
Latency feels off. The skill defaults to WebSocket streaming, which is fine for development but adds 200-400ms of buffering. For production, switch to WebRTC. The skill knows how — ask it.
The agent ignores tools. Usually means the system prompt doesn't tell the agent when to call them. Be explicit: "Whenever a user provides an email address, call lookup_account." Skill-scaffolded prompts include this guidance, but it is worth checking.
Barge-in feels awkward. ElevenLabs handles barge-in by default, but some voices and pacing settings tune it down. The skill exposes interruptible: true on the session — make sure it is set.
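If it has been turned off, re-enabling it is a one-line change in the entry point (assuming, as in the scaffold above, that the flag is passed as a session option):

const session = await client.agents.startSession({
  agentId: process.env.AGENT_ID!,
  interruptible: true, // let callers cut Aria off mid-sentence
  toolHandlers: {
    lookup_account: lookupAccount,
    create_ticket: createTicket,
  },
})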
Where to go next
- Add a knowledge base — the agents skill supports retrieving from a vector store mid-conversation.
- Add analytics — pipe session.on('end', ...) summaries into your analytics layer (see the sketch below).
- Layer in voice-isolator — pre-process inbound audio if your callers are in noisy environments.
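For the analytics hook, the end event already hands you a summary object, so forwarding it is all that is left. A sketch, with trackEvent standing in for your analytics client:

import { trackEvent } from './analytics' // hypothetical wrapper around your analytics SDK

session.on('end', (summary) => {
  trackEvent('voice_call_ended', summary)
})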
The point of skills is that adding any of these is a one-line prompt to your assistant, not a one-week project.