Hermes Voice Mode: Push-to-Talk in 20 Languages Powered by Claude
A look at Hermes push-to-talk voice mode: 20-language support, the edge-tts output stack, Claude doing the reasoning, and where voice actually beats chat.
Voice is one of those features that sounds like a gimmick until you use it for a half-hour walk and realize your hands-free debugging session was more productive than your last two at the keyboard. Hermes ships push-to-talk voice mode with 20-language support out of the box. Claude handles the reasoning; Hermes handles the plumbing around speech-to-text, text-to-speech, and the conversation state that connects them.
This piece is a spotlight on how voice mode works, which stack sits behind it, and when voice is actually the right input modality.
Key Takeaways
- Hermes voice mode is push-to-talk — you hold a key to speak, release to send. No wake word, no always-listening mic.
- 20 languages are supported across the STT/TTS pipeline.
- Output speech uses edge-tts, the Microsoft Edge neural voices wrapper. STT uses a pluggable pipeline.
- Reasoning is Claude Sonnet 4.6 (or whichever model you configure) — voice does not change the model.
- Voice integrates with the eight Hermes messaging gateways: a voice note from Telegram or WhatsApp can be transcribed and processed like any other input.
- Voice wins for phone walks, driving-adjacent tasks, and hands-busy workflows. It loses for code reading, table scanning, and anything that needs precise formatting.
The Stack
Voice mode is four layers glued together.
Capture. Push-to-talk binds a hotkey (configurable) to a recording session. Audio is captured from the system default input. The recording ends on key release and is sent to the STT layer.
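The capture layer reduces to a small state machine. The sketch below abstracts away the hotkey listener and audio backend entirely (the class and method names are mine, not Hermes internals): key down starts buffering, the audio callback feeds chunks while the key is held, and key up finalizes the clip for the STT layer.

```python
from dataclasses import dataclass, field

@dataclass
class PushToTalkSession:
    """Minimal push-to-talk state machine: key down starts capture,
    key up finalizes the clip and hands it off to STT."""
    recording: bool = False
    frames: list = field(default_factory=list)

    def key_down(self) -> None:
        self.recording = True
        self.frames.clear()

    def feed(self, chunk: bytes) -> None:
        # Called by the audio callback; chunks are dropped unless the key is held.
        if self.recording:
            self.frames.append(chunk)

    def key_up(self) -> bytes:
        # Release ends the clip cleanly; the joined buffer goes to transcription.
        self.recording = False
        return b"".join(self.frames)

session = PushToTalkSession()
session.feed(b"\xff")  # ignored: key not yet pressed
session.key_down()
session.feed(b"\x01\x02")
session.feed(b"\x03")
clip = session.key_up()
```

The "recording ends on key release" property falls directly out of this shape: there is no voice-activity detector deciding when you stopped talking.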
Speech-to-text. The STT layer is pluggable. Depending on configuration, Hermes routes audio to a local Whisper model, a cloud STT provider, or another recognizer. The 20-language coverage tracks what the chosen STT backend supports; the default pipeline handles the major Latin, CJK, and Indic families.
Reasoning. The transcript becomes a normal user message to Claude. This is critical: voice input is a transport layer, not a different agent. The same skills, the same memory, the same MCP tools all apply.
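Concretely, the transcript is wrapped exactly like typed input. A sketch of the payload construction, assuming the Anthropic Python SDK on the reasoning side (the helper function is mine; the commented call matches the SDK's documented usage):

```python
def voice_turn_payload(transcript: str, history: list[dict]) -> dict:
    """Build the same request body a typed message would produce.
    Nothing here knows the text came from a microphone."""
    messages = history + [{"role": "user", "content": transcript}]
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "messages": messages,
    }

payload = voice_turn_payload("summarize my open PRs", history=[])

# With the official SDK this becomes (requires ANTHROPIC_API_KEY):
#   import anthropic
#   reply = anthropic.Anthropic().messages.create(**payload)
```

Because the model sees an ordinary user message, skills, memory, and MCP tools need no voice-specific handling.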
Text-to-speech. Responses render through edge-tts, which wraps the Microsoft Edge neural voices. edge-tts is free, fast, and sounds better than most self-hosted alternatives for general-purpose narration. Voice selection is language-aware; the agent's reply language drives TTS voice choice.
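Language-aware voice selection can be as simple as a lookup from the reply's language code to an Edge neural voice. The mapping below is an illustrative subset, not Hermes's actual table; the `edge_tts.Communicate(...).save(...)` pattern follows the library's documented usage.

```python
import asyncio

# Illustrative subset; edge-tts exposes hundreds of neural voices.
VOICE_FOR_LANGUAGE = {
    "en": "en-US-AriaNeural",
    "es": "es-ES-ElviraNeural",
    "de": "de-DE-KatjaNeural",
    "ja": "ja-JP-NanamiNeural",
}

def pick_voice(lang: str, fallback: str = "en-US-AriaNeural") -> str:
    """Map a language tag like 'de-DE' or 'de' to a TTS voice."""
    return VOICE_FOR_LANGUAGE.get(lang.split("-")[0].lower(), fallback)

async def speak(text: str, lang: str, out_path: str = "reply.mp3") -> None:
    import edge_tts  # pip install edge-tts
    communicate = edge_tts.Communicate(text, pick_voice(lang))
    await communicate.save(out_path)  # network call to the Edge TTS service

# asyncio.run(speak("Hallo!", "de-DE"))  # writes reply.mp3
```

The point of keying off the reply's language rather than the input's is that a translation exchange gets spoken back in the target language's voice.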
A rough config excerpt for the reasoning half:
model:
  provider: "anthropic"
  default: "claude-sonnet-4-6"
The voice pipeline adds its own block alongside this; the reasoning config is unchanged from the text-only case.
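For orientation, a voice block might sit alongside the model block like this. The key names here are hypothetical, sketched to show the shape of capture/STT/TTS settings; check the Hermes docs for the real schema.

```yaml
# Hypothetical shape only -- actual key names may differ.
voice:
  push_to_talk_key: "f9"
  stt:
    backend: "whisper-local"
    language: "auto"
  tts:
    engine: "edge-tts"
    voice: "en-US-AriaNeural"
```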
Why Push-to-Talk
Always-listening wake-word voice assistants made sense for smart speakers in kitchens. They make less sense for a power-user agent on a workstation or phone. Three reasons push-to-talk is the better default:
- Privacy. The mic is off by default. You decide when the agent hears you.
- Precision. You speak in one uninterrupted block. No ambiguity about whether the agent is still listening.
- Latency. There is no wake-word detector burning cycles. Hotkey down, speak, release — the recording ends cleanly and transcription starts immediately.
The trade-off is you need a hotkey. On desktop that is trivial. On phone, the messaging-gateway integration solves this differently — there is no hotkey; you just send a voice note.
Voice Notes Through Messaging Gateways
Hermes ships eight messaging gateways: Telegram, Discord, Slack, WhatsApp, Signal, Matrix, Feishu, and DingTalk. All eight handle voice notes.
Workflow:
- You open Telegram on your phone, hold the voice-note button, speak, release.
- Telegram delivers the audio to your Hermes bot.
- Hermes runs the audio through its STT pipeline.
- The transcribed text becomes a user message to Claude.
- Claude's reply renders as text, and optionally as a voice note back to you using edge-tts.
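Stitched together, the gateway round trip is a pipeline of three calls. The function below is a sketch of that composition; the `transcribe`, `ask_model`, and `synthesize` callables stand in for whatever STT, model, and TTS plumbing is configured.

```python
from typing import Callable

def handle_voice_note(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    ask_model: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> tuple[str, bytes]:
    """Voice note in -> (text reply, audio reply) out."""
    transcript = transcribe(audio)        # STT layer
    reply_text = ask_model(transcript)    # Claude, same path as typed input
    reply_audio = synthesize(reply_text)  # edge-tts layer
    return reply_text, reply_audio

# Trivial stand-ins, just to show the flow end to end:
text, audio = handle_voice_note(
    b"status?",
    transcribe=lambda a: a.decode(),
    ask_model=lambda t: f"echo: {t}",
    synthesize=lambda t: t.encode(),
)
```

Each gateway only has to deliver audio bytes in and accept audio bytes out; the middle of the pipeline is identical across all eight.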
For a half-hour walk where you want to think out loud with your agent, this is the practical deployment. Your phone is the client; your VPS-hosted Hermes is the reasoning layer. See Installing Hermes Agent on a $5 VPS for the hosting side, and Hermes Messaging Gateways: Telegram and Discord for the gateway configuration.
When Voice Wins
Voice earns its keep in specific situations, not universally.
- Mobile, hands-occupied work. Walking, driving-adjacent tasks, cooking — anywhere a keyboard is unavailable.
- Narrative thinking. "Here is the shape of the problem; talk me through options." Voice encourages longer, more coherent user prompts than typing does.
- Language-learning and translation. Speaking in one language and getting responses in another, both at conversational speed.
- Accessibility. For users with hand or vision limitations, voice removes a hard barrier.
When Voice Loses
Equally important to know when not to use it.
- Code reading. Syntax read aloud is painful. "Def underscore underscore init underscore underscore" is not how anyone wants to hear code.
- Tables and structured output. Voice cannot pronounce a markdown table in any useful way.
- Precise editing. Asking voice to change line 47 column 12 of a file is masochism. Switch to chat.
- Noisy environments. STT accuracy drops sharply with background noise. Push-to-talk partially mitigates this by bracketing the recording, but there are limits.
The practical pattern is mixed-mode: voice for thinking and direction-setting, chat for execution and review.
The 20-Language Reality
"20 languages" is an advertised number, but language quality is uneven across the pipeline. A few things to know:
- STT quality is best for English, Mandarin, Spanish, French, German, and Japanese. These have the most training data across most STT backends.
- TTS quality via edge-tts is broadly good — Microsoft's neural voices cover 20-plus major languages with near-native-quality output.
- Claude's own multilingual understanding is the ceiling. Sonnet 4.6 handles the major world languages well; obscure languages work but with reduced fluency.
If your workflow requires a specific language, test end-to-end before committing. The pipeline as a whole is only as good as its weakest layer for that language.
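One quick check on the TTS side is whether edge-tts even ships a voice for your locale. `edge_tts.list_voices()` is the library's real listing call; the filter below is a small helper of mine over the `ShortName`/`Locale` fields its entries carry.

```python
import asyncio

def voices_for_locale(voices: list[dict], locale_prefix: str) -> list[str]:
    """Filter an edge-tts voice listing down to one language."""
    return [
        v["ShortName"]
        for v in voices
        if v.get("Locale", "").lower().startswith(locale_prefix.lower())
    ]

async def check(locale_prefix: str) -> list[str]:
    import edge_tts  # pip install edge-tts
    return voices_for_locale(await edge_tts.list_voices(), locale_prefix)

# print(asyncio.run(check("hi-")))  # Hindi voices, or [] if unsupported

# Offline sanity check against a hand-written sample listing:
sample = [
    {"ShortName": "hi-IN-SwaraNeural", "Locale": "hi-IN"},
    {"ShortName": "en-US-AriaNeural", "Locale": "en-US"},
]
matches = voices_for_locale(sample, "hi-")
```

This only verifies one layer, of course; the STT backend and Claude's fluency in the language still need their own end-to-end pass.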
Bringing It Back Together
Voice mode is not a rebrand of chat. It is a genuinely different way to work with the same underlying agent. The same skills activate. The same memory persists. The same tools run. What changes is bandwidth and posture: voice is lower-bandwidth but higher-mobility, and some tasks fit that trade-off perfectly.
For the broader architectural picture of where voice slots into the Hermes stack, see What is Hermes Agent: a Claude-Compatible Runtime.
Sources
- Hermes Agent repository — https://github.com/NousResearch/hermes-agent
- Hermes documentation — https://hermes-agent.nousresearch.com/docs/
- edge-tts — https://github.com/rany2/edge-tts
- Anthropic Claude documentation — https://docs.anthropic.com/claude
- Series: Installing Hermes Agent on a $5 VPS
- Series: Hermes Messaging Gateways: Telegram and Discord
- Series: What is Hermes Agent