Hermes Voice Mode: Push-to-Talk in 20 Languages Powered by Claude
A look at Hermes push-to-talk voice mode: 20-language support, the edge-tts output stack, Claude doing the reasoning, and where voice actually beats chat.
Voice is one of those features that sounds like a gimmick until you use it for a half-hour walk and realize your hands-free debugging session was more productive than your last two at the keyboard. Hermes ships push-to-talk voice mode with 20-language support out of the box. Claude handles the reasoning; Hermes handles the plumbing around speech-to-text, text-to-speech, and the conversation state that connects them.
This piece is a spotlight on how voice mode works, which stack sits behind it, and when voice is actually the right input modality.
Key Takeaways
- Hermes voice mode is push-to-talk — you hold a key to speak, release to send. No wake word, no always-listening mic.
- 20 languages are supported across the STT/TTS pipeline.
- Output speech uses edge-tts, the Microsoft Edge neural voices wrapper. STT uses a pluggable pipeline.
- Reasoning is Claude Sonnet 4.6 (or whichever model you configure) — voice does not change the model.
- Voice integrates with the eight Hermes messaging gateways: a voice note from Telegram or WhatsApp can be transcribed and processed like any other input.
- Voice wins for phone walks, driving-adjacent tasks, and hands-busy workflows. It loses for code reading, table scanning, and anything that needs precise formatting.
The Stack
Voice mode is four layers glued together.
Capture. Push-to-talk binds a hotkey (configurable) to a recording session. Audio is captured from the system default input. The recording ends on key release and is sent to the STT layer.
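The capture layer reduces to a small state machine. The sketch below abstracts away the hotkey listener and audio backend entirely (the class and method names are mine, not Hermes internals): key down starts buffering, the audio callback feeds chunks while the key is held, and key up finalizes the clip for the STT layer.

```python
from dataclasses import dataclass, field

@dataclass
class PushToTalkSession:
    """Minimal push-to-talk state machine: key down starts capture,
    key up finalizes the clip and hands it off to STT."""
    recording: bool = False
    frames: list = field(default_factory=list)

    def key_down(self) -> None:
        self.recording = True
        self.frames.clear()

    def feed(self, chunk: bytes) -> None:
        # Called by the audio callback; chunks are dropped unless the key is held.
        if self.recording:
            self.frames.append(chunk)

    def key_up(self) -> bytes:
        # Release ends the clip cleanly; the joined buffer goes to transcription.
        self.recording = False
        return b"".join(self.frames)

session = PushToTalkSession()
session.feed(b"\xff")  # ignored: key not yet pressed
session.key_down()
session.feed(b"\x01\x02")
session.feed(b"\x03")
clip = session.key_up()
```

The "recording ends on key release" property falls directly out of this shape: there is no voice-activity detector deciding when you stopped talking.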
Speech-to-text. The STT layer is pluggable. Depending on configuration, Hermes routes audio to a local Whisper model, a cloud STT provider, or another recognizer. The 20-language coverage tracks what the chosen STT backend supports; the default pipeline handles the major Latin, CJK, and Indic families.
Reasoning. The transcript becomes a normal user message to Claude. This is critical: voice input is a transport layer, not a different agent. The same skills, the same memory, the same MCP tools all apply.
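Concretely, the transcript is wrapped exactly like typed input. A sketch of the payload construction, assuming the Anthropic Python SDK on the reasoning side (the helper function is mine; the commented call matches the SDK's documented usage):

```python
def voice_turn_payload(transcript: str, history: list[dict]) -> dict:
    """Build the same request body a typed message would produce.
    Nothing here knows the text came from a microphone."""
    messages = history + [{"role": "user", "content": transcript}]
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "messages": messages,
    }

payload = voice_turn_payload("summarize my open PRs", history=[])

# With the official SDK this becomes (requires ANTHROPIC_API_KEY):
#   import anthropic
#   reply = anthropic.Anthropic().messages.create(**payload)
```

Because the model sees an ordinary user message, skills, memory, and MCP tools need no voice-specific handling.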
Text-to-speech. Responses render through edge-tts, which wraps the Microsoft Edge neural voices. edge-tts is free, fast, and sounds better than most self-hosted alternatives for general-purpose narration. Voice selection is language-aware; the agent's reply language drives TTS voice choice.
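Language-aware voice selection can be as simple as a lookup from the reply's language code to an Edge neural voice. The mapping below is an illustrative subset, not Hermes's actual table; the `edge_tts.Communicate(...).save(...)` pattern follows the library's documented usage.

```python
import asyncio

# Illustrative subset; edge-tts exposes hundreds of neural voices.
VOICE_FOR_LANGUAGE = {
    "en": "en-US-AriaNeural",
    "es": "es-ES-ElviraNeural",
    "de": "de-DE-KatjaNeural",
    "ja": "ja-JP-NanamiNeural",
}

def pick_voice(lang: str, fallback: str = "en-US-AriaNeural") -> str:
    """Map a language tag like 'de-DE' or 'de' to a TTS voice."""
    return VOICE_FOR_LANGUAGE.get(lang.split("-")[0].lower(), fallback)

async def speak(text: str, lang: str, out_path: str = "reply.mp3") -> None:
    import edge_tts  # pip install edge-tts
    communicate = edge_tts.Communicate(text, pick_voice(lang))
    await communicate.save(out_path)  # network call to the Edge TTS service

# asyncio.run(speak("Hallo!", "de-DE"))  # writes reply.mp3
```

The point of keying off the reply's language rather than the input's is that a translation exchange gets spoken back in the target language's voice.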
A rough config excerpt for the reasoning half:
model:
  provider: "anthropic"
  default: "claude-sonnet-4-6"
The voice pipeline adds its own block alongside this; the reasoning config is unchanged from the text-only case.
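For orientation, a voice block might sit alongside the model block like this. The key names here are hypothetical, sketched to show the shape of capture/STT/TTS settings; check the Hermes docs for the real schema.

```yaml
# Hypothetical shape only -- actual key names may differ.
voice:
  push_to_talk_key: "f9"
  stt:
    backend: "whisper-local"
    language: "auto"
  tts:
    engine: "edge-tts"
    voice: "en-US-AriaNeural"
```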
Why Push-to-Talk
Always-listening wake-word voice assistants made sense for smart speakers in kitchens. They make less sense for a power-user agent on a workstation or phone. Three reasons push-to-talk is the better default:
- Privacy. The mic is off by default. You decide when the agent hears you.
- Precision. You speak in one uninterrupted block. No ambiguity about whether the agent is still listening.
- Latency. There is no wake-word detector burning cycles. Hotkey down, speak, release — the recording ends cleanly and transcription starts immediately.
The trade-off is you need a hotkey. On desktop that is trivial. On phone, the messaging-gateway integration solves this differently — there is no hotkey; you just send a voice note.
Voice Notes Through Messaging Gateways
Hermes ships eight messaging gateways: Telegram, Discord, Slack, WhatsApp, Signal, Matrix, Feishu, and DingTalk. All eight handle voice notes.
Workflow:
- You open Telegram on your phone, hold the voice-note button, speak, release.
- Telegram delivers the audio to your Hermes bot.
- Hermes runs the audio through its STT pipeline.
- The transcribed text becomes a user message to Claude.
- Claude's reply renders as text, and optionally as a voice note back to you using edge-tts.
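Stitched together, the gateway round trip is a pipeline of three calls. The function below is a sketch of that composition; the `transcribe`, `ask_model`, and `synthesize` callables stand in for whatever STT, model, and TTS plumbing is configured.

```python
from typing import Callable

def handle_voice_note(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    ask_model: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> tuple[str, bytes]:
    """Voice note in -> (text reply, audio reply) out."""
    transcript = transcribe(audio)        # STT layer
    reply_text = ask_model(transcript)    # Claude, same path as typed input
    reply_audio = synthesize(reply_text)  # edge-tts layer
    return reply_text, reply_audio

# Trivial stand-ins, just to show the flow end to end:
text, audio = handle_voice_note(
    b"status?",
    transcribe=lambda a: a.decode(),
    ask_model=lambda t: f"echo: {t}",
    synthesize=lambda t: t.encode(),
)
```

Each gateway only has to deliver audio bytes in and accept audio bytes out; the middle of the pipeline is identical across all eight.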
For a half-hour walk where you want to think out loud with your agent, this is the practical deployment. Your phone is the client; your VPS-hosted Hermes is the reasoning layer. See Installing Hermes Agent on a $5 VPS for the hosting side, and Hermes Messaging Gateways: Telegram and Discord for the gateway configuration.
When Voice Wins
Voice earns its keep in specific situations, not universally.
- Mobile, hands-occupied work. Walking, driving-adjacent tasks, cooking — anywhere a keyboard is unavailable.
- Narrative thinking. "Here is the shape of the problem; talk me through options." Voice encourages longer, more coherent user prompts than typing does.
- Language-learning and translation. Speaking in one language and getting responses in another, both at conversational speed.
- Accessibility. For users with hand or vision limitations, voice removes a hard barrier.
When Voice Loses
Equally important to know when not to use it.
- Code reading. Syntax read aloud is painful. "Def underscore underscore init underscore underscore" is not how anyone wants to hear code.
- Tables and structured output. Voice cannot pronounce a markdown table in any useful way.
- Precise editing. Asking voice to change line 47 column 12 of a file is masochism. Switch to chat.
- Noisy environments. STT accuracy drops sharply with background noise. Push-to-talk partially mitigates this by bracketing the recording, but there are limits.
The practical pattern is mixed-mode: voice for thinking and direction-setting, chat for execution and review.
The 20-Language Reality
"20 languages" is an advertised number, but language quality is uneven across the pipeline. A few things to know:
- STT quality is best for English, Mandarin, Spanish, French, German, and Japanese. These have the most training data across most STT backends.
- TTS quality via edge-tts is broadly good — Microsoft's neural voices cover 20-plus major languages with near-native-quality output.
- Claude's own multilingual understanding is the ceiling. Sonnet 4.6 handles the major world languages well; obscure languages work but with reduced fluency.
If your workflow requires a specific language, test end-to-end before committing. The pipeline as a whole is only as good as its weakest layer for that language.
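One quick check on the TTS side is whether edge-tts even ships a voice for your locale. `edge_tts.list_voices()` is the library's real listing call; the filter below is a small helper of mine over the `ShortName`/`Locale` fields its entries carry.

```python
import asyncio

def voices_for_locale(voices: list[dict], locale_prefix: str) -> list[str]:
    """Filter an edge-tts voice listing down to one language."""
    return [
        v["ShortName"]
        for v in voices
        if v.get("Locale", "").lower().startswith(locale_prefix.lower())
    ]

async def check(locale_prefix: str) -> list[str]:
    import edge_tts  # pip install edge-tts
    return voices_for_locale(await edge_tts.list_voices(), locale_prefix)

# print(asyncio.run(check("hi-")))  # Hindi voices, or [] if unsupported

# Offline sanity check against a hand-written sample listing:
sample = [
    {"ShortName": "hi-IN-SwaraNeural", "Locale": "hi-IN"},
    {"ShortName": "en-US-AriaNeural", "Locale": "en-US"},
]
matches = voices_for_locale(sample, "hi-")
```

This only verifies one layer, of course; the STT backend and Claude's fluency in the language still need their own end-to-end pass.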
Bringing It Back Together
Voice mode is not a rebrand of chat. It is a genuinely different way to work with the same underlying agent. The same skills activate. The same memory persists. The same tools run. What changes is bandwidth and posture: voice is lower-bandwidth but higher-mobility, and some tasks fit that trade-off perfectly.
For the broader architectural picture of where voice slots into the Hermes stack, see What is Hermes Agent: a Claude-Compatible Runtime.
Sources
- Hermes Agent repository — https://github.com/NousResearch/hermes-agent
- Hermes documentation — https://hermes-agent.nousresearch.com/docs/
- edge-tts — https://github.com/rany2/edge-tts
- Anthropic Claude documentation — https://docs.anthropic.com/claude
- Series: Installing Hermes Agent on a $5 VPS
- Series: Hermes Messaging Gateways: Telegram and Discord
- Series: What is Hermes Agent