Speech Gateway
Local ASR and TTS engine powered by whisper.cpp — no cloud, no data leakage.
The Speech Gateway handles all speech processing — recognition (ASR) and synthesis (TTS). It runs entirely on your infrastructure, ensuring that audio data never leaves your network. This is the component that makes voicetyped viable for regulated industries, classified environments, and privacy-sensitive deployments.
Responsibilities
- Automatic Speech Recognition (ASR) — Real-time transcription using whisper.cpp
- Voice Activity Detection (VAD) — Detects when the caller is speaking
- Audio Segmentation — Splits continuous audio into utterances for transcription
- Partial Transcripts — Streams interim results for responsive UX
- Final Transcripts — Delivers complete, punctuated transcripts
- Text-to-Speech (TTS) — Renders text responses as audio
- Worker Pool — Manages per-call ASR workers for concurrent processing
Configuration
```yaml
# /etc/voice-gateway/config.yaml — speech section
speech:
  # ASR Engine
  engine: whisper            # whisper (default), faster-whisper
  model: whisper-medium      # Model name (see model table)
  model_dir: /var/lib/voice-gateway/models/
  language: en               # ISO 639-1 language code

  # GPU Configuration
  gpu: auto                  # auto, true, false
  gpu_device: 0              # GPU device index
  gpu_layers: -1             # -1 = all layers on GPU

  # VAD Configuration
  vad:
    enabled: true
    threshold: 0.5           # Speech probability threshold (0.0–1.0)
    min_speech_ms: 250       # Minimum speech duration to trigger
    min_silence_ms: 500      # Silence duration to end utterance
    padding_ms: 200          # Padding added around speech segments

  # Transcription
  partial_results: true      # Stream interim/partial transcripts
  partial_interval_ms: 300   # How often to emit partial results
  beam_size: 5               # Beam search width (higher = more accurate, slower)
  temperature: 0.0           # Sampling temperature (0 = greedy)

  # Worker Pool
  max_workers: 4             # Maximum concurrent ASR workers
  worker_timeout: 30s        # Worker idle timeout
  queue_depth: 10            # Maximum queued audio segments

  # TTS Configuration
  tts:
    engine: piper            # piper (default), espeak
    voice: en_US-amy-medium  # Voice model name
    sample_rate: 22050       # Output sample rate
    speed: 1.0               # Speaking speed multiplier
```
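If you template this file with configuration management, a quick structural check can catch typos before a restart. The snippet below is an illustrative sketch using PyYAML; it assumes the field names shown above and is not an official schema validator.
```python
# Illustrative sanity check for the speech section (assumes PyYAML is installed).
import yaml

with open("/etc/voice-gateway/config.yaml") as f:
    cfg = yaml.safe_load(f)

speech = cfg["speech"]
assert speech["engine"] in ("whisper", "faster-whisper"), "unknown ASR engine"
assert 0.0 <= speech["vad"]["threshold"] <= 1.0, "vad.threshold must be 0.0-1.0"
assert speech["max_workers"] >= 1, "max_workers must be at least 1"
print(f"ASR: {speech['engine']} ({speech['model']}), "
      f"TTS: {speech['tts']['engine']} ({speech['tts']['voice']})")
```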
ASR Engine
whisper.cpp
The default ASR engine is whisper.cpp, a C++ port of OpenAI’s Whisper model. It runs on CPU or GPU and provides excellent accuracy for most languages.
How it works:
- Audio arrives as 16kHz mono PCM chunks from the Media Gateway
- VAD detects speech segments and buffers them
- Complete utterances are sent to whisper.cpp for transcription
- Partial results are emitted at configurable intervals during long utterances
- Final results include the complete transcript with timing information
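The sketch below illustrates this partial/final cadence: audio accumulates in a buffer, an interim transcription is emitted every `partial_interval_ms` while speech continues, and a final transcript is produced when VAD signals end-of-utterance. The `transcribe()` stub stands in for the whisper.cpp call and `run_utterance` is a hypothetical illustration, not the gateway's actual API.
```python
import time

PARTIAL_INTERVAL_MS = 300  # mirrors speech.partial_interval_ms

def transcribe(pcm_buffer: bytes) -> str:
    """Placeholder for the whisper.cpp call; returns text for the buffered audio."""
    return f"<transcript of {len(pcm_buffer)} bytes>"

def run_utterance(chunks, is_end_of_utterance):
    """Consume 16 kHz mono PCM chunks; emit partial transcripts while speech
    continues and one final transcript when the utterance ends."""
    buffer = bytearray()
    last_partial = time.monotonic()
    for chunk in chunks:
        buffer.extend(chunk)
        now = time.monotonic()
        if (now - last_partial) * 1000 >= PARTIAL_INTERVAL_MS:
            yield {"type": "partial", "text": transcribe(bytes(buffer))}
            last_partial = now
        if is_end_of_utterance(chunk):
            break
    yield {"type": "final", "text": transcribe(bytes(buffer))}
```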
Model Selection
| Model | Parameters | Size | Speed (CPU) | Speed (GPU) | Quality |
|---|---|---|---|---|---|
| whisper-tiny | 39M | 75 MB | ~10x real-time | ~32x | Fair |
| whisper-base | 74M | 142 MB | ~7x real-time | ~25x | Good |
| whisper-small | 244M | 466 MB | ~4x real-time | ~15x | Better |
| whisper-medium | 769M | 1.5 GB | ~2x real-time | ~10x | High |
| whisper-large-v3 | 1550M | 3.1 GB | ~0.5x real-time | ~5x | Highest |
Recommendation: Use `whisper-medium` for production. It provides the best balance of accuracy and latency. Use `whisper-base` for development and testing.
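These real-time factors translate directly into turnaround time: processing time is roughly the utterance duration divided by the factor. A back-of-the-envelope helper (illustrative only; actual factors vary with hardware and load):
```python
def turnaround_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Approximate transcription time for one utterance at a given real-time factor."""
    return audio_seconds / realtime_factor

# A 3-second utterance with whisper-medium (~2x real-time on CPU, ~10x on GPU):
print(turnaround_seconds(3.0, 2.0))   # ~1.5 s on CPU
print(turnaround_seconds(3.0, 10.0))  # ~0.3 s on GPU
```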
Model Management
```bash
# List available models
voice-gateway model list

# Download a model
voice-gateway model download whisper-medium

# Check loaded model
voice-gateway model info

# Switch the configured model (takes effect after restart)
voice-gateway config set speech.model whisper-large-v3
```
faster-whisper Backend
For GPU-heavy deployments, you can use faster-whisper as an alternative backend. It uses CTranslate2 for optimized inference.
```yaml
speech:
  engine: faster-whisper
  model: large-v3
  gpu: true
  compute_type: float16   # float16, int8_float16, int8
```
faster-whisper provides:
- ~4x speed improvement over whisper.cpp on GPU
- Lower memory usage via quantization
- Batch inference support
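For reference, the snippet below exercises the open-source faster-whisper library directly with the same model and compute type as the config above. It illustrates what the backend does, not how the Speech Gateway invokes it internally; the audio filename is hypothetical.
```python
# Direct use of the faster-whisper library (reference only, not voice-gateway code).
from faster_whisper import WhisperModel

# Mirrors engine: faster-whisper, model: large-v3, gpu: true, compute_type: float16
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("utterance.wav", beam_size=5, language="en")
print("Detected language:", info.language)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```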
Voice Activity Detection (VAD)
VAD is critical for determining when the caller is speaking and when they have finished. Poor VAD leads to either cut-off speech or long pauses.
How VAD Works
```
Audio Stream → Energy Detection → Speech Probability → Segmentation
                                          │
                              ┌───────────┴───────────┐
                         Speech Start             Speech End
                        (> threshold)     (silence > min_silence_ms)
```
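The segmentation logic can be sketched as a small state machine driven by per-frame speech probabilities and the VAD parameters from the configuration. The sketch below is illustrative rather than the gateway's actual implementation; the frame size and `speech_prob()` scoring function are assumptions.
```python
# Illustrative VAD segmentation state machine (not the gateway's implementation).
# Assumes fixed-size frames and a per-frame speech probability in [0, 1].
FRAME_MS = 20          # hypothetical frame size
THRESHOLD = 0.5        # speech probability threshold
MIN_SPEECH_MS = 250    # minimum speech before an utterance starts
MIN_SILENCE_MS = 500   # silence needed to end the utterance
PADDING_MS = 200       # context kept around each segment

def segment(frames, speech_prob):
    """Yield (start_frame, end_frame) utterance boundaries, padded on both sides."""
    pad = PADDING_MS // FRAME_MS
    speech_run = silence_run = 0
    start = None
    for i, frame in enumerate(frames):
        if speech_prob(frame) >= THRESHOLD:
            speech_run, silence_run = speech_run + 1, 0
            if start is None and speech_run * FRAME_MS >= MIN_SPEECH_MS:
                start = i - speech_run + 1            # first frame of the speech run
        else:
            silence_run, speech_run = silence_run + 1, 0
            if start is not None and silence_run * FRAME_MS >= MIN_SILENCE_MS:
                end = i - silence_run + 1             # one past the last speech frame
                yield (max(0, start - pad), min(len(frames), end + pad))
                start = None
    if start is not None:                             # audio ended mid-speech
        yield (max(0, start - pad), len(frames))
```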
Tuning VAD
| Parameter | Effect of Increase | Effect of Decrease |
|---|---|---|
| `threshold` | Requires louder speech to trigger | Triggers on quieter speech, more false positives |
| `min_speech_ms` | Ignores short sounds (clicks, pops) | Captures very short utterances |
| `min_silence_ms` | Waits longer before ending utterance | Ends utterance faster, may split long pauses |
| `padding_ms` | More context around speech | Less context, may clip edges |
Recommended settings by environment:
```yaml
# Quiet office environment
vad:
  threshold: 0.4
  min_speech_ms: 200
  min_silence_ms: 400
```
```yaml
# Noisy call center
vad:
  threshold: 0.7
  min_speech_ms: 300
  min_silence_ms: 600
```
```yaml
# IVR with short commands
vad:
  threshold: 0.5
  min_speech_ms: 150
  min_silence_ms: 300
```
Partial vs Final Transcripts
Partial Transcripts
Emitted during active speech at `partial_interval_ms` intervals. These are useful for:
- Displaying real-time captions
- Triggering early intent detection
- Providing visual feedback in admin UIs
```json
{
  "type": "partial",
  "text": "I need help with my",
  "confidence": 0.82,
  "timestamp_ms": 1234
}
```
Final Transcripts
Emitted after VAD detects end-of-utterance. These are the canonical transcription result:
```json
{
  "type": "final",
  "text": "I need help with my password reset.",
  "confidence": 0.94,
  "language": "en",
  "duration_ms": 2340,
  "segments": [
    {
      "text": "I need help with my password reset.",
      "start_ms": 0,
      "end_ms": 2340,
      "confidence": 0.94
    }
  ]
}
```
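A downstream consumer typically treats partials as disposable UI updates and finals as the authoritative record. The sketch below shows that dispatch; the message shapes follow the examples above, but the handler names are hypothetical.
```python
import json

def on_partial(text: str) -> None:
    """Hypothetical handler: update live captions, perhaps trigger early intent detection."""
    print(f"[partial] {text}")

def on_final(text: str, confidence: float) -> None:
    """Hypothetical handler: persist the canonical transcript and advance the dialog."""
    print(f"[final]   {text} (confidence {confidence:.2f})")

def handle_transcript_event(raw: str) -> None:
    event = json.loads(raw)
    if event["type"] == "partial":
        on_partial(event["text"])
    elif event["type"] == "final":
        on_final(event["text"], event["confidence"])

handle_transcript_event('{"type": "partial", "text": "I need help with my", "confidence": 0.82, "timestamp_ms": 1234}')
handle_transcript_event('{"type": "final", "text": "I need help with my password reset.", "confidence": 0.94, "language": "en", "duration_ms": 2340, "segments": []}')
```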
Text-to-Speech (TTS)
Piper TTS
The default TTS engine is Piper, a fast, local text-to-speech system.
```bash
# Download a voice model
voice-gateway tts download en_US-amy-medium

# List available voices
voice-gateway tts voices

# Test TTS output
voice-gateway tts speak "Hello, this is a test."
```
TTS Configuration
```yaml
speech:
  tts:
    engine: piper
    voice: en_US-amy-medium
    sample_rate: 22050
    speed: 1.0              # 0.5 = half speed, 2.0 = double speed
    sentence_silence: 0.3   # Silence between sentences (seconds)
```
Streaming TTS
TTS audio is streamed back to the caller as it is generated, reducing perceived latency:
- Text is split into sentences
- Each sentence is synthesized independently
- Audio chunks are streamed to the Media Gateway in real-time
- The caller hears the first sentence while later sentences are still being generated
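The pipeline above can be sketched as a generator that splits text into sentences, synthesizes each one, and yields audio as soon as it is ready. The `synthesize()` stub below stands in for the Piper call and is hypothetical; the sentence-at-a-time streaming structure is the point.
```python
import re

def synthesize(sentence: str) -> bytes:
    """Hypothetical stand-in for the Piper call; returns audio for one sentence."""
    return sentence.encode("utf-8")  # placeholder audio

def stream_tts(text: str):
    """Split text into sentences and yield audio as each one is synthesized."""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield synthesize(sentence)  # the caller hears this while the rest renders

for chunk in stream_tts("Thanks for calling. How can I help you today?"):
    pass  # forward each chunk to the Media Gateway as it arrives
```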
Worker Pool
The Speech Gateway uses a worker pool to process concurrent calls:
```
┌──────────────┐
│    Call 1    │ → Worker 1 (ASR) → Transcript
│    Call 2    │ → Worker 2 (ASR) → Transcript
│    Call 3    │ → Worker 3 (ASR) → Transcript
│    Call 4    │ → [Queued]
└──────────────┘
```
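Conceptually this is a fixed-size pool fed by a bounded queue: up to `max_workers` segments are transcribed in parallel, up to `queue_depth` more wait, and anything beyond that is rejected so callers can shed load. A minimal sketch, not the gateway's actual implementation:
```python
# Illustrative worker pool with bounded queueing (not the gateway's implementation).
import queue
import threading

MAX_WORKERS = 4   # mirrors speech.max_workers
QUEUE_DEPTH = 10  # mirrors speech.queue_depth

segments: "queue.Queue[bytes]" = queue.Queue(maxsize=QUEUE_DEPTH)

def transcribe_segment(segment: bytes) -> str:
    """Placeholder for the per-segment ASR call."""
    return f"<transcript of {len(segment)} bytes>"

def worker() -> None:
    while True:
        segment = segments.get()          # blocks until a segment is queued
        print(transcribe_segment(segment))
        segments.task_done()

for _ in range(MAX_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

def submit(segment: bytes) -> None:
    """Queue a segment; raise if queue_depth is exceeded so callers can back off."""
    try:
        segments.put_nowait(segment)
    except queue.Full:
        raise RuntimeError("ASR queue full; reduce load or raise max_workers")
```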
Sizing the Worker Pool
| GPU | Model | Recommended Workers | Max Concurrent Calls |
|---|---|---|---|
| None (CPU only) | whisper-base | 2 | 5-8 |
| None (CPU only) | whisper-medium | 1 | 2-3 |
| NVIDIA T4 | whisper-medium | 4 | 15-20 |
| NVIDIA A100 | whisper-large-v3 | 8 | 40-60 |
Metrics
| Metric | Type | Description |
|---|---|---|
| `vg_speech_asr_latency_seconds` | Histogram | Time from audio to transcript |
| `vg_speech_transcriptions_total` | Counter | Total transcriptions completed |
| `vg_speech_active_workers` | Gauge | Currently active ASR workers |
| `vg_speech_queue_depth` | Gauge | Audio segments waiting for processing |
| `vg_speech_tts_latency_seconds` | Histogram | TTS generation latency |
| `vg_speech_vad_false_positives` | Counter | VAD false positive triggers |
| `vg_speech_gpu_utilization` | Gauge | GPU utilization percentage |
Troubleshooting
High ASR latency
- Check GPU utilization — switch to a GPU if on CPU
- Use a smaller model (whisper-base for testing)
- Reduce the beam size (`beam_size: 3`)
- Increase the worker pool size
Transcripts are cut off
- Increase `min_silence_ms` to wait longer before ending the utterance
- Increase `padding_ms` to capture more audio context
- Check that the VAD `threshold` isn’t too high
Poor transcript quality
- Use a larger model (whisper-medium or whisper-large-v3)
- Set `temperature: 0.0` for deterministic output
- Set `language` explicitly rather than auto-detecting
- Check audio quality — packet loss degrades ASR accuracy
Next Steps
- Conversation Runtime — build dialog flows on transcripts
- Media Gateway — configure the audio source
- Observability — monitor ASR performance