Speech API
Direct access to the local ASR and TTS engines via REST API and WebSocket.
The Speech API provides direct access to voicetyped’s local ASR and TTS engines. Use this API for testing, QA, custom integrations, or building speech-enabled applications that bypass the call pipeline entirely.
All endpoints are served from the Speech Gateway on port 8080 by default. Requests and responses use JSON with snake_case field names.
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/speech/transcribe | Transcribe a complete audio file |
| WebSocket | /v1/speech/stream | Real-time streaming ASR |
| POST | /v1/speech/synthesize | Text-to-speech synthesis |
| GET | /v1/speech/models | List available ASR models |
| GET | /v1/speech/voices | List available TTS voices |
POST /v1/speech/transcribe
Transcribe a complete audio file. Accepts multipart form data with an audio file and optional parameters.
Request
Send a multipart/form-data request with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| audio | file | yes | Audio file (WAV, FLAC, OGG, MP3) |
| language | string | no | Language hint, e.g. en, de, fr |
| word_timestamps | boolean | no | Include word-level timing in response |
curl -X POST "http://localhost:8080/v1/speech/transcribe" \
-F "[email protected]" \
-F "language=en" \
-F "word_timestamps=true"
Response
{
"text": "Hello, I'd like to check the status of my order.",
"confidence": 0.94,
"language": "en",
"duration_ms": 3200,
"segments": [
{
"text": "Hello, I'd like to check the status of my order.",
"start_ms": 120,
"end_ms": 3100,
"confidence": 0.94,
"words": [
{ "word": "Hello", "start_ms": 120, "end_ms": 450, "confidence": 0.98 },
{ "word": "I'd", "start_ms": 510, "end_ms": 640, "confidence": 0.95 },
{ "word": "like", "start_ms": 660, "end_ms": 820, "confidence": 0.96 },
{ "word": "to", "start_ms": 840, "end_ms": 920, "confidence": 0.97 },
{ "word": "check", "start_ms": 950, "end_ms": 1120, "confidence": 0.93 },
{ "word": "the", "start_ms": 1140, "end_ms": 1220, "confidence": 0.97 },
{ "word": "status", "start_ms": 1250, "end_ms": 1520, "confidence": 0.91 },
{ "word": "of", "start_ms": 1540, "end_ms": 1620, "confidence": 0.96 },
{ "word": "my", "start_ms": 1650, "end_ms": 1780, "confidence": 0.95 },
{ "word": "order", "start_ms": 1810, "end_ms": 3100, "confidence": 0.92 }
]
}
]
}
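The same request can be issued from Node.js. The following is a minimal sketch assuming Node 18+ (built-in fetch, FormData, and Blob), a local file named recording.wav, and a script run as an ES module:

import { readFile } from "node:fs/promises";

// Build the multipart form described in the table above.
const form = new FormData();
form.append("audio", new Blob([await readFile("recording.wav")], { type: "audio/wav" }), "recording.wav");
form.append("language", "en");
form.append("word_timestamps", "true");

const response = await fetch("http://localhost:8080/v1/speech/transcribe", {
  method: "POST",
  body: form,
});

const result = await response.json();
console.log(result.text, result.confidence);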
WebSocket /v1/speech/stream
Real-time streaming ASR over WebSocket. Connect to the endpoint, send a configuration message, then stream binary audio frames. The server responds with JSON transcript messages as speech is recognized.
Connection
ws://localhost:8080/v1/speech/stream
Configuration Message
Send a JSON configuration message immediately after connecting:
{
"type": "config",
"encoding": "pcm16",
"sample_rate": 16000,
"channels": 1
}
| Field | Type | Description |
|---|---|---|
| type | string | Must be "config" |
| encoding | string | pcm16, float32, or opus |
| sample_rate | integer | Sample rate in Hz: 8000, 16000, 44100, 48000 |
| channels | integer | 1 (mono) or 2 (stereo) |
After sending the config message, send binary WebSocket frames containing raw audio data. The recommended chunk size is 100 ms of audio (3,200 bytes for 16 kHz 16-bit mono: 16,000 samples/s × 0.1 s × 2 bytes/sample).
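As an illustration of this framing, here is a minimal Node.js sketch that streams a prerecorded raw PCM file in 100 ms chunks. It assumes the third-party ws package, a file named input.pcm containing 16 kHz 16-bit mono samples, and Node 18+ running the script as an ES module; since no end-of-stream message is documented here, it simply closes the socket shortly after the last chunk.

import { readFile } from "node:fs/promises";
import WebSocket from "ws"; // assumption: the ws package is used as the Node WebSocket client

const CHUNK_BYTES = 16000 / 10 * 2; // 100 ms of 16 kHz 16-bit mono = 3200 bytes

const pcm = await readFile("input.pcm");
const ws = new WebSocket("ws://localhost:8080/v1/speech/stream");

ws.on("open", () => {
  // Configuration message first, then binary audio frames.
  ws.send(JSON.stringify({ type: "config", encoding: "pcm16", sample_rate: 16000, channels: 1 }));

  let offset = 0;
  const timer = setInterval(() => {
    if (offset >= pcm.length) {
      clearInterval(timer);
      // No explicit end-of-stream message is documented, so wait briefly for
      // final transcripts and then close (assumption).
      setTimeout(() => ws.close(), 2000);
      return;
    }
    ws.send(pcm.subarray(offset, offset + CHUNK_BYTES)); // one 100 ms binary frame
    offset += CHUNK_BYTES;
  }, 100); // pace chunks roughly in real time
});

// Transcript messages (described next) arrive as JSON text frames.
ws.on("message", (data) => console.log(JSON.parse(data.toString())));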
Transcript Messages
The server sends JSON messages as transcription results become available:
{
"is_final": false,
"text": "hello I'd like to",
"confidence": 0.82,
"language": "en",
"start_ms": 120,
"end_ms": 1780
}
Final results include word-level timing:
{
"is_final": true,
"text": "Hello, I'd like to check the status of my order.",
"confidence": 0.94,
"language": "en",
"start_ms": 120,
"end_ms": 3100,
"words": [
{ "word": "Hello", "start_ms": 120, "end_ms": 450, "confidence": 0.98 },
{ "word": "I'd", "start_ms": 510, "end_ms": 640, "confidence": 0.95 },
{ "word": "like", "start_ms": 660, "end_ms": 820, "confidence": 0.96 },
{ "word": "to", "start_ms": 840, "end_ms": 920, "confidence": 0.97 },
{ "word": "check", "start_ms": 950, "end_ms": 1120, "confidence": 0.93 },
{ "word": "the", "start_ms": 1140, "end_ms": 1220, "confidence": 0.97 },
{ "word": "status", "start_ms": 1250, "end_ms": 1520, "confidence": 0.91 },
{ "word": "of", "start_ms": 1540, "end_ms": 1620, "confidence": 0.96 },
{ "word": "my", "start_ms": 1650, "end_ms": 1780, "confidence": 0.95 },
{ "word": "order", "start_ms": 1810, "end_ms": 3100, "confidence": 0.92 }
]
}
JavaScript Example
const ws = new WebSocket("ws://localhost:8080/v1/speech/stream");
ws.addEventListener("open", () => {
// Send audio configuration
ws.send(JSON.stringify({
type: "config",
encoding: "pcm16",
sample_rate: 16000,
channels: 1,
}));
// Stream audio from the microphone
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
const audioCtx = new AudioContext({ sampleRate: 16000 });
const source = audioCtx.createMediaStreamSource(stream);
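// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet;
// it is used here only to keep the example short.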
const processor = audioCtx.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (event) => {
const float32 = event.inputBuffer.getChannelData(0);
const pcm16 = new Int16Array(float32.length);
for (let i = 0; i < float32.length; i++) {
pcm16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
}
ws.send(pcm16.buffer);
};
source.connect(processor);
processor.connect(audioCtx.destination);
});
});
ws.addEventListener("message", (event) => {
const transcript = JSON.parse(event.data);
if (transcript.is_final) {
console.log(`Final: ${transcript.text} (${(transcript.confidence * 100).toFixed(0)}%)`);
} else {
console.log(`Partial: ${transcript.text}`);
}
});
ws.addEventListener("close", () => {
console.log("Stream closed");
});
ws.addEventListener("error", (err) => {
console.error("WebSocket error:", err);
});
POST /v1/speech/synthesize
Convert text to speech. Returns the audio as a binary response with the appropriate Content-Type header.
Request
{
"text": "Hello, your ticket has been created successfully.",
"voice": "en_US-amy-medium",
"speed": 1.0,
"format": "wav"
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | yes | Text to synthesize |
| voice | string | no | Voice model name (defaults to system default) |
| speed | number | no | Speed multiplier, default 1.0 |
| format | string | no | Output format: wav, mp3, ogg, raw (default wav) |
curl -X POST "http://localhost:8080/v1/speech/synthesize" \
-H "Content-Type: application/json" \
-d '{"text":"Hello world","voice":"en_US-amy-medium"}' \
--output greeting.wav
Response
The response body is the raw audio binary. The Content-Type header indicates the format:
| Format | Content-Type |
|---|---|
| wav | audio/wav |
| mp3 | audio/mpeg |
| ogg | audio/ogg |
| raw | audio/pcm |
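For reference, here is a minimal Node.js sketch (assuming Node 18+ with built-in fetch, run as an ES module) that performs the same request as the curl example above and writes the WAV response to disk:

import { writeFile } from "node:fs/promises";

const response = await fetch("http://localhost:8080/v1/speech/synthesize", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "Hello, your ticket has been created successfully.",
    voice: "en_US-amy-medium",
    format: "wav",
  }),
});

// The response body is the raw audio; write it out unchanged.
await writeFile("greeting.wav", Buffer.from(await response.arrayBuffer()));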
GET /v1/speech/models
List available ASR models.
curl "http://localhost:8080/v1/speech/models"
Response
{
"models": [
{
"name": "whisper-medium",
"description": "General-purpose multilingual ASR model",
"size_bytes": 1533001728,
"loaded": true,
"languages": ["en", "de", "fr", "es", "it", "pt", "nl", "ja", "zh"]
},
{
"name": "whisper-small",
"description": "Lightweight ASR model for low-resource environments",
"size_bytes": 487997440,
"loaded": false,
"languages": ["en", "de", "fr", "es"]
}
]
}
GET /v1/speech/voices
List available TTS voices. Optionally filter by language.
curl "http://localhost:8080/v1/speech/voices?language=en"
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| language | string | no | Filter voices by language code |
Response
{
"voices": [
{
"name": "en_US-amy-medium",
"language": "en",
"gender": "female",
"sample_rate": 22050,
"quality": "medium"
},
{
"name": "en_US-joe-medium",
"language": "en",
"gender": "male",
"sample_rate": 22050,
"quality": "medium"
},
{
"name": "en_GB-alba-medium",
"language": "en",
"gender": "female",
"sample_rate": 22050,
"quality": "medium"
}
]
}
Error Responses
All endpoints return errors in a consistent JSON format:
{
"error": {
"code": "invalid_audio_format",
"message": "Unsupported audio encoding. Expected WAV, FLAC, OGG, or MP3."
}
}
Common HTTP status codes:
| Status | Description |
|---|---|
| 400 | Bad request — missing or invalid parameters |
| 404 | Model or voice not found |
| 413 | Audio file too large |
| 422 | Audio could not be processed |
| 500 | Internal server error |
| 503 | Model not loaded or engine unavailable |
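A client can branch on response.ok and surface the structured error body. The sketch below assumes a request for a voice name that does not exist (a hypothetical value), which per the table above would return a 404 with an error object:

const response = await fetch("http://localhost:8080/v1/speech/synthesize", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello world", voice: "does-not-exist" }), // hypothetical voice name
});

if (!response.ok) {
  // All error responses share the { error: { code, message } } shape shown above.
  const { error } = await response.json();
  console.error(`Request failed (${response.status}): ${error.code}: ${error.message}`);
}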
CLI Tools
The voice-gateway CLI provides convenience commands that use this API:
# Transcribe a file
voice-gateway transcribe recording.wav
# Transcribe from microphone
voice-gateway transcribe --mic
# Synthesize text to audio
voice-gateway speak "Hello world" --output greeting.wav
# List models
voice-gateway model list
# List voices
voice-gateway tts voices
Next Steps
- Call Event Stream API — subscribe to call events
- Dialog Hooks API — implement backend services
- Speech Gateway — configure ASR tuning