Speech API
Direct access to the local ASR and TTS engines via REST API and WebSocket.
The Speech API provides direct access to voicetyped’s local ASR and TTS engines. Use this API for testing, QA, custom integrations, or building speech-enabled applications that bypass the call pipeline entirely.
All endpoints are served from the Speech Gateway on port 8080 by default. Requests and responses use JSON with snake_case field names.
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/speech/transcribe | Transcribe a complete audio file |
| WebSocket | /v1/speech/stream | Real-time streaming ASR |
| POST | /v1/speech/synthesize | Text-to-speech synthesis |
| GET | /v1/speech/models | List available ASR models |
| GET | /v1/speech/voices | List available TTS voices |
POST /v1/speech/transcribe
Transcribe a complete audio file. Accepts multipart form data with an audio file and optional parameters.
Request
Send a multipart/form-data request with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| audio | file | yes | Audio file (WAV, FLAC, OGG, MP3) |
| language | string | no | Language hint, e.g. en, de, fr |
| word_timestamps | boolean | no | Include word-level timing in response |
curl -X POST "http://localhost:8080/v1/speech/transcribe" \
-F "[email protected]" \
-F "language=en" \
-F "word_timestamps=true"
Response
{
"text": "Hello, I'd like to check the status of my order.",
"confidence": 0.94,
"language": "en",
"duration_ms": 3200,
"segments": [
{
"text": "Hello, I'd like to check the status of my order.",
"start_ms": 120,
"end_ms": 3100,
"confidence": 0.94,
"words": [
{ "word": "Hello", "start_ms": 120, "end_ms": 450, "confidence": 0.98 },
{ "word": "I'd", "start_ms": 510, "end_ms": 640, "confidence": 0.95 },
{ "word": "like", "start_ms": 660, "end_ms": 820, "confidence": 0.96 },
{ "word": "to", "start_ms": 840, "end_ms": 920, "confidence": 0.97 },
{ "word": "check", "start_ms": 950, "end_ms": 1120, "confidence": 0.93 },
{ "word": "the", "start_ms": 1140, "end_ms": 1220, "confidence": 0.97 },
{ "word": "status", "start_ms": 1250, "end_ms": 1520, "confidence": 0.91 },
{ "word": "of", "start_ms": 1540, "end_ms": 1620, "confidence": 0.96 },
{ "word": "my", "start_ms": 1650, "end_ms": 1780, "confidence": 0.95 },
{ "word": "order", "start_ms": 1810, "end_ms": 3100, "confidence": 0.92 }
]
}
]
}
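The same request can be issued from Node.js. The following is a minimal sketch assuming Node 18+ (built-in fetch, FormData, and Blob), a local file named recording.wav, and a script run as an ES module:

import { readFile } from "node:fs/promises";

// Build the multipart form described in the table above.
const form = new FormData();
form.append("audio", new Blob([await readFile("recording.wav")], { type: "audio/wav" }), "recording.wav");
form.append("language", "en");
form.append("word_timestamps", "true");

const response = await fetch("http://localhost:8080/v1/speech/transcribe", {
  method: "POST",
  body: form,
});

const result = await response.json();
console.log(result.text, result.confidence);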
WebSocket /v1/speech/stream
Real-time streaming ASR over WebSocket. Connect to the endpoint, send a configuration message, then stream binary audio frames. The server responds with JSON transcript messages as speech is recognized.
Connection
ws://localhost:8080/v1/speech/stream
Configuration Message
Send a JSON configuration message immediately after connecting:
{
"type": "config",
"encoding": "pcm16",
"sample_rate": 16000,
"channels": 1
}
| Field | Type | Description |
|---|---|---|
| type | string | Must be "config" |
| encoding | string | pcm16, float32, or opus |
| sample_rate | integer | Sample rate in Hz: 8000, 16000, 44100, 48000 |
| channels | integer | 1 (mono) or 2 (stereo) |
After sending the config message, send binary WebSocket frames containing raw audio data. The recommended chunk size is 100 ms of audio (3,200 bytes for 16 kHz 16-bit mono: 16,000 samples/s × 0.1 s × 2 bytes/sample).
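As an illustration of this framing, here is a minimal Node.js sketch that streams a prerecorded raw PCM file in 100 ms chunks. It assumes the third-party ws package, a file named input.pcm containing 16 kHz 16-bit mono samples, and Node 18+ running the script as an ES module; since no end-of-stream message is documented here, it simply closes the socket shortly after the last chunk.

import { readFile } from "node:fs/promises";
import WebSocket from "ws"; // assumption: the ws package is used as the Node WebSocket client

const CHUNK_BYTES = 16000 / 10 * 2; // 100 ms of 16 kHz 16-bit mono = 3200 bytes

const pcm = await readFile("input.pcm");
const ws = new WebSocket("ws://localhost:8080/v1/speech/stream");

ws.on("open", () => {
  // Configuration message first, then binary audio frames.
  ws.send(JSON.stringify({ type: "config", encoding: "pcm16", sample_rate: 16000, channels: 1 }));

  let offset = 0;
  const timer = setInterval(() => {
    if (offset >= pcm.length) {
      clearInterval(timer);
      // No explicit end-of-stream message is documented, so wait briefly for
      // final transcripts and then close (assumption).
      setTimeout(() => ws.close(), 2000);
      return;
    }
    ws.send(pcm.subarray(offset, offset + CHUNK_BYTES)); // one 100 ms binary frame
    offset += CHUNK_BYTES;
  }, 100); // pace chunks roughly in real time
});

// Transcript messages (described next) arrive as JSON text frames.
ws.on("message", (data) => console.log(JSON.parse(data.toString())));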
Transcript Messages
The server sends JSON messages as transcription results become available:
{
"is_final": false,
"text": "hello I'd like to",
"confidence": 0.82,
"language": "en",
"start_ms": 120,
"end_ms": 1780
}
Final results include word-level timing:
{
"is_final": true,
"text": "Hello, I'd like to check the status of my order.",
"confidence": 0.94,
"language": "en",
"start_ms": 120,
"end_ms": 3100,
"words": [
{ "word": "Hello", "start_ms": 120, "end_ms": 450, "confidence": 0.98 },
{ "word": "I'd", "start_ms": 510, "end_ms": 640, "confidence": 0.95 },
{ "word": "like", "start_ms": 660, "end_ms": 820, "confidence": 0.96 },
{ "word": "to", "start_ms": 840, "end_ms": 920, "confidence": 0.97 },
{ "word": "check", "start_ms": 950, "end_ms": 1120, "confidence": 0.93 },
{ "word": "the", "start_ms": 1140, "end_ms": 1220, "confidence": 0.97 },
{ "word": "status", "start_ms": 1250, "end_ms": 1520, "confidence": 0.91 },
{ "word": "of", "start_ms": 1540, "end_ms": 1620, "confidence": 0.96 },
{ "word": "my", "start_ms": 1650, "end_ms": 1780, "confidence": 0.95 },
{ "word": "order", "start_ms": 1810, "end_ms": 3100, "confidence": 0.92 }
]
}
JavaScript Example
const ws = new WebSocket("ws://localhost:8080/v1/speech/stream");
ws.addEventListener("open", () => {
// Send audio configuration
ws.send(JSON.stringify({
type: "config",
encoding: "pcm16",
sample_rate: 16000,
channels: 1,
}));
// Stream audio from the microphone
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
const audioCtx = new AudioContext({ sampleRate: 16000 });
const source = audioCtx.createMediaStreamSource(stream);
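// Note: ScriptProcessorNode is deprecated in favor of AudioWorklet;
// it is used here only to keep the example short.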
const processor = audioCtx.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (event) => {
const float32 = event.inputBuffer.getChannelData(0);
const pcm16 = new Int16Array(float32.length);
for (let i = 0; i < float32.length; i++) {
pcm16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
}
ws.send(pcm16.buffer);
};
source.connect(processor);
processor.connect(audioCtx.destination);
});
});
ws.addEventListener("message", (event) => {
const transcript = JSON.parse(event.data);
if (transcript.is_final) {
console.log(`Final: ${transcript.text} (${(transcript.confidence * 100).toFixed(0)}%)`);
} else {
console.log(`Partial: ${transcript.text}`);
}
});
ws.addEventListener("close", () => {
console.log("Stream closed");
});
ws.addEventListener("error", (err) => {
console.error("WebSocket error:", err);
});
POST /v1/speech/synthesize
Convert text to speech. Returns the audio as a binary response with the appropriate Content-Type header.
Request
{
"text": "Hello, your ticket has been created successfully.",
"voice": "en_US-amy-medium",
"speed": 1.0,
"format": "wav"
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | yes | Text to synthesize |
| voice | string | no | Voice model name (defaults to system default) |
| speed | number | no | Speed multiplier, default 1.0 |
| format | string | no | Output format: wav, mp3, ogg, raw (default wav) |
curl -X POST "http://localhost:8080/v1/speech/synthesize" \
-H "Content-Type: application/json" \
-d '{"text":"Hello world","voice":"en_US-amy-medium"}' \
--output greeting.wav
Response
The response body is the raw audio binary. The Content-Type header indicates the format:
| Format | Content-Type |
|---|---|
| wav | audio/wav |
| mp3 | audio/mpeg |
| ogg | audio/ogg |
| raw | audio/pcm |
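For reference, here is a minimal Node.js sketch (assuming Node 18+ with built-in fetch, run as an ES module) that performs the same request as the curl example above and writes the WAV response to disk:

import { writeFile } from "node:fs/promises";

const response = await fetch("http://localhost:8080/v1/speech/synthesize", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "Hello, your ticket has been created successfully.",
    voice: "en_US-amy-medium",
    format: "wav",
  }),
});

// The response body is the raw audio; write it out unchanged.
await writeFile("greeting.wav", Buffer.from(await response.arrayBuffer()));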
GET /v1/speech/models
List available ASR models.
curl "http://localhost:8080/v1/speech/models"
Response
{
"models": [
{
"name": "whisper-medium",
"description": "General-purpose multilingual ASR model",
"size_bytes": 1533001728,
"loaded": true,
"languages": ["en", "de", "fr", "es", "it", "pt", "nl", "ja", "zh"]
},
{
"name": "whisper-small",
"description": "Lightweight ASR model for low-resource environments",
"size_bytes": 487997440,
"loaded": false,
"languages": ["en", "de", "fr", "es"]
}
]
}
GET /v1/speech/voices
List available TTS voices. Optionally filter by language.
curl "http://localhost:8080/v1/speech/voices?language=en"
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| language | string | no | Filter voices by language code |
Response
{
"voices": [
{
"name": "en_US-amy-medium",
"language": "en",
"gender": "female",
"sample_rate": 22050,
"quality": "medium"
},
{
"name": "en_US-joe-medium",
"language": "en",
"gender": "male",
"sample_rate": 22050,
"quality": "medium"
},
{
"name": "en_GB-alba-medium",
"language": "en",
"gender": "female",
"sample_rate": 22050,
"quality": "medium"
}
]
}
Error Responses
All endpoints return errors in a consistent JSON format:
{
"error": {
"code": "invalid_audio_format",
"message": "Unsupported audio encoding. Expected WAV, FLAC, OGG, or MP3."
}
}
Common HTTP status codes:
| Status | Description |
|---|---|
| 400 | Bad request — missing or invalid parameters |
| 404 | Model or voice not found |
| 413 | Audio file too large |
| 422 | Audio could not be processed |
| 500 | Internal server error |
| 503 | Model not loaded or engine unavailable |
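A client can branch on response.ok and surface the structured error body. The sketch below assumes a request for a voice name that does not exist (a hypothetical value), which per the table above would return a 404 with an error object:

const response = await fetch("http://localhost:8080/v1/speech/synthesize", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello world", voice: "does-not-exist" }), // hypothetical voice name
});

if (!response.ok) {
  // All error responses share the { error: { code, message } } shape shown above.
  const { error } = await response.json();
  console.error(`Request failed (${response.status}): ${error.code}: ${error.message}`);
}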
CLI Tools
The voice-gateway CLI provides convenience commands that use this API:
# Transcribe a file
voice-gateway transcribe recording.wav
# Transcribe from microphone
voice-gateway transcribe --mic
# Synthesize text to audio
voice-gateway speak "Hello world" --output greeting.wav
# List models
voice-gateway model list
# List voices
voice-gateway tts voices
Next Steps
- Call Event Stream API — subscribe to call events
- Dialog Hooks API — implement backend services
- Speech Gateway — configure ASR tuning