Speech API

Direct access to the local ASR and TTS engines over REST and WebSocket.

The Speech API provides direct access to voicetyped’s local ASR and TTS engines. Use this API for testing, QA, custom integrations, or building speech-enabled applications that bypass the call pipeline entirely.

All endpoints are served from the Speech Gateway on port 8080 by default. Requests and responses use JSON with snake_case field names.

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/speech/transcribe | Transcribe a complete audio file |
| WebSocket | /v1/speech/stream | Real-time streaming ASR |
| POST | /v1/speech/synthesize | Text-to-speech synthesis |
| GET | /v1/speech/models | List available ASR models |
| GET | /v1/speech/voices | List available TTS voices |

POST /v1/speech/transcribe

Transcribe a complete audio file. Accepts multipart form data with an audio file and optional parameters.

Request

Send a multipart/form-data request with the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio | file | yes | Audio file (WAV, FLAC, OGG, MP3) |
| language | string | no | Language hint, e.g. en, de, fr |
| word_timestamps | boolean | no | Include word-level timing in the response |

curl -X POST "http://localhost:8080/v1/speech/transcribe" \
  -F "[email protected]" \
  -F "language=en" \
  -F "word_timestamps=true"

Response

{
  "text": "Hello, I'd like to check the status of my order.",
  "confidence": 0.94,
  "language": "en",
  "duration_ms": 3200,
  "segments": [
    {
      "text": "Hello, I'd like to check the status of my order.",
      "start_ms": 120,
      "end_ms": 3100,
      "confidence": 0.94,
      "words": [
        { "word": "Hello", "start_ms": 120, "end_ms": 450, "confidence": 0.98 },
        { "word": "I'd", "start_ms": 510, "end_ms": 640, "confidence": 0.95 },
        { "word": "like", "start_ms": 660, "end_ms": 820, "confidence": 0.96 },
        { "word": "to", "start_ms": 840, "end_ms": 920, "confidence": 0.97 },
        { "word": "check", "start_ms": 950, "end_ms": 1120, "confidence": 0.93 },
        { "word": "the", "start_ms": 1140, "end_ms": 1220, "confidence": 0.97 },
        { "word": "status", "start_ms": 1250, "end_ms": 1520, "confidence": 0.91 },
        { "word": "of", "start_ms": 1540, "end_ms": 1620, "confidence": 0.96 },
        { "word": "my", "start_ms": 1650, "end_ms": 1780, "confidence": 0.95 },
        { "word": "order", "start_ms": 1810, "end_ms": 3100, "confidence": 0.92 }
      ]
    }
  ]
}
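
The same request can be issued from code. The sketch below assumes Node.js 18+ (global fetch, FormData, and Blob), run as an ES module so top-level await works; the file name recording.wav matches the curl example above.

import { readFile } from "node:fs/promises";

// Build a multipart/form-data body with the audio file and optional fields.
const audioBytes = await readFile("recording.wav");
const form = new FormData();
form.append("audio", new Blob([audioBytes], { type: "audio/wav" }), "recording.wav");
form.append("language", "en");
form.append("word_timestamps", "true");

const response = await fetch("http://localhost:8080/v1/speech/transcribe", {
  method: "POST",
  body: form, // fetch sets the multipart boundary header automatically
});

const { text, confidence } = await response.json();
console.log(`${text} (confidence ${confidence})`);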

WebSocket /v1/speech/stream

Real-time streaming ASR over WebSocket. Connect to the endpoint, send a configuration message, then stream binary audio frames. The server responds with JSON transcript messages as speech is recognized.

Connection

ws://localhost:8080/v1/speech/stream

Configuration Message

Send a JSON configuration message immediately after connecting:

{
  "type": "config",
  "encoding": "pcm16",
  "sample_rate": 16000,
  "channels": 1
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Must be "config" |
| encoding | string | pcm16, float32, or opus |
| sample_rate | integer | Sample rate in Hz: 8000, 16000, 44100, or 48000 |
| channels | integer | 1 (mono) or 2 (stereo) |

After sending the config message, send binary WebSocket frames containing raw audio data. The recommended chunk size is 100ms of audio (e.g., 3200 bytes for 16kHz 16-bit mono).
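
As a sketch of that chunking, the snippet below slices an already-captured 16 kHz, 16-bit mono buffer into 100 ms frames and sends each one as a binary frame. The pcmBuffer Uint8Array and the already-open, already-configured ws connection are assumed to exist; only the arithmetic reflects the API contract.

// 100 ms of 16 kHz, 16-bit (2-byte) mono audio = 16000 * 0.1 * 2 = 3200 bytes
const SAMPLE_RATE = 16000;
const BYTES_PER_SAMPLE = 2;
const CHUNK_BYTES = (SAMPLE_RATE / 10) * BYTES_PER_SAMPLE; // 3200

for (let offset = 0; offset < pcmBuffer.length; offset += CHUNK_BYTES) {
  // Each WebSocket binary frame carries one chunk of raw PCM.
  ws.send(pcmBuffer.subarray(offset, offset + CHUNK_BYTES));
}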

Transcript Messages

The server sends JSON messages as transcription results become available:

{
  "is_final": false,
  "text": "hello I'd like to",
  "confidence": 0.82,
  "language": "en",
  "start_ms": 120,
  "end_ms": 1780
}

Final results include word-level timing:

{
  "is_final": true,
  "text": "Hello, I'd like to check the status of my order.",
  "confidence": 0.94,
  "language": "en",
  "start_ms": 120,
  "end_ms": 3100,
  "words": [
    { "word": "Hello", "start_ms": 120, "end_ms": 450, "confidence": 0.98 },
    { "word": "I'd", "start_ms": 510, "end_ms": 640, "confidence": 0.95 },
    { "word": "like", "start_ms": 660, "end_ms": 820, "confidence": 0.96 },
    { "word": "to", "start_ms": 840, "end_ms": 920, "confidence": 0.97 },
    { "word": "check", "start_ms": 950, "end_ms": 1120, "confidence": 0.93 },
    { "word": "the", "start_ms": 1140, "end_ms": 1220, "confidence": 0.97 },
    { "word": "status", "start_ms": 1250, "end_ms": 1520, "confidence": 0.91 },
    { "word": "of", "start_ms": 1540, "end_ms": 1620, "confidence": 0.96 },
    { "word": "my", "start_ms": 1650, "end_ms": 1780, "confidence": 0.95 },
    { "word": "order", "start_ms": 1810, "end_ms": 3100, "confidence": 0.92 }
  ]
}

JavaScript Example

const ws = new WebSocket("ws://localhost:8080/v1/speech/stream");

ws.addEventListener("open", () => {
  // Send audio configuration
  ws.send(JSON.stringify({
    type: "config",
    encoding: "pcm16",
    sample_rate: 16000,
    channels: 1,
  }));

  // Stream audio from the microphone
  navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
    const audioCtx = new AudioContext({ sampleRate: 16000 });
    const source = audioCtx.createMediaStreamSource(stream);
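    // Note: ScriptProcessorNode is deprecated in favor of AudioWorklet; it is
    // used here only to keep the example self-contained.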
    const processor = audioCtx.createScriptProcessor(4096, 1, 1);

    processor.onaudioprocess = (event) => {
      const float32 = event.inputBuffer.getChannelData(0);
      const pcm16 = new Int16Array(float32.length);
      for (let i = 0; i < float32.length; i++) {
        pcm16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
      }
      ws.send(pcm16.buffer);
    };

    source.connect(processor);
    processor.connect(audioCtx.destination);
  });
});

ws.addEventListener("message", (event) => {
  const transcript = JSON.parse(event.data);
  if (transcript.is_final) {
    console.log(`Final: ${transcript.text} (${(transcript.confidence * 100).toFixed(0)}%)`);
  } else {
    console.log(`Partial: ${transcript.text}`);
  }
});

ws.addEventListener("close", () => {
  console.log("Stream closed");
});

ws.addEventListener("error", (err) => {
  console.error("WebSocket error:", err);
});

POST /v1/speech/synthesize

Convert text to speech. Returns binary audio with the appropriate Content-Type header.

Request

{
  "text": "Hello, your ticket has been created successfully.",
  "voice": "en_US-amy-medium",
  "speed": 1.0,
  "format": "wav"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | yes | Text to synthesize |
| voice | string | no | Voice model name (defaults to the system default) |
| speed | number | no | Speed multiplier (default 1.0) |
| format | string | no | Output format: wav, mp3, ogg, or raw (default wav) |

curl -X POST "http://localhost:8080/v1/speech/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello world","voice":"en_US-amy-medium"}' \
  --output greeting.wav

Response

The response body is the raw audio binary. The Content-Type header indicates the format:

| Format | Content-Type |
| --- | --- |
| wav | audio/wav |
| mp3 | audio/mpeg |
| ogg | audio/ogg |
| raw | audio/pcm |
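
The same call from code, as a sketch assuming Node.js 18+ run as an ES module; the output path greeting.wav is illustrative.

import { writeFile } from "node:fs/promises";

const response = await fetch("http://localhost:8080/v1/speech/synthesize", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "Hello, your ticket has been created successfully.",
    voice: "en_US-amy-medium",
    format: "wav",
  }),
});

// The Content-Type header reports which audio format came back.
console.log(response.headers.get("content-type")); // audio/wav

// Write the binary response body to disk.
await writeFile("greeting.wav", Buffer.from(await response.arrayBuffer()));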

GET /v1/speech/models

List available ASR models.

curl "http://localhost:8080/v1/speech/models"

Response

{
  "models": [
    {
      "name": "whisper-medium",
      "description": "General-purpose multilingual ASR model",
      "size_bytes": 1533001728,
      "loaded": true,
      "languages": ["en", "de", "fr", "es", "it", "pt", "nl", "ja", "zh"]
    },
    {
      "name": "whisper-small",
      "description": "Lightweight ASR model for low-resource environments",
      "size_bytes": 487997440,
      "loaded": false,
      "languages": ["en", "de", "fr", "es"]
    }
  ]
}
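
For scripting, a short sketch (Node.js 18+, ES module) that lists each model and whether it is currently loaded:

const response = await fetch("http://localhost:8080/v1/speech/models");
const { models } = await response.json();

// Print each model's name, load state, and supported languages.
for (const model of models) {
  console.log(`${model.name}: ${model.loaded ? "loaded" : "not loaded"} (${model.languages.join(", ")})`);
}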

GET /v1/speech/voices

List available TTS voices. Optionally filter by language.

curl "http://localhost:8080/v1/speech/voices?language=en"

Query Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| language | string | no | Filter voices by language code |

Response

{
  "voices": [
    {
      "name": "en_US-amy-medium",
      "language": "en",
      "gender": "female",
      "sample_rate": 22050,
      "quality": "medium"
    },
    {
      "name": "en_US-joe-medium",
      "language": "en",
      "gender": "male",
      "sample_rate": 22050,
      "quality": "medium"
    },
    {
      "name": "en_GB-alba-medium",
      "language": "en",
      "gender": "female",
      "sample_rate": 22050,
      "quality": "medium"
    }
  ]
}

Error Responses

All endpoints return errors in a consistent JSON format:

{
  "error": {
    "code": "invalid_audio_format",
    "message": "Unsupported audio encoding. Expected WAV, FLAC, OGG, or MP3."
  }
}

Common HTTP status codes:

| Status | Description |
| --- | --- |
| 400 | Bad request: missing or invalid parameters |
| 404 | Model or voice not found |
| 413 | Audio file too large |
| 422 | Audio could not be processed |
| 500 | Internal server error |
| 503 | Model not loaded or engine unavailable |
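
A minimal error-handling sketch (Node.js 18+, ES module), using the synthesize endpoint; the unknown voice name is only there to provoke an error response:

const response = await fetch("http://localhost:8080/v1/speech/synthesize", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello", voice: "does-not-exist" }),
});

if (!response.ok) {
  // Error bodies always follow the { "error": { "code", "message" } } shape shown above.
  const { error } = await response.json();
  throw new Error(`Speech API error ${response.status} (${error.code}): ${error.message}`);
}

// On success the body is binary audio, not JSON.
const audio = Buffer.from(await response.arrayBuffer());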

CLI Tools

The voice-gateway CLI provides convenience commands that use this API:

# Transcribe a file
voice-gateway transcribe recording.wav

# Transcribe from microphone
voice-gateway transcribe --mic

# Synthesize text to audio
voice-gateway speak "Hello world" --output greeting.wav

# List models
voice-gateway model list

# List voices
voice-gateway tts voices

Next Steps