Architecture Overview

How voicetyped's four core services work together to process voice calls.

voicetyped is composed of four main services that form a complete voice processing pipeline. Each service has a single responsibility and communicates with adjacent services via well-defined interfaces. This architecture enables independent scaling, testing, and replacement of any component.

High-Level Architecture

SIP / WebRTC
     │
     ▼
┌─────────────────┐
│  Media Gateway   │ ← SIP endpoint, RTP audio, codec handling
└────────┬────────┘
         │ PCM stream
         ▼
┌─────────────────┐
│ Speech Gateway   │ ← Local ASR (whisper.cpp), TTS
└────────┬────────┘
         │ Transcripts
         ▼
┌─────────────────┐
│  Conversation    │ ← Turn detection, dialog FSM, tool invocation
│  Runtime         │
└────────┬────────┘
         │ Actions
         ▼
┌─────────────────┐
│  Integration     │ ← REST/HTTP to customer backend
│  Gateway         │
└─────────────────┘
         │
         ▼
   Customer Backend

Service Interactions

Call Flow

When an inbound call arrives, the services interact in this sequence:

  1. Media Gateway receives the SIP INVITE, negotiates codecs, and begins extracting RTP audio
  2. Speech Gateway receives the PCM audio stream and begins producing transcripts
  3. Conversation Runtime receives transcript events and evaluates them against the active dialog FSM
  4. Integration Gateway executes any actions that require calling external systems
  5. Results flow back through the stack: Integration → Runtime → Speech (TTS) → Media → Caller
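The sequence above can be pictured as a chain of stages connected by streams. The following is only a sketch using Go channels in a single process; the real services talk over ConnectRPC, and every name here (`PCMChunk`, `Action`, `HandleCall`, the `stub*` functions) is a hypothetical stand-in rather than part of voicetyped.

```go
package main

import "fmt"

// PCMChunk and Action are hypothetical types for this sketch.
type PCMChunk []int16

type Action struct {
	Name string
	Arg  string
}

// source plays the role of the Media Gateway: it emits decoded audio chunks.
func source(chunks []PCMChunk) <-chan PCMChunk {
	out := make(chan PCMChunk)
	go func() {
		defer close(out)
		for _, c := range chunks {
			out <- c
		}
	}()
	return out
}

// stage applies fn to every item from in on its own goroutine.
func stage[I, O any](in <-chan I, fn func(I) O) <-chan O {
	out := make(chan O)
	go func() {
		defer close(out)
		for v := range in {
			out <- fn(v)
		}
	}()
	return out
}

// HandleCall wires the stages in the order of the numbered steps.
func HandleCall(chunks []PCMChunk, asr func(PCMChunk) string,
	decide func(string) Action, call func(Action) string) []string {
	transcripts := stage(source(chunks), asr) // steps 1-2: audio to transcripts
	actions := stage(transcripts, decide)     // step 3: dialog FSM picks an action
	replies := stage(actions, call)           // step 4: action executed
	var spoken []string
	for r := range replies { // step 5: results flow back toward TTS
		spoken = append(spoken, r)
	}
	return spoken
}

// Stubs standing in for ASR, the dialog FSM, and a customer backend.
func stubASR(c PCMChunk) string {
	if len(c) == 1 {
		return "hello"
	}
	return "check balance"
}

func stubDecide(text string) Action {
	if text == "check balance" {
		return Action{Name: "CallHook", Arg: "balance"}
	}
	return Action{Name: "PlayTTS", Arg: "How can I help?"}
}

func stubBackend(a Action) string {
	if a.Name == "CallHook" {
		return "result for " + a.Arg
	}
	return a.Arg
}

func main() {
	spoken := HandleCall([]PCMChunk{{0}, {0, 0}}, stubASR, stubDecide, stubBackend)
	fmt.Println(spoken)
}
```

Each stage only knows its input and output types, which mirrors the single-responsibility split between the four services.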

Communication Protocols

From                  To                    Protocol                    Data
Media Gateway         Speech Gateway        Internal ConnectRPC stream  PCM audio chunks
Speech Gateway        Conversation Runtime  Internal ConnectRPC stream  Transcript events
Conversation Runtime  Integration Gateway   Internal ConnectRPC         Action requests
Integration Gateway   Customer Backend      REST / HTTP                 Business logic calls
Conversation Runtime  Speech Gateway        Internal ConnectRPC         TTS requests
Speech Gateway        Media Gateway         Internal ConnectRPC stream  Audio playback

Call Session Model

Every active call is represented as a CallSession object that flows through the system:

CallSession
  ├── SessionID (unique per call)
  ├── CallerInfo (SIP headers, caller ID)
  ├── State (current FSM state)
  ├── Events[]
  │   ├── SpeechEvent (transcript, confidence, timing)
  │   ├── DTMFEvent (digit, duration)
  │   ├── TimeoutEvent (elapsed time)
  │   └── BackendResultEvent (response from integration)
  └── Actions[]
      ├── PlayTTS (text, voice)
      ├── Transfer (target SIP URI)
      ├── Hangup (reason code)
      └── CallHook (service, method, payload)
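One way to read the tree above is as a set of Go types. This is a sketch only: the field names and types are guesses based on the labels in the diagram, and the `Event` and `Action` interfaces are assumptions made for illustration, not the actual voicetyped definitions.

```go
package main

import (
	"fmt"
	"time"
)

// Event is anything that can drive the dialog FSM (hypothetical interface).
type Event interface{ Kind() string }

type SpeechEvent struct {
	Transcript string
	Confidence float64
	Start, End time.Duration // timing relative to call start (assumed)
}
type DTMFEvent struct {
	Digit    rune
	Duration time.Duration
}
type TimeoutEvent struct{ Elapsed time.Duration }
type BackendResultEvent struct{ Response []byte }

func (SpeechEvent) Kind() string        { return "speech" }
func (DTMFEvent) Kind() string          { return "dtmf" }
func (TimeoutEvent) Kind() string       { return "timeout" }
func (BackendResultEvent) Kind() string { return "backend_result" }

// Action is something the runtime asks another service to do (hypothetical).
type Action interface{ Name() string }

type PlayTTS struct{ Text, Voice string }
type Transfer struct{ TargetURI string }
type Hangup struct{ ReasonCode int }
type CallHook struct {
	Service, Method string
	Payload         []byte
}

func (PlayTTS) Name() string  { return "play_tts" }
func (Transfer) Name() string { return "transfer" }
func (Hangup) Name() string   { return "hangup" }
func (CallHook) Name() string { return "call_hook" }

type CallerInfo struct {
	CallerID   string
	SIPHeaders map[string]string
}

// CallSession mirrors the tree: identity, caller, FSM state, events, actions.
type CallSession struct {
	SessionID string
	Caller    CallerInfo
	State     string
	Events    []Event
	Actions   []Action
}

func main() {
	s := CallSession{SessionID: "call-0001", State: "greeting"}
	s.Events = append(s.Events, DTMFEvent{Digit: '1', Duration: 120 * time.Millisecond})
	s.Actions = append(s.Actions, PlayTTS{Text: "Hello", Voice: "default"})
	fmt.Println(s.SessionID, s.State, s.Events[0].Kind(), s.Actions[0].Name())
}
```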

Service Details

Media Gateway

The Media Gateway is the telephony boundary of the system. It speaks SIP and RTP so the rest of the platform does not have to.

Responsibilities:

  • SIP endpoint (INVITE, BYE, CANCEL, re-INVITE, hold/resume)
  • RTP audio reception and transmission
  • Codec negotiation and transcoding (G.711 μ-law, G.711 A-law, Opus)
  • Jitter buffer management
  • DTMF detection (RFC 2833 and in-band)
  • Call lifecycle management

Implementation: Go, using either cgo bindings to PJSIP or a lightweight SIP stack.

Output: Normalized 16 kHz mono PCM stream per active call.
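To make the codec step concrete, here is a sketch of turning 8 kHz G.711 μ-law bytes into 16 kHz linear PCM. The decoder follows the standard G.711 μ-law expansion; the 2x upsampler uses plain linear interpolation, which is a simplification chosen for brevity (a production resampler would normally apply a low-pass filter). The function names are hypothetical and this is not taken from the Media Gateway source.

```go
package main

import "fmt"

// DecodeMuLaw expands one G.711 mu-law byte to a 16-bit linear PCM sample.
func DecodeMuLaw(b byte) int16 {
	b = ^b // mu-law bytes are stored bit-inverted
	sign := b & 0x80
	exponent := (b >> 4) & 0x07
	mantissa := b & 0x0F
	sample := ((int16(mantissa) << 3) + 0x84) << exponent
	sample -= 0x84 // remove the encoder bias
	if sign != 0 {
		return -sample
	}
	return sample
}

// Upsample2x doubles the sample rate by inserting the midpoint between
// neighbouring samples (naive linear interpolation, an assumption here).
func Upsample2x(in []int16) []int16 {
	out := make([]int16, 0, len(in)*2)
	for i, s := range in {
		next := s
		if i+1 < len(in) {
			next = in[i+1]
		}
		out = append(out, s, int16((int32(s)+int32(next))/2))
	}
	return out
}

func main() {
	payload := []byte{0xFF, 0x80, 0x00} // silence, positive peak, negative peak
	pcm8k := make([]int16, len(payload))
	for i, b := range payload {
		pcm8k[i] = DecodeMuLaw(b)
	}
	fmt.Println(pcm8k)             // [0 32124 -32124]
	fmt.Println(Upsample2x(pcm8k)) // twice as many samples
}
```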

Speech Gateway

The Speech Gateway handles all speech processing — both recognition (ASR) and synthesis (TTS).

Responsibilities:

  • Real-time speech recognition using whisper.cpp
  • Voice activity detection (VAD) and audio segmentation
  • Partial transcript streaming (interim results)
  • Final transcript delivery
  • Text-to-speech rendering
  • Per-call worker pool management

Implementation: Go service wrapping whisper.cpp via cgo. Optional faster-whisper (Python) backend for GPU-heavy workloads.
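As an illustration of the VAD and segmentation responsibility listed above, the sketch below marks speech regions with a simple per-frame energy threshold. This is a deliberately basic stand-in technique, not the detector the Speech Gateway actually uses; the function names, frame size, and threshold are all assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// FrameRMS returns the root-mean-square amplitude of one PCM frame.
func FrameRMS(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// SpeechSegments returns [start, end) sample ranges whose frames have an
// RMS at or above threshold. Adjacent speech frames are merged.
func SpeechSegments(pcm []int16, frameSize int, threshold float64) [][2]int {
	if frameSize <= 0 {
		return nil
	}
	var segs [][2]int
	start := -1 // -1 means "currently in silence"
	for off := 0; off < len(pcm); off += frameSize {
		end := off + frameSize
		if end > len(pcm) {
			end = len(pcm)
		}
		speech := FrameRMS(pcm[off:end]) >= threshold
		switch {
		case speech && start < 0:
			start = off
		case !speech && start >= 0:
			segs = append(segs, [2]int{start, off})
			start = -1
		}
	}
	if start >= 0 {
		segs = append(segs, [2]int{start, len(pcm)})
	}
	return segs
}

func main() {
	pcm := []int16{0, 0, 0, 0, 1000, -1000, 1000, -1000, 0, 0, 0, 0}
	fmt.Println(SpeechSegments(pcm, 4, 500)) // [[4 8]]
}
```

Each returned range would be the unit handed to the recognizer, with the segment end acting as a candidate point for a final transcript.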

API: Internal ConnectRPC (high-performance binary protocol). External-facing APIs use REST/JSON — no special client tooling required.

Conversation Runtime

The Conversation Runtime is voicetyped's core differentiator. It is not a chatbot builder; it is a deterministic runtime for dialog execution.

Responsibilities:

  • Turn detection (endpoint detection, barge-in handling)
  • Dialog state machine execution
  • Tool/action invocation
  • DTMF-driven menus
  • Timeout handling
  • Optional LLM node evaluation

Conversation Model:

# A dialog is a finite state machine
Dialog:
  name: string
  states:
    state_name:
      on_enter: Action[]
      transitions:
        - event: EventType
          condition: Expression  # optional
          target: StateName
          actions: Action[]      # optional
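
The schema above maps naturally onto a small interpreter. The Go sketch below is one possible reading of it: events and actions are reduced to plain strings, and `Condition` is a Go function standing in for the schema's `Expression`. All type and function names are hypothetical.

```go
package main

import "fmt"

type Transition struct {
	Event     string
	Condition func(payload string) bool // nil means the transition always applies
	Target    string
	Actions   []string
}

type State struct {
	OnEnter     []string
	Transitions []Transition
}

type Dialog struct {
	Name   string
	States map[string]State
}

// Step takes the first transition of the current state that matches the
// event and whose condition holds. It returns the next state plus the
// transition actions followed by the target's on_enter actions.
// If nothing matches, the state is unchanged and no actions run.
func (d Dialog) Step(current, event, payload string) (string, []string) {
	st, ok := d.States[current]
	if !ok {
		return current, nil
	}
	for _, t := range st.Transitions {
		if t.Event != event {
			continue
		}
		if t.Condition != nil && !t.Condition(payload) {
			continue
		}
		actions := append([]string{}, t.Actions...)
		actions = append(actions, d.States[t.Target].OnEnter...)
		return t.Target, actions
	}
	return current, nil
}

func exampleDialog() Dialog {
	return Dialog{
		Name: "main_menu",
		States: map[string]State{
			"menu": {
				OnEnter: []string{"play_tts:Press 1 for sales"},
				Transitions: []Transition{
					{Event: "dtmf", Condition: func(p string) bool { return p == "1" },
						Target: "sales", Actions: []string{"log:sales_selected"}},
					{Event: "timeout", Target: "goodbye"},
				},
			},
			"sales":   {OnEnter: []string{"transfer:sales"}},
			"goodbye": {OnEnter: []string{"hangup"}},
		},
	}
}

func main() {
	d := exampleDialog()
	next, actions := d.Step("menu", "dtmf", "1")
	fmt.Println(next, actions)
}
```

Because `Step` depends only on its arguments, the same event sequence always yields the same path through the dialog, which is the deterministic behavior this section describes.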

Key Design Decisions:

  • Deterministic by default — LLM nodes are opt-in, not the foundation
  • State machine is serializable — calls survive restarts
  • Per-call state isolation — no shared mutable state between calls
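
To illustrate what a serializable state machine makes possible, the sketch below round-trips a per-call snapshot through JSON. The `Snapshot` type, its fields, and the choice of JSON are assumptions for illustration; the source does not specify the actual encoding or state store.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Snapshot is a hypothetical minimal record of where one call is in its dialog.
type Snapshot struct {
	SessionID string            `json:"session_id"`
	Dialog    string            `json:"dialog"`
	State     string            `json:"state"`
	Vars      map[string]string `json:"vars,omitempty"`
}

// Save encodes the snapshot so it can be written to an external store.
func Save(s Snapshot) ([]byte, error) {
	return json.Marshal(s)
}

// Restore rebuilds a snapshot after a restart.
func Restore(data []byte) (Snapshot, error) {
	var s Snapshot
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	before := Snapshot{SessionID: "call-0001", Dialog: "main_menu", State: "menu",
		Vars: map[string]string{"lang": "en"}}
	data, err := Save(before)
	if err != nil {
		panic(err)
	}
	after, err := Restore(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
	fmt.Println(after.State == before.State)
}
```

Since each snapshot holds only one call's data, restoring it does not touch any other call, which fits the per-call isolation point above.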

Integration Gateway

The Integration Gateway is the boundary between voicetyped and customer systems.

Responsibilities:

  • Outbound REST/HTTP calls to customer backends
  • Authentication (mTLS, API keys, OAuth2)
  • Retry with exponential backoff
  • Rate limiting (per-service, per-call)
  • Circuit breaking (closed, open, and half-open states)
  • Request/response logging

Implementation: Go service with configurable backends.
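
As an example of the "retry with exponential backoff" responsibility listed above, the sketch below doubles the wait after each failed attempt up to a cap. The names (`BackoffDelay`, `Retry`, `ErrTransient`) and the specific delays are hypothetical, and jitter is omitted for brevity even though real deployments usually add it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTransient is a placeholder for a retryable backend failure.
var ErrTransient = errors.New("transient backend error")

// BackoffDelay returns base * 2^attempt, capped at max.
func BackoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// Retry runs call up to attempts times, sleeping between failures.
// It returns nil on the first success, or the last error wrapped.
func Retry(attempts int, base, max time.Duration, call func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil {
			return nil
		}
		if i < attempts-1 { // no sleep after the final attempt
			time.Sleep(BackoffDelay(i, base, max))
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := Retry(4, 10*time.Millisecond, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return ErrTransient // fail twice, then succeed
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

A circuit breaker would sit in front of `Retry` and skip the call entirely while the breaker is open.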

Scaling Strategy

Horizontal Scaling

Each service can be scaled independently:

Service               Scale Factor         Strategy
Media Gateway         Active calls         1 instance per ~100 concurrent calls
Speech Gateway        ASR workload         GPU instances, worker pool sizing
Conversation Runtime  Active sessions      Stateless with external state store
Integration Gateway   Backend call volume  Standard horizontal scaling
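
For the Media Gateway row, the sizing rule is a ceiling division. The helper below is a hypothetical illustration of that arithmetic, taking the approximate figure of 100 concurrent calls per instance from the table.

```go
package main

import "fmt"

// InstancesFor returns how many instances are needed so that no instance
// handles more than callsPerInstance concurrent calls.
func InstancesFor(concurrentCalls, callsPerInstance int) int {
	if concurrentCalls <= 0 || callsPerInstance <= 0 {
		return 0
	}
	return (concurrentCalls + callsPerInstance - 1) / callsPerInstance
}

func main() {
	fmt.Println(InstancesFor(250, 100)) // 3
}
```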

Kubernetes Scaling

In Kubernetes, the Helm chart configures Horizontal Pod Autoscalers:

autoscaling:
  mediaGateway:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPU: 70
  speechGateway:
    enabled: true
    minReplicas: 1
    maxReplicas: 5
    targetGPUUtilization: 80
  runtime:
    enabled: true
    minReplicas: 2
    maxReplicas: 20
    targetCPU: 60

Data Flow Guarantees

  • Audio never leaves the deployment — all ASR processing is local
  • Transcripts are ephemeral — stored only in call session memory unless explicitly persisted
  • Actions are idempotent: the Integration Gateway delivers each action at least once and deduplicates repeats, so a retried action takes effect only once
  • State is recoverable — call sessions can be serialized and restored after restarts
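
One common way to combine at-least-once delivery with deduplication is to attach an idempotency key to each action and remember the result of the first successful execution. The sketch below shows that idea in memory only; the `Deduper` type and the key format are hypothetical, and the source does not say how the Integration Gateway actually implements it.

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper remembers the result of each action key that has succeeded.
type Deduper struct {
	mu      sync.Mutex
	results map[string]string
}

func NewDeduper() *Deduper {
	return &Deduper{results: make(map[string]string)}
}

// Do runs fn the first time a key is seen and returns the stored result
// for later deliveries of the same key. Failed runs are not stored, so
// they can be retried. Holding the lock while fn runs keeps the sketch
// simple but serializes all actions.
func (d *Deduper) Do(key string, fn func() (string, error)) (string, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if r, ok := d.results[key]; ok {
		return r, nil
	}
	r, err := fn()
	if err != nil {
		return "", err
	}
	d.results[key] = r
	return r, nil
}

func main() {
	d := NewDeduper()
	executions := 0
	hook := func() (string, error) {
		executions++
		return "ok", nil
	}
	// The same action delivered twice, for example after a retry.
	d.Do("call-0001/action-7", hook)
	d.Do("call-0001/action-7", hook)
	fmt.Println(executions) // 1
}
```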
