Technical Guide

OpenAI Realtime API: Build a Voice Agent in 2026 (Practical Guide)

The OpenAI Realtime API is OpenAI's first-party way to build a voice agent. This guide covers latency, pricing, what it does and does not solve, and when to use it versus a productized vendor.

By the Open Team | Updated May 30, 2026 | 12 min read

OpenAI's Realtime API turned voice agents from a stitched-together pipeline (speech-to-text → LLM → text-to-speech) into a single bidirectional model that takes audio in and emits audio out. Latency dropped, intonation improved, and the build pattern simplified. It's the most-talked-about voice primitive in 2026.

It's also not a complete voice agent. The model is one component; the carrier integration, the tool layer, the observability, the compliance posture, and the operational work are still your code. This piece is the practical engineering perspective on what Realtime actually solves and what it doesn't.

TL;DR

  • OpenAI Realtime = audio-in/audio-out model with tool calling, sub-300ms latency, WebSocket protocol.
  • What it solves: model-layer concerns. STT/LLM/TTS as one streaming pipe. Better intonation. Native tool calling.
  • What it doesn't solve: SIP bridging, integrations, observability, compliance, transfers, knowledge base, deployment.
  • Build with it if: voice AI is your product or you have a deeply unusual use case.
  • Buy a productized agent if: your business is the use case and you want to ship in days.

What the Realtime API actually is

A WebSocket-based API. You open a connection to OpenAI, stream audio in, and receive audio out (plus tool-call instructions when the model needs to invoke a tool). A single model handles speech recognition, reasoning, and speech generation, rather than three separate services passing text between them.

The protocol is event-based:

  • session.update — configure the session (voice, language, tools, system prompt).
  • input_audio_buffer.append — stream audio in.
  • response.audio.delta — receive audio chunks back.
  • response.function_call_arguments.delta — tool call requests stream incrementally.

Sub-300ms first-token latency is achievable in US/EU regions. The model handles barge-in, partial utterances, and code-switching natively.
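To make the event protocol concrete, here is a minimal sketch of how the two client-side events listed above could be serialized before being sent over the WebSocket. The event type names come from the list above; the field names inside `session` (`instructions`, `voice`, `tools`), the top-level `audio` field, and the `alloy` voice are assumptions that should be checked against the current API reference, since payload shapes have changed between API versions. No connection is opened here.

```python
import base64
import json


def build_session_update(instructions: str, voice: str, tools: list) -> str:
    """Serialize a session.update event (session field names are assumptions)."""
    event = {
        "type": "session.update",
        "session": {
            "instructions": instructions,  # the system prompt
            "voice": voice,
            "tools": tools,
        },
    }
    return json.dumps(event)


def build_audio_append(audio_chunk: bytes) -> str:
    """Serialize an input_audio_buffer.append event; audio travels as base64 text."""
    event = {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(build_session_update("You are a helpful receptionist.", "alloy", []))
    print(build_audio_append(b"\x00\x01" * 4))
```

Keeping the builders free of network code makes them easy to unit test; the WebSocket client only has to send the returned strings.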

What it solves vs what it doesn't

Figure: Where an AI agent sits in the support stack

  1. Helpdesk. Ticket management, agent UI, reporting. Intercom, Zendesk, Freshdesk, HubSpot, Salesforce, Twilio Flex.
  2. AI agent (orchestrator). Customer-facing reasoning and action execution. Native (Fin, Einstein, Freddy) or third-party (Open.cx).
  3. Knowledge base. Source of truth for policy and procedures. Intercom Articles, Zendesk Guide, Notion, custom CMS.
  4. Identity & auth. Customer authentication. Auth0, Okta, custom SSO.
  5. Transactional systems. Orders, billing, subscriptions, fulfillment. Stripe, Shopify, custom OMS.
  6. CRM. Customer history and account context. Salesforce, HubSpot, Segment.
  7. Observability. Conversation logs, confidence sampling, replay. Platform-native, data warehouse, custom dashboards.

The AI agent makes the rest of the stack invisible to the customer.

The Realtime API solves the model-layer concerns of building a voice agent. Specifically:

  • STT + LLM + TTS as one pipe — no stitching, no inter-service latency.
  • Tool calling at the model level — the model knows how to call your tools while staying in audio mode.
  • Barge-in and overlap handling — built into the model, no custom voice-activity detection needed.
  • Multilingual — the model handles language detection and switching natively.
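Because tool-call arguments arrive as incremental fragments, the receiving side has to reassemble them before calling a tool. The following is a minimal sketch of that reassembly. It assumes each delta event carries `call_id` and `delta` fields and that a matching `response.function_call_arguments.done` event (optionally carrying the full `arguments` string) marks completion; those names are assumptions to verify against the current event reference.

```python
import json
from collections import defaultdict


class ToolCallAssembler:
    """Collects streamed argument fragments per call and parses them when complete."""

    def __init__(self) -> None:
        self._fragments = defaultdict(list)

    def handle(self, event: dict):
        """Return (call_id, parsed_args) once a call is complete, otherwise None."""
        kind = event.get("type")
        if kind == "response.function_call_arguments.delta":
            self._fragments[event["call_id"]].append(event["delta"])
            return None
        if kind == "response.function_call_arguments.done":
            call_id = event["call_id"]
            raw = "".join(self._fragments.pop(call_id, []))
            # Prefer the accumulated text; fall back to the final payload if present.
            raw = raw or event.get("arguments", "{}")
            return call_id, json.loads(raw)
        return None  # unrelated event types are ignored
```

A caller would feed every incoming server event to `handle` and dispatch to its own tool layer whenever a tuple comes back.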

What it does NOT solve:

  • Carrier integration. OpenAI Realtime speaks WebSocket; phones speak SIP. You need a bridge: Twilio Media Streams, Pipecat, or your own. This bridge is non-trivial production code.
  • Integrations. No native CRM, helpdesk, calendar, or billing integrations. You build each one.
  • Knowledge base. No retrieval layer, no document syncing. You bring your own.
  • Observability. No transcript export, no reasoning traces, no outcome tags out of the box. You build the logging.
  • Compliance. SOC 2, HIPAA, PCI — your stack, your responsibility.
  • Warm transfers. The model can decide to transfer; you build the SIP REFER plumbing.
  • Operational tooling. Agent versioning, A/B traffic splitting, prompt rollouts — your build.
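As one illustration of the "you build the logging" point, here is a minimal sketch of an event log that writes one JSON line per Realtime event while dropping raw audio payloads. The record layout (`ts`, `call_id`, `type`, `data`) and the `log_event` helper are hypothetical, not part of any OpenAI API; a production system would also need transcripts, outcome tags, and retention controls.

```python
import io
import json
import time
from typing import Optional, TextIO


def _strip_audio(event: dict) -> dict:
    """Drop bulky audio payloads so the log stays small and searchable."""
    is_audio_delta = event.get("type", "").endswith("audio.delta")
    return {
        key: value
        for key, value in event.items()
        if key not in ("type", "audio") and not (is_audio_delta and key == "delta")
    }


def log_event(sink: TextIO, call_id: str, event: dict, now: Optional[float] = None) -> dict:
    """Append one event to the sink as a JSON line and return the record."""
    record = {
        "ts": time.time() if now is None else now,
        "call_id": call_id,
        "type": event.get("type", "unknown"),
        "data": _strip_audio(event),
    }
    sink.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    buffer = io.StringIO()
    log_event(buffer, "call-1", {"type": "response.audio.delta", "delta": "AAAA"})
    print(buffer.getvalue(), end="")
```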

The model is one of seven layers in a production voice agent. Realtime is the fastest, cleanest version of that one layer. The other six layers haven't gone away.

A practical architecture

If you're building on Realtime, the production stack typically looks like:

Carrier (Twilio / Vonage / your trunk)
    ↓ SIP
Bridge service (Twilio Media Streams / Pipecat / custom)
    ↓ WebSocket
OpenAI Realtime API (model layer)
    ↓ tool calls
Your tool layer (CRM, calendar, helpdesk, billing connectors)
    ↓ side effects
Observability (transcript, trace, outcome logging)

Five layers, three of which (the bridge, the tool layer, and observability) are your code. The model layer is the easy part now; the other layers are the project.
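The core of the bridge step in the stack above can be sketched as a pure mapping from one carrier frame to one Realtime event. The sketch assumes Twilio Media Streams as the carrier side, where audio frames arrive as JSON with `event: "media"` and a base64 payload under `media.payload`, and assumes the Realtime session has been configured to accept the same G.711 μ-law audio so no transcoding is needed. The function name is hypothetical, and a real bridge also has to handle the return audio path, interruptions, and reconnects.

```python
import json
from typing import Optional


def twilio_to_realtime(raw_message: str) -> Optional[str]:
    """Map one Twilio Media Streams frame to a Realtime append event.

    Returns None for non-audio frames such as connected, start, stop, or mark.
    """
    message = json.loads(raw_message)
    if message.get("event") != "media":
        return None
    # The Twilio payload is already base64 text, so it is forwarded untouched.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": message["media"]["payload"],
    })


if __name__ == "__main__":
    frame = json.dumps({"event": "media", "streamSid": "MZ123", "media": {"payload": "AAEC"}})
    print(twilio_to_realtime(frame))
```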

Latency in practice

Median first-token latency on Realtime is sub-300ms in US/EU regions. This is meaningfully lower than the stitched STT→LLM→TTS pattern, which typically lands at 600-900ms median.

For voice perception, the gap matters. Sub-300ms feels conversational; 600-900ms feels robotic. The Realtime improvement is real and noticeable to customers.

That said: the carrier-side path adds latency (typically 50-100ms each way for SIP-WebSocket bridging), and the tool-call layer adds latency when the agent needs to look something up before responding. End-to-end production latency typically lands at 400-600ms even with Realtime — better than the stitched pipeline, not magic.
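The latency budget described above is simple addition, and a small sketch makes the components explicit. The model and bridge figures are the ones quoted in this section; the 100ms tool-call figure in the usage example is a hypothetical value chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    model_first_token_ms: float
    bridge_one_way_ms: float
    tool_call_ms: float = 0.0

    def end_to_end_ms(self) -> float:
        # Audio crosses the bridge twice: caller -> model and model -> caller.
        return self.model_first_token_ms + 2 * self.bridge_one_way_ms + self.tool_call_ms


if __name__ == "__main__":
    best_case = LatencyBudget(model_first_token_ms=300, bridge_one_way_ms=50)
    with_lookup = LatencyBudget(model_first_token_ms=300, bridge_one_way_ms=100, tool_call_ms=100)
    print(best_case.end_to_end_ms())    # 400
    print(with_lookup.end_to_end_ms())  # 600
```

With these inputs the two cases land at the ends of the 400-600ms range quoted above.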

Cost in practice

Worked example: mid-market team

Applying the five levers to a real bill. Result: $2,245 saved per month (32% lower).

Line item | Before | After | Lever
Seats (Advanced) | $1,275 | $1,020 | Lever 1
Fin AI | $4,950 | $1,560 | Levers 3 + 4
Outcome-priced AI | $0 | $2,100 | Lever 4
Phone | $400 | $0 | Lever 2
Proactive Support+ | $150 | $50 | Lever 2
KB widget (extra) | $200 | $0 | Lever 5
Monthly total | $6,975 | $4,730 | All five

Pricing as of mid-2026 (verify current rates on the OpenAI pricing page):

  • Audio input: ~$0.06 per minute.
  • Audio output: ~$0.24 per minute.
  • Tool-call text: standard GPT-4o text token pricing.

For a 2-3 minute call:

  • 2.5 min × $0.06 (input) = $0.15
  • 1.5 min × $0.24 (output, assuming 60% talk-time on the AI side) = $0.36
  • Tool-call tokens: ~$0.05
  • Total Realtime cost: ~$0.55-0.60 per call

Add carrier minutes (~$0.05-0.15 for the call), bridge service costs, integration API calls, and your engineering amortization. Real all-in cost on a Realtime build is typically $0.80-$1.20 per call at meaningful volume.

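For comparison, productized vendors like Open.cx ship at $0.70 per resolved conversation all-in. The two figures use different units (per call versus per resolved conversation), so the exact gap depends on your resolution rate, but the headline math doesn't favor a build.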
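The per-call arithmetic above can be reproduced with a short sketch, which also makes it easy to plug in your own call length and talk-time share. The rates are the approximate figures quoted in this section and should be replaced with current pricing. The second helper converts a per-call cost into a per-resolution cost; the 80% resolution rate in the usage example is a hypothetical value, not a measured one.

```python
AUDIO_IN_PER_MIN = 0.06   # approximate rate quoted in this section
AUDIO_OUT_PER_MIN = 0.24  # approximate rate quoted in this section


def realtime_cost_per_call(call_minutes: float,
                           ai_talk_share: float = 0.6,
                           tool_token_cost: float = 0.05) -> float:
    """Model-layer cost only: input audio for the whole call, output for the AI's share."""
    input_cost = call_minutes * AUDIO_IN_PER_MIN
    output_cost = call_minutes * ai_talk_share * AUDIO_OUT_PER_MIN
    return round(input_cost + output_cost + tool_token_cost, 2)


def cost_per_resolution(all_in_cost_per_call: float, resolution_rate: float) -> float:
    """Spread the cost of every call over the calls that actually get resolved."""
    if not 0 < resolution_rate <= 1:
        raise ValueError("resolution_rate must be in (0, 1]")
    return round(all_in_cost_per_call / resolution_rate, 2)


if __name__ == "__main__":
    print(realtime_cost_per_call(2.5))     # 0.56
    print(cost_per_resolution(1.00, 0.8))  # 1.25
```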

Build vs buy: the honest framing

Build on OpenAI Realtime when:

  • Voice AI is your product. You're shipping a voice AI startup; the model layer is core to your business.
  • You have a deeply unusual use case that no productized vendor fits.
  • You have 4-8 weeks of engineering bandwidth to ship the surrounding layers.
  • You want fine-grained control of the voice agent experience and are willing to maintain it forever.

Buy a productized agent when:

  • Your business is the use case (customer service, receptionist, sales outbound) and voice AI is the means.
  • You want to ship in days, not months.
  • You don't want to operate the SIP bridge, the integration layer, or the compliance posture.
  • The vendor's per-resolution price is competitive with your blended cost of building.

For most customer service buyers, the buy decision is the right one. Open.cx, PolyAI, and Sierra all ship the framework, integrations, observability, and compliance pre-built. Building the equivalent on Realtime takes 4-8 weeks minimum and leaves you maintaining the stack.

When Realtime is the right primitive

Three buyer profiles where building on Realtime makes sense:

1. Voice AI startups. The model layer is your business. You need control over how the voice agent works at the lowest level. Realtime + Pipecat + your own infrastructure = your platform.

2. Tier-1 enterprise with bespoke requirements. You want a voice agent fundamentally tuned to your brand and use case, and you have an engineering team capable of building and operating it. Sierra is the managed-service option; Realtime + custom build is the in-house option.

3. Research and prototyping. You're exploring what's possible with voice AI. The Realtime API has the cleanest dev loop of any voice primitive in 2026.

For everyone else — and that's most teams — productized vendors are the right answer.

Where Open.cx fits

Open.cx isn't built exclusively on OpenAI Realtime. We use Realtime where it's the best fit and other primitives (combined STT+LLM+TTS, ElevenLabs Conversational AI, Anthropic streaming) where they fit better. The point is that the buyer doesn't have to make this choice.

Open ships:

  • The carrier layer — 37+ first-class SIP integrations.
  • The tool layer — 50+ pre-built integrations.
  • The observability layer — recording, transcripts, reasoning traces, outcome tags.
  • The compliance layer — SOC 2 Type II, HIPAA-ready, PCI-ready.
  • The operational layer — agent versioning, A/B testing, prompt management.
  • The model layer — Realtime (and others) under the hood, abstracted from you.

Per-resolution pricing at $0.70 covers the whole stack. A build on Realtime needs to absorb model cost + carrier + bridge + integrations + ops + amortized engineering, typically $0.80-$1.20 per call at production scale, plus 4-8 weeks of build time and ongoing operations.

When to pick what

Scenario | Pick
Voice AI startup, model is core | OpenAI Realtime + Pipecat / custom build
Customer service, 1-7 day deploy | Open.cx
Tier-1 brand wanting managed build | Sierra
Research / prototyping | OpenAI Realtime alone (no carrier needed initially)
Multilingual hospitality at scale | PolyAI (managed) or Open.cx
SMB receptionist (10-100 calls/day) | Open.cx, Goodcall, Synthflow
Large enterprise, CCaaS-led | Decagon

Frequently Asked Questions