What is the OpenAI Realtime API?

OpenAI's Realtime API is a WebSocket-based API for building voice agents directly on top of OpenAI models. It accepts streaming audio in, returns streaming audio out, and supports tool calling, all over a single bidirectional connection. Sub-300ms first-token latency is achievable. It was released in October 2024 and matured significantly through 2025-2026.

How is OpenAI Realtime different from STT + LLM + TTS?

Traditional voice agents stitch three services: speech-to-text → LLM → text-to-speech. Each hop adds latency. OpenAI Realtime does it in a single model that takes audio in and emits audio out, plus tool calls when needed. Latency is meaningfully lower, intonation is more natural, and barge-in handling is built-in. Trade-off: less flexibility on individual model choices.

How much does OpenAI Realtime cost?

Pricing is per minute of audio, with input and output metered separately. As of mid-2026, GPT-4o Realtime pricing is roughly $0.06 per minute of audio input and $0.24 per minute of audio output, plus token costs for the text portion of tool calling. Total cost for a typical 2-3 minute customer-service call: $0.30-$0.60 on the model alone, before telephony costs.

Can I use OpenAI Realtime for a phone agent?

Yes, but you have to bridge it to the phone network yourself. The Realtime API talks WebSocket; phones talk SIP. You need a bridge service (Twilio Media Streams, Pipecat, or a custom one) to connect the carrier-side SIP audio to the OpenAI WebSocket. This bridge is the part that takes engineering work.

Should I build with OpenAI Realtime or use a productized vendor?

Build with Realtime if voice AI is your product or you have unusual requirements that no vendor fits. Buy a productized vendor (Open.cx, PolyAI, Sierra) if your business is the use case (customer service, receptionist, sales outbound) and you want to ship in days. Most teams should buy.

What integrations do I get with OpenAI Realtime?

None out of the box. The Realtime API is a model API; integrations to your CRM, calendar, helpdesk, billing, and carrier are your code. Productized vendors ship 50+ integrations pre-built; with Realtime you build each one.

How long does it take to ship a production voice agent on Realtime?

For a focused use case (one carrier, one CRM, one call type): 4-8 weeks of engineering minimum. The bridge from SIP to Realtime, the tool integration layer, the observability, the human transfer paths, and the compliance work all add up. For comparison, Open.cx ships in 1-14 days because all of that is pre-built.

Are there alternatives to OpenAI Realtime?

Yes. Anthropic's Claude streaming, Google Gemini Live, ElevenLabs Conversational AI, and infrastructure platforms (Vapi, Bland, Retell) that abstract the underlying model. The model layer is increasingly commoditized — the differentiation in 2026 is in the application layer above the model.

Building a voice agent with the OpenAI Realtime API

OpenAI's Realtime API turned voice agents from a stitched-together pipeline (speech-to-text → LLM → text-to-speech) into a single bidirectional model that takes audio in and emits audio out. Latency dropped, intonation improved, and the build pattern simplified. It's the most-talked-about voice primitive in 2026.

It's also not a complete voice agent. The model is one component; the carrier integration, the tool layer, the observability, the compliance posture, and the operational work are still your code. This piece is the practical engineering perspective on what Realtime actually solves and what it doesn't.

TL;DR

OpenAI Realtime = audio-in/audio-out model with tool calling, sub-300ms latency, WebSocket protocol.
What it solves: model-layer concerns. STT/LLM/TTS as one streaming pipe. Better intonation. Native tool calling.
What it doesn't solve: SIP bridging, integrations, observability, compliance, transfers, knowledge base, deployment.
Build with it if: voice AI is your product or you have a deeply unusual use case.
Buy a productized agent if: your business is the use case and you want to ship in days.

What the Realtime API actually is

The API is WebSocket-based. You open a connection to OpenAI, stream audio in, and receive audio out (plus tool-call instructions when the model needs to invoke a tool). A single model handles speech recognition, reasoning, and speech generation, rather than three separate services chained together.

The protocol is event-based:

session.update — configure the session (voice, language, tools, system prompt).
input_audio_buffer.append — stream audio in.
response.audio.delta — receive audio chunks back.
response.function_call_arguments.delta — tool call requests stream incrementally.

Sub-300ms first-token latency is achievable on US/EU regions. The model handles barge-in, partial utterances, and code-switching natively.
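
To make the event flow concrete, here is a minimal sketch of how a client might build and parse these events as JSON. The field names (`session.instructions`, `session.voice`, `session.tools`, `audio`, `delta`) are assumptions based on the event names listed above and should be checked against the current API reference. The helper names are hypothetical, and no network connection is opened.

```python
import base64
import json


def session_update(instructions, voice, tools):
    """Build a session.update event (assumed shape: prompt, voice, tools)."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions, "voice": voice, "tools": tools},
    })


def audio_append(audio_chunk):
    """Build an input_audio_buffer.append event; raw audio is sent as base64 text."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    })


def handle_server_event(raw, audio_out):
    """Append decoded audio from response.audio.delta events; ignore other types."""
    event = json.loads(raw)
    if event.get("type") == "response.audio.delta":
        audio_out.extend(base64.b64decode(event["delta"]))
```

In a real client these strings would be sent and received over the open WebSocket. Keeping the builders as pure functions makes the protocol layer testable without a live connection.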

What it solves vs what it doesn't

Figure: Where an AI agent sits in the support stack

Orchestration layer:

Helpdesk. Ticket management, agent UI, reporting. Intercom, Zendesk, Freshdesk, HubSpot, Salesforce, Twilio Flex.
AI agent (orchestrator). Customer-facing reasoning and action execution. Native (Fin, Einstein, Freddy) or third-party (Open.cx).
Knowledge base. Source of truth for policy and procedures. Intercom Articles, Zendesk Guide, Notion, custom CMS.
Identity & auth. Customer authentication. Auth0, Okta, custom SSO.
Transactional systems. Orders, billing, subscriptions, fulfillment. Stripe, Shopify, custom OMS.
CRM. Customer history and account context. Salesforce, HubSpot, Segment.
Observability. Conversation logs, confidence sampling, replay. Platform-native, data warehouse, custom dashboards.

The AI agent makes the rest of the stack invisible to the customer.

The Realtime API solves the model-layer concerns of building a voice agent. Specifically:

STT + LLM + TTS as one pipe — no stitching, no inter-service latency.
Tool calling at the model level — the model knows how to call your tools while staying in audio mode.
Barge-in and overlap handling — built into the model, no custom voice-activity detection needed.
Multilingual — the model handles language detection and switching natively.

What it does NOT solve:

Carrier integration. OpenAI Realtime speaks WebSocket; phones speak SIP. You need a bridge: Twilio Media Streams, Pipecat, or your own. This bridge is non-trivial production code.
Integrations. No native CRM, helpdesk, calendar, or billing integrations. You build each one.
Knowledge base. No retrieval layer, no document syncing. You bring your own.
Observability. No transcript export, no reasoning traces, no outcome tags out of the box. You build the logging.
Compliance. SOC 2, HIPAA, PCI — your stack, your responsibility.
Warm transfers. The model can decide to transfer; you build the SIP REFER plumbing.
Operational tooling. Agent versioning, A/B traffic splitting, prompt rollouts — your build.
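
As an illustration of the carrier-integration item, the core of a bridge is a pair of message translations. The sketch below assumes Twilio Media Streams delivers caller audio as JSON messages with `event: "media"` and a base64 `media.payload`, and that the Realtime session has been configured to accept that audio format (8 kHz mu-law) so the payload can pass through without transcoding. Both points are assumptions to verify against vendor documentation, and the function names are hypothetical.

```python
import json
from typing import Optional


def twilio_to_realtime(raw_twilio_msg: str) -> Optional[str]:
    """Translate one Twilio media message into a Realtime client event.

    Returns None for messages that carry no caller audio (start, stop, mark).
    """
    msg = json.loads(raw_twilio_msg)
    if msg.get("event") != "media":
        return None
    # The payload is already base64, so it is forwarded untouched.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": msg["media"]["payload"],
    })


def realtime_to_twilio(raw_realtime_event: str, stream_sid: str) -> Optional[str]:
    """Translate a Realtime audio delta into an outbound Twilio media message."""
    event = json.loads(raw_realtime_event)
    if event.get("type") != "response.audio.delta":
        return None
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": event["delta"]},
    })
```

The translation itself is small. The production work sits around it: two long-lived WebSocket connections per call, reconnect handling, clearing buffered audio on barge-in, and call teardown.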

The model is one layer of a production voice agent. Realtime is the fastest, cleanest version of that one layer. The seven concerns listed above haven't gone away.

A practical architecture

If you're building on Realtime, the production stack typically looks like:

Carrier (Twilio / Vonage / your trunk)
    ↓ SIP
Bridge service (Twilio Media Streams / Pipecat / custom)
    ↓ WebSocket
OpenAI Realtime API (model layer)
    ↓ tool calls
Your tool layer (CRM, calendar, helpdesk, billing connectors)
    ↓ side effects
Observability (transcript, trace, outcome logging)

Five layers, and everything except the carrier and the model is yours to write or operate. The model layer is the easy part now; the other layers are the project.
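
The hand-off from the model layer to the tool layer and on to observability can be sketched as a small router. It buffers streamed argument fragments, runs the matching tool once the call is complete, records the call, and returns the event that sends the result back. The event shapes used here (`call_id` and `delta` on the delta event, a `response.output_item.done` event carrying a `function_call` item, and a `conversation.item.create` reply) are assumptions to verify against the API reference. `ToolCallRouter` and the tool registry are hypothetical.

```python
import json
from typing import Optional


class ToolCallRouter:
    """Buffers streamed tool-call arguments, runs the tool, and logs the call."""

    def __init__(self, tools, audit_log):
        self._tools = tools          # tool name -> callable taking keyword args
        self._audit_log = audit_log  # stand-in for the observability layer
        self._arg_buffers = {}       # call_id -> partial JSON argument string

    def on_event(self, event: dict) -> Optional[dict]:
        etype = event.get("type")
        if etype == "response.function_call_arguments.delta":
            call_id = event["call_id"]
            self._arg_buffers[call_id] = self._arg_buffers.get(call_id, "") + event["delta"]
            return None
        if etype == "response.output_item.done" and event["item"].get("type") == "function_call":
            item = event["item"]
            # Prefer the streamed buffer; fall back to the arguments on the item.
            raw_args = self._arg_buffers.pop(item["call_id"], None) or item.get("arguments") or "{}"
            result = self._tools[item["name"]](**json.loads(raw_args))
            self._audit_log.append({"tool": item["name"], "args": raw_args, "result": result})
            return {
                "type": "conversation.item.create",
                "item": {
                    "type": "function_call_output",
                    "call_id": item["call_id"],
                    "output": json.dumps(result),
                },
            }
        return None
```

A production version would add timeouts, error results for failed tools, and asynchronous execution so that a slow CRM lookup does not block audio handling.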

Latency in practice

Median first-token latency on Realtime is sub-300ms in US/EU regions. This is meaningfully lower than the stitched STT→LLM→TTS pattern, which typically lands at 600-900ms median.

For voice perception, the gap matters. Sub-300ms feels conversational; 600-900ms feels robotic. The Realtime improvement is real and noticeable to customers.

That said: the carrier-side path adds latency (typically 50-100ms each way for SIP-WebSocket bridging), and the tool-call layer adds latency when the agent needs to look something up before responding. End-to-end production latency typically lands at 400-600ms even with Realtime — better than the stitched pipeline, not magic.
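
The budget above can be written out as simple range arithmetic. This sketch treats "sub-300ms" as a 300 ms ceiling and uses the 50-100 ms per-direction bridging figure from the text. The 0-100 ms tool-lookup range is a hypothetical value chosen for illustration, not a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low_ms: int
    high_ms: int

    def __add__(self, other: "Range") -> "Range":
        return Range(self.low_ms + other.low_ms, self.high_ms + other.high_ms)


MODEL_FIRST_TOKEN = Range(300, 300)  # "sub-300ms", taken as a ceiling
BRIDGE_ONE_WAY = Range(50, 100)      # SIP-WebSocket bridging, per direction


def end_to_end(tool_lookup: Range = Range(0, 0)) -> Range:
    """Model latency plus bridging in both directions plus any tool lookup."""
    return MODEL_FIRST_TOKEN + BRIDGE_ONE_WAY + BRIDGE_ONE_WAY + tool_lookup


print(end_to_end())               # Range(low_ms=400, high_ms=500)
print(end_to_end(Range(0, 100)))  # Range(low_ms=400, high_ms=600), hypothetical lookup
```

Bridging alone moves the figure to 400-500 ms, and the upper end of the 400-600 ms range appears once a turn needs a tool call.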

Cost in practice

Worked example: mid-market team

Applying the five levers to a real bill. $2,245 saved per month (32% lower).

Line item	Before	After	Lever
Seats (Advanced)	$1,275	$1,020	Lever 1
Fin AI	$4,950	$1,560	Levers 3 + 4
Outcome-priced AI	$0	$2,100	Lever 4
Phone	$400	$0	Lever 2
Proactive Support+	$150	$50	Lever 2
KB widget (extra)	$200	$0	Lever 5
Monthly total	$6,975	$4,730	All five

Pricing as of mid-2026 (verify current rates on the OpenAI pricing page):

Audio input: ~$0.06 per minute.
Audio output: ~$0.24 per minute.
Tool-call text: standard GPT-4o text token pricing.

For a 2.5-minute call (the midpoint of the 2-3 minute range):

2.5 min × $0.06 (input) = $0.15
1.5 min × $0.24 (output, assuming 60% talk-time on the AI side) = $0.36
Tool-call tokens: ~$0.05
Total Realtime cost: ~$0.55-0.60 per call
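
The same arithmetic as a small function, so the estimate can be rerun for other call lengths. The rates are the approximate mid-2026 figures quoted above. The 60% AI talk share and the $0.05 tool-call allowance are the assumptions from the worked example, and the function name is hypothetical.

```python
AUDIO_IN_PER_MIN = 0.06   # USD per minute of audio input (approximate)
AUDIO_OUT_PER_MIN = 0.24  # USD per minute of audio output (approximate)


def realtime_call_cost(call_minutes, ai_talk_share=0.6, tool_call_usd=0.05):
    """Estimate model-only cost for one call, in USD, rounded to cents."""
    input_cost = call_minutes * AUDIO_IN_PER_MIN                   # whole call is metered as input
    output_cost = call_minutes * ai_talk_share * AUDIO_OUT_PER_MIN  # only the AI's share is output
    return round(input_cost + output_cost + tool_call_usd, 2)


print(realtime_call_cost(2.5))  # 0.56
```

Output minutes dominate the bill, so an agent that talks less per call is the most direct cost lever at the model layer.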

Add carrier minutes (~$0.05-0.15 for the call), bridge service costs, integration API calls, and your engineering amortization. Real all-in cost on a Realtime build is typically $0.80-$1.20 per call at meaningful volume.

For comparison, productized vendors like Open.cx ship at $0.70 per resolved conversation all-in. The headline math doesn't favor a build.

Build vs buy: the honest framing

Build on OpenAI Realtime when:

Voice AI is your product. You're shipping a voice AI startup; the model layer is core to your business.
You have a deeply unusual use case that no productized vendor fits.
You have 4-8 weeks of engineering bandwidth to ship the surrounding layers.
You want fine-grained control of the voice agent experience and are willing to maintain it forever.

Buy a productized agent when:

Your business is the use case (customer service, receptionist, sales outbound) and voice AI is the means.
You want to ship in days, not months.
You don't want to operate the SIP bridge, the integration layer, or the compliance posture.
The vendor's per-resolution price is competitive with your blended cost of building.

For most customer service buyers, the buy decision is the right one. Open.cx, PolyAI, and Sierra all ship the framework, integrations, observability, and compliance pre-built. Building the equivalent on Realtime takes 4-8 weeks minimum and leaves you maintaining the stack.

When Realtime is the right primitive

Three buyer profiles where building on Realtime makes sense:

1. Voice AI startups. The model layer is your business. You need control over how the voice agent works at the lowest level. Realtime + Pipecat + your own infrastructure = your platform.

2. Tier-1 enterprise with bespoke requirements. You want a voice agent fundamentally tuned to your brand and use case, and you have an engineering team capable of building and operating it. Sierra is the managed-service option; Realtime + custom build is the in-house option.

3. Research and prototyping. You're exploring what's possible with voice AI. The Realtime API has the cleanest dev loop of any voice primitive in 2026.

For everyone else — and that's most teams — productized vendors are the right answer.

Where Open.cx fits

Open.cx isn't built on top of OpenAI Realtime as an exclusive primitive. We use Realtime where it's the best fit and other primitives (combined STT+LLM+TTS, ElevenLabs Conversational AI, Anthropic streaming) where they fit better. The point is the buyer doesn't have to make this choice.

Open.cx ships:

The carrier layer — 37+ first-class SIP integrations.
The tool layer — 50+ pre-built integrations.
The observability layer — recording, transcripts, reasoning traces, outcome tags.
The compliance layer — SOC 2 Type II, HIPAA-ready, PCI-ready.
The operational layer — agent versioning, A/B testing, prompt management.
The model layer — Realtime (and others) under the hood, abstracted from you.

Per-resolution pricing at $0.70 covers the whole stack. A build on Realtime needs to absorb model cost + carrier + bridge + integrations + ops + amortized engineering — typically $0.80-$1.50 per call at production scale, plus 4-8 weeks of build time, plus ongoing operations.
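
The two prices use different units (per resolved conversation versus per call), so a like-for-like comparison needs a call volume and a resolution rate. The sketch below is one way to set that up. It uses the $0.70 and $0.80-$1.50 figures from the paragraph above, while the volume, the resolution rate, and the assumption that unresolved conversations are not billed are hypothetical inputs to replace with your own numbers and contract terms.

```python
def monthly_buy_cost(calls, resolution_rate, price_per_resolution=0.70):
    """Vendor cost if only resolved conversations are billed (an assumption)."""
    return round(calls * resolution_rate * price_per_resolution, 2)


def monthly_build_cost(calls, cost_per_call_low=0.80, cost_per_call_high=1.50):
    """All-in build cost range, billed on every call regardless of outcome."""
    return (round(calls * cost_per_call_low, 2), round(calls * cost_per_call_high, 2))


calls = 10_000          # hypothetical monthly volume
resolution_rate = 0.6   # hypothetical share of calls the agent resolves

print(monthly_buy_cost(calls, resolution_rate))
print(monthly_build_cost(calls))
```

The build figures exclude the initial 4-8 weeks of engineering time unless that time is already amortized into the per-call range.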

When to pick what

Scenario	Pick
Voice AI startup, model is core	OpenAI Realtime + Pipecat / custom build
Customer service, 1-7 day deploy	Open.cx
Tier-1 brand wanting managed build	Sierra
Research / prototyping	OpenAI Realtime alone (no carrier needed initially)
Multilingual hospitality at scale	PolyAI (managed) or Open.cx
SMB receptionist (10-100 calls/day)	Open.cx, Goodcall, Synthflow
Large enterprise CCaaS-led	Decagon