OpenAI's Realtime API turned voice agents from a stitched-together pipeline (speech-to-text → LLM → text-to-speech) into a single bidirectional model that takes audio in and emits audio out. Latency dropped, intonation improved, and the build pattern simplified. It's the most-talked-about voice primitive in 2026.
It's also not a complete voice agent. The model is one component; the carrier integration, the tool layer, the observability, the compliance posture, and the operational work are still your code. This piece is the practical engineering perspective on what Realtime actually solves and what it doesn't.
TL;DR
- OpenAI Realtime = audio-in/audio-out model with tool calling, sub-300ms latency, WebSocket protocol.
- What it solves: model-layer concerns. STT/LLM/TTS as one streaming pipe. Better intonation. Native tool calling.
- What it doesn't solve: SIP bridging, integrations, observability, compliance, transfers, knowledge base, deployment.
- Build with it if: voice AI is your product or you have a deeply unusual use case.
- Buy a productized agent if: your business is the use case and you want to ship in days.
What the Realtime API actually is
A WebSocket-based API. You open a connection to OpenAI, stream audio in, and receive audio out (plus tool-call instructions when the model needs to invoke a tool). A single model handles speech recognition, reasoning, and speech generation, rather than three chained services.
The protocol is event-based:
- `session.update`: configure the session (voice, language, tools, system prompt).
- `input_audio_buffer.append`: stream audio in.
- `response.audio.delta`: receive audio chunks back.
- `response.function_call_arguments.delta`: tool call requests stream incrementally.
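To make the client-side events concrete, here is a minimal sketch of how the two outbound messages might be built before they are written to the WebSocket. The payload field names (`session`, `instructions`, `voice`, `tools`, `audio`) reflect my reading of the event shapes and should be checked against the current API reference; the `lookup_order` tool and the helper function names are hypothetical.

```python
import base64
import json


def session_update(instructions: str, voice: str, tools: list) -> str:
    """Build a session.update event as a JSON string (field names are assumptions)."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions, "voice": voice, "tools": tools},
    })


def audio_append(audio_chunk: bytes) -> str:
    """Wrap a raw audio chunk in an input_audio_buffer.append event (base64-encoded)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    })


# Hypothetical tool definition, for illustration only.
lookup_order = {
    "type": "function",
    "name": "lookup_order",
    "description": "Fetch an order by ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

first_message = session_update("You are a support agent.", "alloy", [lookup_order])
audio_message = audio_append(b"\x00\x01" * 160)  # placeholder bytes, not real audio
```

Each string would be sent as one WebSocket text frame. The sketch deliberately stops short of opening a connection, since authentication and endpoint details vary by account and API version.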
Sub-300ms first-token latency is achievable on US/EU regions. The model handles barge-in, partial utterances, and code-switching natively.
What it solves vs what it doesn't
[Figure: Where an AI agent sits in the support stack. The AI agent (orchestrator) sits between the helpdesk and the knowledge base, identity and auth, transactional systems, CRM, and observability layers, making the rest of the stack invisible to the customer.]
The Realtime API solves the model-layer concerns of building a voice agent. Specifically:
- STT + LLM + TTS as one pipe — no stitching, no inter-service latency.
- Tool calling at the model level — the model knows how to call your tools while staying in audio mode.
- Barge-in and overlap handling — built into the model, no custom voice-activity detection needed.
- Multilingual — the model handles language detection and switching natively.
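Because tool-call arguments arrive as streamed fragments, the client has to reassemble them before it can run anything. The sketch below shows one way that might look, assuming each delta event carries a `call_id` and a `delta` string and that a matching `response.function_call_arguments.done` event closes the call. The `ToolCallAssembler` class and the sample events are hypothetical.

```python
import json
from collections import defaultdict
from typing import Optional


class ToolCallAssembler:
    """Collects streamed argument fragments per call and parses them when complete."""

    def __init__(self) -> None:
        self._fragments = defaultdict(list)  # call_id -> list of JSON text fragments

    def handle(self, event: dict) -> Optional[dict]:
        """Return parsed arguments once a call is complete, otherwise None."""
        kind = event.get("type")
        if kind == "response.function_call_arguments.delta":
            self._fragments[event["call_id"]].append(event["delta"])
        elif kind == "response.function_call_arguments.done":
            # Prefer what was streamed; fall back to a full payload if one is present.
            raw = "".join(self._fragments.pop(event["call_id"], []))
            return json.loads(raw or event.get("arguments", "{}"))
        return None


# Simulated server events (assumed shapes), split mid-key to mimic streaming.
events = [
    {"type": "response.function_call_arguments.delta", "call_id": "call_1", "delta": '{"order_'},
    {"type": "response.function_call_arguments.delta", "call_id": "call_1", "delta": 'id": "A17"}'},
    {"type": "response.function_call_arguments.done", "call_id": "call_1"},
]

assembler = ToolCallAssembler()
results = [assembler.handle(event) for event in events]
```

Only the final event yields a parsed dictionary. Keying fragments by `call_id` keeps concurrent tool calls from interleaving.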
What it does NOT solve:
- Carrier integration. OpenAI Realtime speaks WebSocket; phones speak SIP. You need a bridge: Twilio Media Streams, Pipecat, or your own. This bridge is non-trivial production code.
- Integrations. No native CRM, helpdesk, calendar, or billing integrations. You build each one.
- Knowledge base. No retrieval layer, no document syncing. You bring your own.
- Observability. No transcript export, no reasoning traces, no outcome tags out of the box. You build the logging.
- Compliance. SOC 2, HIPAA, PCI — your stack, your responsibility.
- Warm transfers. The model can decide to transfer; you build the SIP REFER plumbing.
- Operational tooling. Agent versioning, A/B traffic splitting, prompt rollouts — your build.
The model is one of seven layers in a production voice agent. Realtime is the fastest, cleanest version of that one layer. The other six layers haven't gone away.
A practical architecture
If you're building on Realtime, the production stack typically looks like:
Carrier (Twilio / Vonage / your trunk)
↓ SIP
Bridge service (Twilio Media Streams / Pipecat / custom)
↓ WebSocket
OpenAI Realtime API (model layer)
↓ tool calls
Your tool layer (CRM, calendar, helpdesk, billing connectors)
↓ side effects
Observability (transcript, trace, outcome logging)
Five layers in this diagram, and everything outside the carrier and the model is code you write or operate. The model layer is the easy part now; the other layers are the project.
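The bridge step in this stack is mostly message translation. As a rough sketch, assuming a Twilio Media Streams connection on one side and a Realtime session configured for G.711 mu-law audio on the other (so the base64 payload can pass through without transcoding), the mapping might look like this. The frame shapes are my understanding of both protocols and should be verified; the function names are hypothetical.

```python
import json
from typing import Optional


def twilio_to_realtime(twilio_message: str) -> Optional[str]:
    """Map a Twilio 'media' frame to an input_audio_buffer.append event."""
    frame = json.loads(twilio_message)
    if frame.get("event") != "media":
        return None  # non-media frames (e.g. start, stop) carry no audio
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": frame["media"]["payload"],  # already base64; passed through unchanged
    })


def realtime_to_twilio(realtime_message: str, stream_sid: str) -> Optional[str]:
    """Map a response.audio.delta event to a Twilio outbound 'media' frame."""
    event = json.loads(realtime_message)
    if event.get("type") != "response.audio.delta":
        return None  # other event types are handled elsewhere
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": event["delta"]},
    })


inbound = twilio_to_realtime(json.dumps(
    {"event": "media", "streamSid": "MZ123", "media": {"payload": "AAAA"}}
))
outbound = realtime_to_twilio(json.dumps(
    {"type": "response.audio.delta", "delta": "BBBB"}
), "MZ123")
```

A production bridge also needs reconnection, backpressure, barge-in handling (clearing queued audio when the caller interrupts), and call teardown, which is where most of the work sits.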
Latency in practice
Median first-token latency on Realtime is sub-300ms in US/EU regions. This is meaningfully lower than the stitched STT→LLM→TTS pattern, which typically lands at 600-900ms median.
For voice perception, the gap matters. Sub-300ms feels conversational; 600-900ms feels robotic. The Realtime improvement is real and noticeable to customers.
That said: the carrier-side path adds latency (typically 50-100ms each way for SIP-WebSocket bridging), and the tool-call layer adds latency when the agent needs to look something up before responding. End-to-end production latency typically lands at 400-600ms even with Realtime — better than the stitched pipeline, not magic.
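The end-to-end range above can be reproduced with a simple latency budget. This is a sketch using the figures quoted in this section; the 100ms tool lookup in the second case is an illustrative assumption, not a measured value.

```python
def end_to_end_ms(model_ms: float, bridge_one_way_ms: float, tool_ms: float = 0.0) -> float:
    """Caller-perceived latency: model time, two bridge hops, and any tool lookup."""
    return model_ms + 2 * bridge_one_way_ms + tool_ms


# Fast bridge, no tool call.
best_case = end_to_end_ms(model_ms=300, bridge_one_way_ms=50)

# Slow bridge plus an assumed 100ms tool lookup.
slow_case = end_to_end_ms(model_ms=300, bridge_one_way_ms=100, tool_ms=100)
```

With these inputs the budget runs from 400ms to 600ms, matching the production range quoted above. Slower tool backends push the upper bound higher.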
Cost in practice
Pricing as of mid-2026 (verify current rates on the OpenAI pricing page):
- Audio input: ~$0.06 per minute.
- Audio output: ~$0.24 per minute.
- Tool-call text: standard GPT-4o text token pricing.
For a 2-3 minute call:
- 2.5 min × $0.06 (input) = $0.15
- 1.5 min × $0.24 (output, assuming 60% talk-time on the AI side) = $0.36
- Tool-call tokens: ~$0.05
- Total Realtime cost: ~$0.55-0.60 per call
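The per-call arithmetic above can be written as a small function so the rates and talk-time share are easy to swap out. The defaults are the approximate rates quoted in this section, not authoritative pricing, and the function name is hypothetical.

```python
def realtime_cost_per_call(
    call_minutes: float,
    ai_talk_share: float,
    input_rate: float = 0.06,       # USD per minute of audio input (approximate)
    output_rate: float = 0.24,      # USD per minute of audio output (approximate)
    tool_tokens_cost: float = 0.05, # USD per call for tool-call text tokens (estimate)
) -> float:
    """Model-layer cost of one call, billing input for the full call duration."""
    input_cost = call_minutes * input_rate
    output_cost = call_minutes * ai_talk_share * output_rate
    return input_cost + output_cost + tool_tokens_cost


cost = realtime_cost_per_call(call_minutes=2.5, ai_talk_share=0.6)
print(f"${cost:.2f} per call")
```

For a 2.5-minute call with 60% AI talk-time this comes to about $0.56, inside the range quoted above. Output audio dominates, so talk-time share is the input worth measuring on real traffic.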
Add carrier minutes (~$0.05-0.15 for the call), bridge service costs, integration API calls, and your engineering amortization. Real all-in cost on a Realtime build is typically $0.80-$1.20 per call at meaningful volume.
For comparison, productized vendors like Open.cx ship at $0.70 per resolved conversation all-in. The headline math doesn't favor a build.
Build vs buy: the honest framing
Build on OpenAI Realtime when:
- Voice AI is your product. You're shipping a voice AI startup; the model layer is core to your business.
- You have a deeply unusual use case that no productized vendor fits.
- You have 4-8 weeks of engineering bandwidth to ship the surrounding layers.
- You want fine-grained control of the voice agent experience and are willing to maintain it forever.
Buy a productized agent when:
- Your business is the use case (customer service, receptionist, sales outbound) and voice AI is the means.
- You want to ship in days, not months.
- You don't want to operate the SIP bridge, the integration layer, or the compliance posture.
- The vendor's per-resolution price is competitive with your blended cost of building.
For most customer service buyers, the buy decision is the right one. Open.cx, PolyAI, and Sierra all ship the framework, integrations, observability, and compliance pre-built. Building the equivalent on Realtime takes 4-8 weeks minimum and leaves you maintaining the stack.
When Realtime is the right primitive
Three buyer profiles where building on Realtime makes sense:
1. Voice AI startups. The model layer is your business. You need control over how the voice agent works at the lowest level. Realtime + Pipecat + your own infrastructure = your platform.
2. Tier-1 enterprise with bespoke requirements. You want a voice agent fundamentally tuned to your brand and use case, and you have an engineering team capable of building and operating it. Sierra is the managed-service option; Realtime + custom build is the in-house option.
3. Research and prototyping. You're exploring what's possible with voice AI. The Realtime API has the cleanest dev loop of any voice primitive in 2026.
For everyone else — and that's most teams — productized vendors are the right answer.
Where Open.cx fits
Open.cx does not treat OpenAI Realtime as its only primitive. We use Realtime where it's the best fit and other primitives (combined STT+LLM+TTS, ElevenLabs Conversational AI, Anthropic streaming) where they fit better. The point is that the buyer doesn't have to make this choice.
Open ships:
- The carrier layer — 37+ first-class SIP integrations.
- The tool layer — 50+ pre-built integrations.
- The observability layer — recording, transcripts, reasoning traces, outcome tags.
- The compliance layer — SOC 2 Type II, HIPAA-ready, PCI-ready.
- The operational layer — agent versioning, A/B testing, prompt management.
- The model layer — Realtime (and others) under the hood, abstracted from you.
Per-resolution pricing at $0.70 covers the whole stack. A build on Realtime needs to absorb model cost, carrier, bridge, integrations, ops, and amortized engineering: typically $0.80-$1.20 per call at production scale, plus 4-8 weeks of build time, plus ongoing operations.
When to pick what
| Scenario | Pick |
|---|---|
| Voice AI startup, model is core | OpenAI Realtime + Pipecat / custom build |
| Customer service, 1-7 day deploy | Open.cx |
| Tier-1 brand wanting managed build | Sierra |
| Research / prototyping | OpenAI Realtime alone (no carrier needed initially) |
| Multilingual hospitality at scale | PolyAI (managed) or Open.cx |
| SMB receptionist (10-100 calls/day) | Open.cx, Goodcall, Synthflow |
| Large enterprise CCaaS-led | Decagon |