Which LLM for real-time applications in June 2026?

Gemini 3.5 Flash leads at 216 tok/s for sub-second responses. GPT-5.4 and GLM 5.2 are alternatives when quality or cost matter more than peak speed.

FindLLM, June 20, 2026

Tags: low-latency, real-time, speed, inference

For real-time applications, Gemini 3.5 Flash (Google) is the default choice. It generates 216 tokens per second, the fastest in the field by a wide margin, at $3.38 per million input tokens and a quality index of 50.2. Use GPT-5.4 (OpenAI) when response quality matters more than peak speed, or GLM 5.2 (Z AI) when you need an open-source deployment.

What speed numbers mean operationally

Latency in real-time apps comes from two places: time to first token, and sustained generation rate. At 200+ tok/s, a 300-token reply finishes in under 1.5 seconds, fast enough for voice agents, chat overlays, and live autocomplete. At 150 tok/s, you see a perceptible pause. Below 100 tok/s, the interaction starts to feel broken for users expecting instant feedback.
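To separate those two latency sources in your own measurements, a minimal sketch like the one below can time any token stream. The helper name `measure_stream` and its injectable `clock` parameter are illustrative assumptions, not part of any provider SDK. The sketch assumes your client exposes the response as an iterable that yields tokens as they arrive.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (time_to_first_token_s, sustained_tokens_per_s) for a stream.

    Hypothetical helper: `tokens` is any iterable yielding tokens as they
    arrive; `clock` is injectable so the function can be tested offline.
    """
    start = clock()
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        last_at = clock()
        if first_at is None:
            first_at = last_at
        count += 1
    if first_at is None:
        return None, None  # empty stream: nothing to measure
    ttft = first_at - start
    # Sustained rate covers only the tokens that arrived after the first one.
    span = last_at - first_at
    rate = (count - 1) / span if span > 0 else None
    return ttft, rate
```

Reporting the two numbers separately matters because a model with high sustained throughput can still feel slow if its first token arrives late.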

Speed also shapes iteration loops. At 216 tok/s, a developer waiting on a 200-token response gets the full reply in about 0.9 seconds. At 61 tok/s, the same response takes 3.3 seconds. Over a debugging session of 50 prompts, that difference adds up to nearly two minutes of waiting.
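The arithmetic behind these waits is output tokens divided by throughput. The short sketch below reproduces it, assuming generation time only (time to first token and network overhead are ignored); the function name is illustrative.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate a reply, ignoring time to first token and network."""
    return output_tokens / tokens_per_second


fast = generation_seconds(200, 216)  # the 216 tok/s case
slow = generation_seconds(200, 61)   # the 61 tok/s case
session_extra = 50 * (slow - fast)   # extra waiting across 50 prompts

print(f"{fast:.2f}s vs {slow:.2f}s, {session_extra:.0f}s extra per session")
# prints: 0.93s vs 3.28s, 118s extra per session
```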

[Chart: Output speed]

The speed leaders

The top of the speed table is dominated by Google's Flash line:

Model	Speed	Quality index	Price per 1M input tokens
Gemini 3.5 Flash (Google)	216 tok/s	50.2	$3.38
Gemini 3.5 Flash medium (Google)	215 tok/s	45.4	$3.38
GPT-5.4 (OpenAI)	157 tok/s	51.4	$5.63
Gemini 3.1 Pro Preview (Google)	135 tok/s	46.5	$4.50
GLM 5.2 (Z AI)	98 tok/s	51.1	$1.92

Gemini 3.5 Flash is roughly 38% faster than GPT-5.4 and 60% faster than Gemini 3.1 Pro Preview. The medium variant of Flash gives up 4.8 quality points, runs 1 tok/s slower, and costs the same. Not worth it.
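These percentages are simple throughput ratios. As a quick check, the sketch below recomputes them from the table's speed column; the dictionary and function names are illustrative.

```python
# Speeds in tok/s, copied from the table above.
SPEEDS_TOK_S = {
    "Gemini 3.5 Flash": 216,
    "Gemini 3.5 Flash medium": 215,
    "GPT-5.4": 157,
    "Gemini 3.1 Pro Preview": 135,
    "GLM 5.2": 98,
}


def percent_faster(fast: float, slow: float) -> float:
    """How much higher `fast` throughput is than `slow`, in percent."""
    return (fast / slow - 1) * 100


leader = SPEEDS_TOK_S["Gemini 3.5 Flash"]
for name, speed in SPEEDS_TOK_S.items():
    if name != "Gemini 3.5 Flash":
        print(f"Flash is {percent_faster(leader, speed):.1f}% faster than {name}")
```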

Quality versus speed

Real-time apps don't always need the highest quality. A chatbot confirming an order can run on a 45 quality model. A coding assistant explaining a bug cannot.

GPT-5.4 at 157 tok/s posts a quality of 51.4, the highest of the models in the speed table above. It costs 67% more than Gemini 3.5 Flash, but the quality delta of 1.2 points and the stronger reasoning may justify it for customer-facing assistants where wrong answers create support tickets.

GLM 5.2 sits at 98 tok/s with quality 51.1, matching GPT-5.4 within noise. It is the only open-source model in the top tier by quality, and at $1.92/M it costs 43% less than Gemini 3.5 Flash. The trade-off is speed: GLM 5.2 is less than half as fast.
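To see what these price gaps mean for a budget, the sketch below turns the per-million input prices into a monthly input bill. The 500M-token monthly volume is a hypothetical figure chosen for illustration, and output-token pricing is not covered because the article does not list it.

```python
# Prices in USD per 1M input tokens, copied from the speed table.
PRICE_PER_M_INPUT_USD = {
    "Gemini 3.5 Flash": 3.38,
    "GPT-5.4": 5.63,
    "GLM 5.2": 1.92,
}


def input_cost_usd(input_tokens: int, price_per_million: float) -> float:
    """Input-token cost in USD for a given volume and per-million price."""
    return input_tokens / 1_000_000 * price_per_million


monthly_tokens = 500_000_000  # hypothetical monthly input volume
for name, price in PRICE_PER_M_INPUT_USD.items():
    print(f"{name}: ${input_cost_usd(monthly_tokens, price):,.2f} per month")
```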

[Chart: Quality comparison]

When Gemini 3.5 Flash breaks

The 50.2 quality index is solid for most real-time tasks, but it falls short on math, complex reasoning, and nuanced instruction following. For voice agents handling structured intake or chat systems doing entity extraction, it is enough. For agents that need to reason over multiple documents or solve multi-step problems, quality drops become user-visible errors.

OpenAI's GPT-5.5 (54.8 quality) runs at 65 tok/s, more than 3x slower than Flash. For a 200-token reply, that's about 3 seconds of generation. Acceptable for analytical chat, painful for autocomplete.

Decision table

Scenario	Recommended model	Why
Voice agent, sub-1s replies needed	Gemini 3.5 Flash	216 tok/s, sufficient quality for scripted flows
Live coding autocomplete	Gemini 3.5 Flash	Latency dominates; 50 quality handles inline completions
Customer-facing chat where errors cost money	GPT-5.4	51.4 quality at 157 tok/s balances speed and correctness
Self-hosted real-time pipeline	GLM 5.2	Open source, 51.1 quality, no API dependency
High-volume batch summarization where roughly 100 tok/s is fine	Qwen3.7 Max	Open source at $1.88/M, 96 tok/s, 46 quality
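Most rows of this table follow one rule: pick the fastest model that clears your quality floor and deployment constraints. The sketch below is a rough approximation of that logic, not the method used to build the table. The `Model` class, the `fastest_model` function, and its parameters are illustrative assumptions; the figures come from the article, and GPT-5.5 and the Flash medium variant are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    tok_per_s: float
    quality: float
    price_per_m: float  # USD per 1M input tokens
    open_source: bool


MODELS = [
    Model("Gemini 3.5 Flash", 216, 50.2, 3.38, False),
    Model("GPT-5.4", 157, 51.4, 5.63, False),
    Model("Gemini 3.1 Pro Preview", 135, 46.5, 4.50, False),
    Model("GLM 5.2", 98, 51.1, 1.92, True),
    Model("Qwen3.7 Max", 96, 46.0, 1.88, True),
]


def fastest_model(
    min_quality: float = 0.0,
    open_source_only: bool = False,
    max_price: Optional[float] = None,
) -> Optional[Model]:
    """Return the fastest model meeting every constraint, or None."""
    candidates = [
        m for m in MODELS
        if m.quality >= min_quality
        and (m.open_source or not open_source_only)
        and (max_price is None or m.price_per_m <= max_price)
    ]
    return max(candidates, key=lambda m: m.tok_per_s, default=None)


print(fastest_model().name)                       # Gemini 3.5 Flash
print(fastest_model(min_quality=51.0).name)       # GPT-5.4
print(fastest_model(open_source_only=True).name)  # GLM 5.2
```

The batch-summarization row is the exception: there the deciding factor is price rather than speed, so a cost-first sort would be the better fit.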

Real-time on a budget

If you cannot use the Google API and need both open source and speed, GLM 5.2 is the choice. If even 98 tok/s is too slow, the only path is paying for Gemini 3.5 Flash or GPT-5.4 through their hosted APIs. The open-source segment currently tops out below 100 tok/s for models with usable quality.

Recommendation

Start with Gemini 3.5 Flash for any latency-sensitive workload. It is the fastest model available and costs less than $3.50 per million input tokens. Measure your error rate in production. If it stays below 2%, Flash is enough. If wrong answers are creating support load, switch to GPT-5.4 and accept roughly 27% lower throughput (about 38% longer generation time) for the 1.2-point quality gain. For self-hosted deployments, GLM 5.2 is the only realistic option until the open-source ecosystem ships a faster model.
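The switching rule can be written down as a small check. This is a sketch under stated assumptions: the function names are hypothetical, the 2% threshold is the one given above, and how you count an "error" in production is left to your own logging. The second helper converts a throughput drop into the corresponding increase in generation time, since the two percentages are not the same number.

```python
def choose_hosted_model(errors: int, requests: int, threshold: float = 0.02) -> str:
    """Stay on Flash while the observed error rate is below the threshold."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    error_rate = errors / requests
    return "Gemini 3.5 Flash" if error_rate < threshold else "GPT-5.4"


def time_increase_from_throughput_drop(drop: float) -> float:
    """Fractional increase in generation time for a fractional throughput drop."""
    if not 0 <= drop < 1:
        raise ValueError("drop must be in [0, 1)")
    return 1 / (1 - drop) - 1


drop = 1 - 157 / 216  # throughput lost when moving from Flash to GPT-5.4
print(f"throughput drop: {drop:.0%}")
print(f"generation time increase: {time_increase_from_throughput_drop(drop):.0%}")
print(choose_hosted_model(errors=12, requests=1000))  # 1.2% error rate
```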

Explore all real-time-ready models or use the LLM Selector to filter by your latency budget.