Which LLM for real-time applications in June 2026?
Gemini 3.5 Flash leads at 216 tok/s for sub-second responses. GPT-5.4 and GLM 5.2 are alternatives when quality or cost matter more than peak speed.
For real-time applications, Gemini 3.5 Flash (Google) is the default choice. It generates 216 tokens per second, the fastest in the field by a wide margin, at $3.38 per million input tokens and a quality index of 50.2. Use GPT-5.4 (OpenAI) when response quality matters more than peak speed, or GLM 5.2 (Z AI) when you need an open-source deployment.
What speed numbers mean operationally
Latency in real-time apps comes from two places: time to first token and sustained generation rate. At 200+ tok/s, a 300-token reply finishes in 1.5 seconds or less, fast enough for voice agents, chat overlays, and live autocomplete. At 150 tok/s, the same reply takes 2 seconds and you see a perceptible pause. Below 100 tok/s, it takes more than 3 seconds and the interaction starts to feel broken for users expecting instant feedback.
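The two latency components add up in a simple way. Here is a minimal sketch, assuming a fixed time to first token. The `ttft_s` value used below is hypothetical, since the figures in this article cover generation rate only, and the function name is illustrative.

```python
def reply_latency_s(output_tokens: int, tok_per_s: float, ttft_s: float = 0.0) -> float:
    """Estimated wall-clock time for a full reply:
    time to first token plus time to generate the remaining output."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_s + output_tokens / tok_per_s


# Generation time alone for a 300-token reply at several rates.
for rate in (216, 200, 150, 100):
    print(f"{rate} tok/s: {reply_latency_s(300, rate):.2f} s")

# Hypothetical 0.3 s time to first token added on top (an assumed value).
print(f"216 tok/s + 0.3 s TTFT: {reply_latency_s(300, 216, ttft_s=0.3):.2f} s")
```

For short replies, time to first token can dominate the total, so it is worth measuring both numbers for your own provider and region.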
Speed also shapes iteration loops. With a 216 tok/s model, a developer waiting on a 200-token response waits about 0.9 seconds for the full output. At 61 tok/s, the same response takes 3.3 seconds. Over a debugging session of 50 prompts, that difference adds up to just under two minutes of waiting.
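The session arithmetic can be checked directly. This is a rough sketch that counts generation time only and assumes every response is exactly 200 tokens; the function name is illustrative.

```python
def session_wait_s(prompts: int, output_tokens: int, tok_per_s: float) -> float:
    """Total generation time across a session, ignoring network
    overhead and time to first token."""
    return prompts * output_tokens / tok_per_s


fast = session_wait_s(50, 200, 216)  # 50 responses at 216 tok/s
slow = session_wait_s(50, 200, 61)   # the same 50 responses at 61 tok/s
print(f"fast: {fast:.0f} s, slow: {slow:.0f} s, extra wait: {slow - fast:.0f} s")
```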
The speed leaders
The top of the speed table is dominated by Google's Flash line:
| Model | Speed | Quality | Price/1M |
|---|---|---|---|
| Gemini 3.5 Flash (Google) | 216 tok/s | 50.2 | $3.38 |
| Gemini 3.5 Flash medium (Google) | 215 tok/s | 45.4 | $3.38 |
| GPT-5.4 (OpenAI) | 157 tok/s | 51.4 | $5.63 |
| Gemini 3.1 Pro Preview (Google) | 135 tok/s | 46.5 | $4.50 |
| GLM 5.2 (Z AI) | 98 tok/s | 51.1 | $1.92 |
Gemini 3.5 Flash is roughly 38% faster than GPT-5.4 and 60% faster than Gemini 3.1 Pro Preview. The medium variant of Flash gives up 4.8 quality points and is 1 tok/s slower at the same price. Not worth it.
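The relative-speed figures come straight from the table. A small sketch for recomputing them when the numbers change (the helper name is illustrative):

```python
# Speeds in tokens per second, copied from the table above.
SPEED_TOK_S = {
    "Gemini 3.5 Flash": 216,
    "Gemini 3.5 Flash medium": 215,
    "GPT-5.4": 157,
    "Gemini 3.1 Pro Preview": 135,
    "GLM 5.2": 98,
}


def pct_faster(model: str, baseline: str) -> float:
    """How much faster `model` generates than `baseline`, in percent."""
    return (SPEED_TOK_S[model] / SPEED_TOK_S[baseline] - 1) * 100


for other in ("GPT-5.4", "Gemini 3.1 Pro Preview", "GLM 5.2"):
    print(f"Flash vs {other}: {pct_faster('Gemini 3.5 Flash', other):.1f}% faster")
```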
Quality versus speed
Real-time apps don't always need the highest quality. A chatbot confirming an order can run on a 45 quality model. A coding assistant explaining a bug cannot.
GPT-5.4 at 157 tok/s posts a quality of 51.4, the highest in the speed table above. It costs 67% more than Gemini 3.5 Flash, but the quality delta of 1.2 points and the stronger reasoning may justify it for customer-facing assistants where wrong answers create support tickets.
GLM 5.2 sits at 98 tok/s with quality 51.1, matching GPT-5.4 within noise. It is the only open-source model in the top tier by quality, and at $1.92/M it costs 43% less than Gemini 3.5 Flash. The trade-off is speed: GLM 5.2 is less than half as fast.
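To see what the price gap means in practice, here is a rough sketch that prices a hypothetical monthly volume at the listed per-million rates. The 500M-token volume is an assumption for illustration, and the sketch treats the listed price as a flat rate across all tokens, which real billing may not match.

```python
# Prices in USD per million tokens, copied from the speed table.
PRICE_PER_M = {
    "Gemini 3.5 Flash": 3.38,
    "GPT-5.4": 5.63,
    "GLM 5.2": 1.92,
}


def monthly_cost_usd(model: str, million_tokens: float) -> float:
    """Flat-rate cost estimate for a given token volume (in millions)."""
    return PRICE_PER_M[model] * million_tokens


def pct_vs_flash(model: str) -> float:
    """Price difference relative to Gemini 3.5 Flash, in percent."""
    return (PRICE_PER_M[model] / PRICE_PER_M["Gemini 3.5 Flash"] - 1) * 100


VOLUME_M = 500  # hypothetical monthly volume, in millions of tokens
for model in PRICE_PER_M:
    cost = monthly_cost_usd(model, VOLUME_M)
    print(f"{model}: ${cost:,.0f}/month ({pct_vs_flash(model):+.0f}% vs Flash)")
```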
When Gemini 3.5 Flash breaks
The 50.2 quality index is solid for most real-time tasks, but it falls short on math, complex reasoning, and nuanced instruction following. For voice agents handling structured intake or chat systems doing entity extraction, it is enough. For agents that need to reason over multiple documents or solve multi-step problems, quality drops become user-visible errors.
OpenAI's GPT-5.5 (54.8 quality) runs at 65 tok/s, more than 3x slower than Flash. For a 200-token reply, that's about 3 seconds of generation. Acceptable for analytical chat, painful for autocomplete.
Decision table
| Scenario | Recommended model | Why |
|---|---|---|
| Voice agent, sub-1s replies needed | Gemini 3.5 Flash | 216 tok/s, sufficient quality for scripted flows |
| Live coding autocomplete | Gemini 3.5 Flash | Latency dominates; 50 quality handles inline completions |
| Customer-facing chat where errors cost money | GPT-5.4 | 51.4 quality at 157 tok/s balances speed and correctness |
| Self-hosted real-time pipeline | GLM 5.2 | Open source, 51.1 quality, no API dependency |
| High-volume batch summarization where just under 100 tok/s is fine | Qwen3.7 Max | $1.88/M open source, 96 tok/s, 46 quality |
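The decision table can be approximated as a filter over latency budget, quality floor, and licensing. This is a minimal sketch, not the logic behind any real selector tool: the `Model` class and `pick_model` function are hypothetical, the figures are copied from the tables in this article, and choosing the cheapest qualifying model is one possible tie-break among several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    tok_per_s: float
    quality: float
    price_per_m: float
    open_source: bool


# Figures from the tables above. The open_source flags follow the article's
# statement that open-source models currently top out below 100 tok/s.
MODELS = [
    Model("Gemini 3.5 Flash", 216, 50.2, 3.38, False),
    Model("GPT-5.4", 157, 51.4, 5.63, False),
    Model("Gemini 3.1 Pro Preview", 135, 46.5, 4.50, False),
    Model("GLM 5.2", 98, 51.1, 1.92, True),
    Model("Qwen3.7 Max", 96, 46.0, 1.88, True),
]


def pick_model(reply_tokens: int, budget_s: float, min_quality: float,
               open_source_only: bool = False) -> Optional[Model]:
    """Return the cheapest model whose generation time fits the budget and
    whose quality meets the floor, or None if nothing qualifies."""
    candidates = [
        m for m in MODELS
        if reply_tokens / m.tok_per_s <= budget_s
        and m.quality >= min_quality
        and (m.open_source or not open_source_only)
    ]
    return min(candidates, key=lambda m: m.price_per_m, default=None)


print(pick_model(200, budget_s=1.0, min_quality=50))
print(pick_model(200, budget_s=2.5, min_quality=51, open_source_only=True))
```

The budget here covers generation time only. Subtract your measured time to first token from the budget before calling the function.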
Real-time on a budget
If you cannot use the Google API and need both open source and speed, GLM 5.2 is the choice. If even 98 tok/s is too slow, the only path is paying for Gemini 3.5 Flash or GPT-5.4 through their hosted APIs. The open-source segment currently tops out below 100 tok/s for models with usable quality.
Recommendation
Start with Gemini 3.5 Flash for any latency-sensitive workload. It is the fastest model available and costs less than $3.50 per million tokens. Measure your error rate in production. If it stays below 2%, Flash is enough. If wrong answers are creating support load, switch to GPT-5.4 and accept 27% lower throughput (about 38% longer generation time) for the 1.2-point quality gain. For self-hosted deployments, GLM 5.2 is the only realistic option until the open-source ecosystem ships a faster model.
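The switching rule can be written down as a small check. This is a sketch under two assumptions: "error" means whatever your team counts as a wrong answer, and any rate at or above 2% is treated as the trigger to switch, which simplifies the guidance above. The function names are illustrative.

```python
def error_rate(wrong_answers: int, total_answers: int) -> float:
    """Fraction of production answers judged wrong."""
    if total_answers <= 0:
        raise ValueError("total_answers must be positive")
    return wrong_answers / total_answers


def recommend(wrong_answers: int, total_answers: int, threshold: float = 0.02) -> str:
    """Stay on Flash while the error rate is below the threshold,
    otherwise move to GPT-5.4."""
    if error_rate(wrong_answers, total_answers) < threshold:
        return "Gemini 3.5 Flash"
    return "GPT-5.4"


# What the switch costs in speed, from the table figures (tok/s).
FLASH, GPT = 216, 157
throughput_drop_pct = (1 - GPT / FLASH) * 100  # fewer tokens per second
time_increase_pct = (FLASH / GPT - 1) * 100    # longer wait for the same reply

print(recommend(15, 1000))
print(f"throughput: -{throughput_drop_pct:.0f}%, generation time: +{time_increase_pct:.0f}%")
```

Throughput and generation time move by different percentages because one is the inverse of the other, so it helps to state which one a latency budget refers to.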
Explore all real-time-ready models or use the LLM Selector to filter by your latency budget.