Which LLM for low-latency real-time applications in June 2026?
A prescriptive guide to choosing LLMs for real-time workloads where inference latency and tokens per second dominate the user experience.
For real-time applications where users wait on every token, use Gemini 3.5 Flash (Google). It outputs 212 tokens per second at $3.38/1M tokens with a quality index of 55.3, the fastest model in this dataset by a clear margin. If you need open weights or a lower price floor, Qwen3.7 Max (Alibaba) runs at 166 tok/s for $1.88/1M and scores 56.6 on quality.
The decision in latency-bound workloads is rarely about peak quality. It is about how fast the first token arrives and how fast the rest stream. The frontier models, Claude Opus 4.8 (Anthropic) at 58 tok/s and GPT-5.5 (OpenAI) at 64 tok/s, are too slow for voice agents, autocomplete, or live transcription overlays. You pay more and wait longer for marginal quality.
What "real-time" actually demands
Tokens per second sets the ceiling on perceived responsiveness once generation starts. A voice agent generating a 150-token reply at 58 tok/s takes about 2.6 seconds to finish speaking. At 212 tok/s the same reply lands in 0.7 seconds. That gap is the difference between a usable conversation and one users abandon.
For interactive UI (search-as-you-type, inline suggestions, chat with streaming), I weight speed heavily and treat quality as a floor, not a target. A model that scores 55 and streams at 212 tok/s will beat a model scoring 61 at 58 tok/s in nearly every real-time product.
Top three picks for latency-bound workloads
| Model | Quality | Price/1M | Speed | Open weights |
|---|---|---|---|---|
| Gemini 3.5 Flash | 55.3 | $3.38 | 212 tok/s | No |
| Qwen3.7 Max | 56.6 | $1.88 | 166 tok/s | Yes |
| Gemini 3.1 Pro Preview | 57.2 | $4.50 | 126 tok/s | No |
Gemini 3.5 Flash wins on raw throughput. It is the model I reach for first when streaming latency is the primary constraint and the workload tolerates a quality index in the mid-50s.
Qwen3.7 Max is the value play. At $1.88/1M it is roughly half the price of Gemini 3.5 Flash, scores marginally higher (56.6 versus 55.3), and gives up about 22% of throughput. For high-volume batch-plus-interactive pipelines where cost compounds, that trade favors Qwen. Open weights also mean you can self-host and remove network round-trips entirely.
Gemini 3.1 Pro Preview is the option when quality matters more but you still refuse to drop below 100 tok/s. At 126 tok/s and a 57.2 quality index, it sits between the fast tier and the slow frontier. It costs more per token than Qwen but gives you the highest quality among models that stay genuinely fast.
Decision table
| Scenario | Use this | Why |
|---|---|---|
| Voice agents, live transcription | Gemini 3.5 Flash | 212 tok/s keeps spoken replies under one second |
| Cost-sensitive interactive product at scale | Qwen3.7 Max | $1.88/1M and 166 tok/s; open weights cut hosting cost |
| Self-hosted, no external API latency | Qwen3.7 Max | Open weights remove network round-trips |
| Higher quality without dropping below 100 tok/s | Gemini 3.1 Pro Preview | 126 tok/s at 57.2 quality |
| Autocomplete / inline suggestions | Gemini 3.5 Flash | Throughput dominates short-completion UX |
The trade-offs I would not ignore
Speed numbers describe steady-state generation, not time to first token. Network distance to the provider and prompt length both add latency before the first token streams. If your users are far from Google or Alibaba endpoints, a self-hosted Qwen3.7 Max deployment near your traffic can beat a faster model served from across the world.
Quality at the mid-50s is fine for conversational replies, classification, and short structured output. It is not fine for multi-step reasoning or code generation where errors cascade. If your real-time workload includes either, route those requests to a higher-quality model and accept the latency hit on that path only. Splitting traffic by task beats forcing one model to do everything.
Price matters most when retries dominate. A 212 tok/s model that needs frequent reprompting can cost more in practice than a slower, more reliable one. Measure your actual retry rate before committing.
My recommendation
Default to Gemini 3.5 Flash for real-time products where streaming latency is the dominant constraint. Move to Qwen3.7 Max when cost at scale or self-hosting wins out, and you can absorb the throughput drop. Reach for Gemini 3.1 Pro Preview only when you need the extra quality and refuse to go below 100 tok/s.
Compare throughput and price for your own traffic mix in the LLM Selector, or browse the full field on Explore.
Stay in the loop
Weekly LLM analysis delivered to your inbox. No spam.