How to pick the right LLM for a production chatbot
A practical guide to balancing quality, latency, and cost when choosing an LLM for interactive chatbot use cases in production.
The best LLM for a production chatbot is almost never the highest-quality model. For interactive use cases, you need to optimize across three axes simultaneously: response quality, inference latency, and per-token cost. The right choice depends on your traffic volume, your tolerance for slower replies, and how much quality you can trade away before users notice.
Why latency matters more than you think
Users in a chat interface expect a response to begin within ~500 ms and then stream at a pace that feels natural. Once the first token arrives, output speed determines how long the user waits for the complete answer, so it is a hard constraint, not a nice-to-have. A model generating 43 tok/s will feel sluggish on long answers; one generating 237 tok/s will feel nearly instant.
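To make the gap concrete, here is a rough back-of-the-envelope sketch. The 500-token answer length and the `stream_seconds` helper are illustrative assumptions, not measurements; the 0.5 s first-token delay simply reuses the ~500 ms expectation above.

```python
def stream_seconds(answer_tokens: int, tokens_per_second: float,
                   first_token_delay_s: float = 0.5) -> float:
    """Estimate wall-clock time until an answer has finished streaming.

    Assumes a constant output speed and a fixed delay before the first token.
    """
    return first_token_delay_s + answer_tokens / tokens_per_second


# 500 tokens is an assumed length for a "long" chat answer.
for speed in (43, 237):
    print(f"{speed} tok/s: {stream_seconds(500, speed):.1f} s")
# 43 tok/s: 12.1 s
# 237 tok/s: 2.6 s
```

Under these assumptions, the slower model keeps the user waiting roughly ten seconds longer on the same answer.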
GPT-5.4 Mini (OpenAI) is the standout here: 237 tok/s output speed, a quality index of 48.1, and $1.69/M tokens. Compare that to Claude Opus 4.6 (Adaptive Reasoning) (Anthropic), which scores 53.0 on quality but crawls at 51 tok/s and costs $10.00/M tokens. That's a 4.6x speed difference and a 5.9x cost difference for 4.9 points of quality. In most chatbot scenarios, users won't perceive that quality gap, but they will perceive the latency gap.
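If you want to rerun this comparison for a different pair of models, the arithmetic is short. The sketch below hard-codes the figures quoted above; the dictionary layout and variable names are just one way to arrange them.

```python
# Figures quoted in this article (quality index, tok/s, USD per 1M tokens).
mini = {"quality": 48.1, "speed": 237, "price": 1.69}
opus = {"quality": 53.0, "speed": 51, "price": 10.00}

speed_ratio = mini["speed"] / opus["speed"]       # how much faster the cheaper model is
cost_ratio = opus["price"] / mini["price"]        # how much more the premium model costs
quality_gap = opus["quality"] - mini["quality"]   # quality points gained by paying more

print(f"{speed_ratio:.1f}x faster, {cost_ratio:.1f}x cheaper, "
      f"{quality_gap:.1f} quality points behind")
# 4.6x faster, 5.9x cheaper, 4.9 quality points behind
```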
The cost math at scale
A chatbot serving 1M conversations/day at ~1,000 tokens per conversation burns through 1B tokens/day. At that scale, the difference between $0.52/M and $5.63/M is the difference between $520/day and $5,630/day. That's $5,110/day, or roughly $1.87M/year in additional spend.
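The same calculation as a small sketch you can adapt to your own traffic. It assumes a single blended price per million tokens (no separate input and output rates) and a flat 365-day year; `daily_cost_usd` is a hypothetical helper name.

```python
def daily_cost_usd(conversations_per_day: int, tokens_per_conversation: int,
                   price_per_million_tokens: float) -> float:
    """Daily spend in USD, assuming one blended price per 1M tokens."""
    tokens_per_day = conversations_per_day * tokens_per_conversation
    return tokens_per_day / 1_000_000 * price_per_million_tokens


cheap = daily_cost_usd(1_000_000, 1_000, 0.52)     # about $520/day
premium = daily_cost_usd(1_000_000, 1_000, 5.63)   # about $5,630/day
annual_extra = (premium - cheap) * 365

print(f"Extra spend per year: ${annual_extra:,.0f}")
# Extra spend per year: $1,865,150
```

Swap in your own conversation volume and average token count before drawing conclusions; the gap shrinks quickly at lower traffic.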
Here are the models worth considering for high-volume production chatbots:
| Model | Quality | Price/1M tokens | Speed | Best for |
|---|---|---|---|---|
| GPT-5.4 Mini (OpenAI) | 48.1 | $1.69 | 237 tok/s | High-volume, latency-sensitive |
| GLM 5 (Z AI) | 49.8 | $1.11 | 89 tok/s | Budget-first, self-hostable |
| Grok 4.20 Beta (xAI) | 48.5 | $3.00 | 156 tok/s | Balanced speed + quality |
| GPT-5.4 (OpenAI) | 57.2 | $5.63 | 85 tok/s | Premium quality, lower volume |
| Gemini 3.1 Pro Preview (Google) | 57.2 | $4.50 | 117 tok/s | Premium quality, better throughput |
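One way to work with this table is to treat your requirements as hard filters and then sort what survives by price. The sketch below copies the five rows above into a list; the `Model` class, the `shortlist` function, and the example thresholds are illustrative assumptions, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    quality: float      # quality index from the table
    price_per_m: float  # USD per 1M tokens
    speed_tps: int      # output tokens per second


MODELS = [
    Model("GPT-5.4 Mini", 48.1, 1.69, 237),
    Model("GLM 5", 49.8, 1.11, 89),
    Model("Grok 4.20 Beta", 48.5, 3.00, 156),
    Model("GPT-5.4", 57.2, 5.63, 85),
    Model("Gemini 3.1 Pro Preview", 57.2, 4.50, 117),
]


def shortlist(models, min_quality=0.0, max_price=float("inf"), min_speed=0):
    """Keep models that meet every constraint, cheapest first."""
    matches = [
        m for m in models
        if m.quality >= min_quality
        and m.price_per_m <= max_price
        and m.speed_tps >= min_speed
    ]
    return sorted(matches, key=lambda m: m.price_per_m)


# Example: at least 100 tok/s and at most $3.50 per 1M tokens.
for m in shortlist(MODELS, min_speed=100, max_price=3.50):
    print(m.name, m.price_per_m, m.speed_tps)
# GPT-5.4 Mini 1.69 237
# Grok 4.20 Beta 3.0 156
```

Filtering first and ranking second keeps a latency or budget requirement from being traded away by a weighted score.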
When to pay for premium quality
If your chatbot handles complex advisory tasks (medical triage, legal intake, financial planning), the jump from 48 to 57 on the quality index matters. Gemini 3.1 Pro Preview matches GPT-5.4's 57.2 quality score while costing $1.13 less per million tokens and running 38% faster at 117 tok/s. For premium chatbot tiers, Gemini 3.1 Pro Preview is the better deal right now.
For everything else — customer support, FAQ bots, e-commerce assistants — GPT-5.4 Mini's combination of 237 tok/s speed and $1.69/M pricing is hard to beat.
The open-source angle
GLM 5 (Z AI) deserves attention: 49.8 quality at $1.11/M tokens, and it's open source. If you're running on your own infrastructure and want to avoid API vendor lock-in, GLM 5 scores slightly above GPT-5.4 Mini on quality (49.8 vs. 48.1) at 66% of the listed per-token price, though your actual self-hosting cost will depend on your hardware and utilization. The 89 tok/s speed is adequate for most chat interfaces, though noticeably slower on long-form responses.
My recommendation
Start with GPT-5.4 Mini for general-purpose chatbots. Route complex queries to Gemini 3.1 Pro Preview. If cost dominates your decision, evaluate GLM 5 on your own infrastructure. Don't default to the top-of-leaderboard model — your users care about snappy responses more than marginal quality gains.
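A minimal sketch of that routing idea follows. Everything in it is an assumption for illustration: the keyword list, the word-count threshold, and the `pick_model` function are hypothetical, and the model names are plain labels rather than real API identifiers. A production router would more likely use a classifier or explicit user tiering.

```python
DEFAULT_MODEL = "GPT-5.4 Mini"
PREMIUM_MODEL = "Gemini 3.1 Pro Preview"
BUDGET_MODEL = "GLM 5"

# Hypothetical substrings hinting at medical, legal, or financial questions.
COMPLEX_HINTS = ("diagnos", "symptom", "contract", "lawsuit", "tax", "retirement")


def pick_model(query: str, cost_first: bool = False,
               long_query_words: int = 150) -> str:
    """Choose a model label for a query using a crude complexity heuristic."""
    text = query.lower()
    is_long = len(text.split()) > long_query_words
    looks_advisory = any(hint in text for hint in COMPLEX_HINTS)
    if is_long or looks_advisory:
        return PREMIUM_MODEL
    return BUDGET_MODEL if cost_first else DEFAULT_MODEL


print(pick_model("Where is my order?"))
# GPT-5.4 Mini
print(pick_model("Can I deduct home office costs on my tax return?"))
# Gemini 3.1 Pro Preview
```

Substring matching will misfire (for example, "tax" also matches "taxi"), so treat this as a starting point for measuring how many queries actually need the premium tier.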
Use the LLM Selector to filter by speed and price constraints, or browse the full rankings on Explore.