Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built for high-fidelity understanding and reasoning across text, images, and video. It features improved multimodal fusion with Interleaved-MRoPE for long-horizon temporal reasoning, DeepStack for fine-grained visual-text alignment, and text-timestamp alignment for precise event localization. The model supports a native 256K-token context window, extensible to 1M tokens, and handles both static and dynamic media inputs for tasks like document parsing, visual question answering, spatial reasoning, and GUI control. It achieves text understanding comparable to leading LLMs while expanding OCR coverage to 32 languages and enhancing robustness under varied visual conditions.
Quality Index: 14.3 (284th of 444, top 64%)
Coding Index: 7.3 (294th of 354, top 83%)
Math Index: 27.3 (189th of 268, top 71%)
Price/1M: $0.31 (345th cheapest, 3% above median, top 51%)
Speed: 139 tok/s (top 15%)
TTFT: 1.07s
Context Window: 131K (145th largest, top 63%)
Input: $0.18 per 1M tokens
Output: $0.70 per 1M tokens
Blended: $0.31 per 1M tokens
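The page does not say how the blended price is derived. The listed figures are consistent with a 3:1 input-to-output token weighting, so the sketch below assumes that ratio. The function name and the weights are illustrative, not taken from the page.

```python
def blended_price(input_price: float, output_price: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens.

    The 3:1 default weighting is an assumption; the page does not state it.
    """
    total_weight = input_weight + output_weight
    return (input_price * input_weight + output_price * output_weight) / total_weight


# Listed prices per 1M tokens: $0.18 input, $0.70 output.
price = blended_price(0.18, 0.70)
print(f"${price:.2f} per 1M tokens")
```

Under that assumption the result rounds to the listed $0.31. A workload with a different input/output mix would give a different blended figure.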
Cheaper than 49% of models. Median price is $0.30/1M tokens.
Daily: $0.31
Monthly: $9.30
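The usage level behind the daily and monthly estimates is not shown. The numbers match 1M tokens per day at the $0.31 blended price over a 30-day month, so the sketch below assumes both. The helper name and its parameters are hypothetical.

```python
def usage_cost(blended_per_million: float, tokens_per_day: int, days: int = 30):
    """Return (daily, monthly) cost in dollars.

    Assumes a flat blended price and a fixed daily token volume.
    """
    daily = blended_per_million * tokens_per_day / 1_000_000
    return daily, daily * days


# Assumed usage: 1M tokens per day, 30-day month.
daily, monthly = usage_cost(0.31, 1_000_000)
print(f"Daily: ${daily:.2f}, Monthly: ${monthly:.2f}")
```

Changing `tokens_per_day` scales both figures linearly, which makes it easy to estimate cost for a different workload.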
139 tokens/sec (faster than 85% of models)
1.07 seconds (faster than 26% of models)
1.07 seconds (faster than 37% of models)
Market Median: 45 tok/s (this model is 207% faster)
Median TTFT: 0.42s (this model is 156% slower)
Throughput/Dollar: 450 tok/s per $/1M
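The comparison figures can be roughly reproduced from the rounded values on the page. The sketch below assumes "X% faster" and "X% slower" mean the relative difference from the median, and that throughput per dollar is speed divided by the blended price. Neither definition is confirmed by the page, and the variable names are illustrative.

```python
def percent_above(value: float, baseline: float) -> float:
    """Relative difference of value over baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


speed_tok_s, median_speed_tok_s = 139.0, 45.0
ttft_s, median_ttft_s = 1.07, 0.42
blended_price = 0.31  # dollars per 1M tokens

faster_pct = percent_above(speed_tok_s, median_speed_tok_s)  # throughput vs. median
slower_pct = percent_above(ttft_s, median_ttft_s)            # higher TTFT means slower
throughput_per_dollar = speed_tok_s / blended_price

print(f"{faster_pct:.0f}% faster, {slower_pct:.0f}% slower, "
      f"{throughput_per_dollar:.0f} tok/s per $/1M")
```

With the rounded inputs this gives roughly 209%, 155%, and 448, close to the listed 207%, 156%, and 450. The small gaps are presumably because the page computes from unrounded measurements.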
Context Window: 131K tokens (larger than 37% of models)
Max Output: 33K tokens (25% of context)
7.7M
835
8-16 GB
RTX 4070 / M2 Pro