Qwen3-VL-30B-A3B-Instruct is a multimodal model that combines strong text generation with visual understanding of images and video. The Instruct variant is tuned for instruction following on general multimodal tasks. Its strengths are recognition of real-world and synthetic categories, 2D/3D spatial grounding, and long-form visual comprehension, and it posts competitive results on multimodal benchmarks. For agentic use, it handles multi-image, multi-turn instructions, video timeline alignment, GUI automation, and visual coding, from sketches through to debugged UI. Its text performance matches that of the flagship Qwen3 models, which makes it suitable for document AI, OCR, UI assistance, spatial tasks, and agent research.
Quality Index: 16.1 (243rd of 444, top 55%)
Coding Index: 14.3 (204th of 354, top 58%)
Math Index: 72.3 (89th of 268, top 34%)
Price/1M: $0.35 (352nd cheapest, 17% above median, top 53%)
Speed: 118 tok/s (top 22%)
TTFT: 0.98s
Context Window: 131K (145th largest, top 63%)
Input: $0.20 per 1M tokens
Output: $0.80 per 1M tokens
Blended: $0.35 per 1M tokens
Cheaper than 47% of models. Median price is $0.30/1M tokens.
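The page does not say how the blended price is derived. A minimal sketch, assuming a 3:1 input-to-output token weighting (an assumption, chosen because it reproduces the listed $0.35 from the listed input and output prices), also recomputes the "17% above median" figure. The function names and the weighting are illustrative, not taken from the page.

```python
def blended_price(input_price: float, output_price: float,
                  input_weight: float = 3.0, output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens.

    The 3:1 default weighting is an assumption; the page does not state it.
    """
    total_weight = input_weight + output_weight
    return (input_price * input_weight + output_price * output_weight) / total_weight


def percent_above(value: float, baseline: float) -> float:
    """How far value sits above baseline, as a percentage of baseline."""
    return (value - baseline) / baseline * 100.0


# Prices as listed on the page, in USD per 1M tokens.
price = blended_price(input_price=0.20, output_price=0.80)
above_median = percent_above(price, baseline=0.30)

print(f"Blended: ${price:.2f} per 1M tokens")
print(f"{above_median:.0f}% above the median price")
```

With a different workload mix (for example, output-heavy generation), passing other weights gives a blended figure that better reflects actual spend.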
Daily: $0.35
Monthly: $10.50
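The daily and monthly figures are consistent with a workload of 1M blended tokens per day over a 30-day month. Both of those are assumptions, since the page does not state the usage it models. A short sketch of that arithmetic, with hypothetical names:

```python
BLENDED_PRICE_PER_MILLION = 0.35  # USD, as listed on the page
TOKENS_PER_DAY_MILLIONS = 1.0     # assumption: 1M tokens per day
DAYS_PER_MONTH = 30               # assumption: 30-day month


def usage_cost(price_per_million: float, millions_of_tokens: float) -> float:
    """Cost in USD for a given token volume at a per-1M-token price."""
    return price_per_million * millions_of_tokens


daily_cost = usage_cost(BLENDED_PRICE_PER_MILLION, TOKENS_PER_DAY_MILLIONS)
monthly_cost = daily_cost * DAYS_PER_MONTH

print(f"Daily:   ${daily_cost:.2f}")
print(f"Monthly: ${monthly_cost:.2f}")
```

Changing `TOKENS_PER_DAY_MILLIONS` scales both figures linearly, which is a quick way to estimate cost for a real workload.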
118 tokens/sec (faster than 78% of models)
0.98 seconds (faster than 31% of models)
0.98 seconds (faster than 39% of models)
Market Median: 45 tok/s (this model is 161% faster)
Median TTFT: 0.42s (this model is 134% slower)
Throughput/Dollar: 338 tok/s per $/1M
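The comparison figures above can be approximated from the rounded values shown on the page. The sketch below assumes "X% faster" and "X% slower" both mean the relative difference from the median, and that Throughput/Dollar is speed divided by the blended price. The page does not define either, so treat the formulas as assumptions.

```python
def percent_difference(value: float, baseline: float) -> float:
    """Relative difference of value from baseline, as a percentage of baseline."""
    return (value / baseline - 1.0) * 100.0


# Rounded figures as listed on the page.
speed_tok_s = 118.0
median_speed_tok_s = 45.0
ttft_s = 0.98
median_ttft_s = 0.42
blended_price_usd = 0.35  # USD per 1M tokens

pct_faster = percent_difference(speed_tok_s, median_speed_tok_s)
pct_slower_ttft = percent_difference(ttft_s, median_ttft_s)
throughput_per_dollar = speed_tok_s / blended_price_usd

print(f"Speed: {pct_faster:.1f}% above the median")
print(f"TTFT: {pct_slower_ttft:.1f}% above the median")
print(f"Throughput/Dollar: {throughput_per_dollar:.1f} tok/s per $/1M")
```

From the rounded inputs this gives roughly 162%, 133%, and 337, each within one unit of the listed 161%, 134%, and 338. The small gaps are what one would expect if the page computes from unrounded measurements.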
Speed Comparison
Context Window: 131K tokens (larger than 37% of models)
Max Output: 33K tokens (25% of context)
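The "25% of context" figure is exact if "131K" and "33K" are rounded forms of 131,072 and 32,768 tokens. That is an assumption, since the page gives only the rounded values. The sketch below also shows the prompt budget left when the full output allowance is reserved, assuming prompt and output share one context window, as is common for LLM APIs.

```python
CONTEXT_WINDOW_TOKENS = 131_072  # assumption: "131K" means 128 * 1024
MAX_OUTPUT_TOKENS = 32_768       # assumption: "33K" means 32 * 1024


def output_share(max_output: int, context_window: int) -> float:
    """Fraction of the context window that the maximum output can occupy."""
    return max_output / context_window


def prompt_budget(context_window: int, reserved_output: int) -> int:
    """Tokens left for the prompt after reserving room for the output."""
    return context_window - reserved_output


share = output_share(MAX_OUTPUT_TOKENS, CONTEXT_WINDOW_TOKENS)
budget = prompt_budget(CONTEXT_WINDOW_TOKENS, MAX_OUTPUT_TOKENS)

print(f"Max output is {share:.0%} of the context window")
print(f"Prompt budget with full output reserved: {budget} tokens")
```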
4.7M
552
24-48 GB
A6000 / M3 Ultra