Qwen3 VL 30B A3B Instruct

🧠 AI Modelqwen

30B-parameter multimodal model with 3B activated, excelling in image/video understanding.

Qwen3-VL-30B-A3B-Instruct is an instruction-tuned multimodal model based on a Mixture-of-Experts (MoE) architecture with 30 billion total parameters, of which only 3 billion are activated per token. This design enables efficient inference while maintaining high performance on vision-language tasks. The model supports both image and video inputs, processing visual data alongside text to generate coherent textual responses. Key features include a context length of 262,144 tokens, making it suitable for long-form video understanding. Pricing is $0.13 per million input tokens and $0.52 per million output tokens. The model is available via OpenRouter and supports common inference parameters like frequency penalty, logit bias, and response format. It achieves strong benchmarks on multimodal reasoning tasks, particularly in perception and instruction-following scenarios.

💡Highlights

├─30B MoE, only 3B activated per token
├─262,144-token context for long videos
└─Text & image input, text output

🎯For

├─AI researchers
├─multimodal developers
└─content creators

🔗Links

└─OpenRouter Page