ByteDance: UI-TARS 7B

🧠 AI Modelbytedance

Multimodal vision-language agent for GUI automation, optimized for desktop, web, mobile, and games.

UI-TARS-1.5 (7B parameters) is a state-of-the-art multimodal vision-language agent developed by ByteDance. It excels in GUI-based environments, including desktop applications, web browsing, mobile interfaces, and gaming. The model accepts both image and text inputs, enabling it to understand and interact with graphical user interfaces. With a context length of 128,000 tokens, it can process long conversational histories or detailed instructions. Pricing: $0.10 per million input tokens and $0.20 per million output tokens. The model supports advanced features like frequency penalty, logit bias, max tokens, presence penalty, repetition penalty, seed, stop, and temperature. Built upon the UI-TARS framework with reinforcement learning, it achieves robust performance in automating GUI tasks.

💡Highlights

├─7B parameters
├─128K context length
└─Image + text input, text output

🎯For

├─AI researchers
├─GUI automation developers
└─Product builders

🔗Links

└─OpenRouter Model Page