ByteDance: UI-TARS 7B
🧠 AI Modelbytedance
Multimodal vision-language agent for GUI automation, optimized for desktop, web, mobile, and games.
UI-TARS-1.5 (7B parameters) is a state-of-the-art multimodal vision-language agent developed by ByteDance. It excels in GUI-based environments, including desktop applications, web browsing, mobile interfaces, and gaming. The model accepts both image and text inputs, enabling it to understand and interact with graphical user interfaces. With a context length of 128,000 tokens, it can process long conversational histories or detailed instructions. Pricing: $0.10 per million input tokens and $0.20 per million output tokens. The model supports advanced features like frequency penalty, logit bias, max tokens, presence penalty, repetition penalty, seed, stop, and temperature. Built upon the UI-TARS framework with reinforcement learning, it achieves robust performance in automating GUI tasks.
💡Highlights
- ├─7B parameters
- ├─128K context length
- └─Image + text input, text output
🎯For
- ├─AI researchers
- ├─GUI automation developers
- └─Product builders