ByteDance Models
ByteDance logo

UI-TARS 7B

7B

by ByteDance

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it builds upon the UI-TARS framework with reinforcement learning-based reasoning, enabling robust action planning and execution across virtual interfaces. This model achieves state-of-the-art results on a range of interactive and grounding benchmarks, including OSworld, WebVoyager, AndroidWorld, and ScreenSpot. It also demonstrates perfect task completion across diverse Poki games and outperforms prior models in Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and shows strong scaling across variants, with the 1.5 version notably exceeding the performance of earlier 72B and 7B checkpoints.

Input Price$0.10/1M tokens
Output Price$0.20/1M tokens
Context Window128,000 tokens
Modalitiesimage, text

Specifications

Technical details and pricing.

ProviderByteDance
Context Window128,000 tokens
Release DateJul 22, 2025
ModalitiesImage, Text β†’ Text
CapabilitiesVision

Frequently Asked Questions

What is UI-TARS 7B good for?

Use UI-TARS 7B for everyday tasks like writing, summarizing, brainstorming, and getting clear explanations.

How much does UI-TARS 7B cost?

Pricing is based on usage. Current rates are $0.10/1M tokens for input and $0.20/1M tokens for output.

Can I try UI-TARS 7B for free?

Yes. You can start a chat instantly and test the model before deciding on a plan.

Does UI-TARS 7B support images or audio?

UI-TARS 7B can understand images.

Pricing, context, and capability data are sourced from OpenRouter.