If you've been following AI-powered browser automation, you've probably seen the demos. An AI agent looks at a screenshot of a webpage, reasons about what it sees, and clicks buttons like a person would. OpenAI's Operator and Anthropic's Computer Use both work this way, using vision models to interpret the screen and navigate accordingly. It's intuitive and impressive to watch.
But there's another approach gaining traction that works quite differently, and it's worth understanding why.
Two different bets
An open-source project called Browser Use, backed by Y Combinator, takes the opposite approach from vision-based tools. Instead of feeding screenshots to a model, it extracts the page's DOM (the underlying structure of HTML elements, text, and links) and gives that to the AI as text. The model navigates by understanding the page's structure rather than its visual appearance.
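To make that concrete, here's a rough sketch of what "the page as text" can look like. This isn't Browser Use's actual implementation, just a minimal illustration using Playwright: pull the interactive elements out of the DOM, describe them as a numbered list, and let the model pick by index.

```python
# Minimal sketch of DOM-as-text navigation. Illustrative only; this is not
# Browser Use's actual code. Assumes Playwright is installed (pip install playwright).
from playwright.sync_api import sync_playwright

def describe_interactive_elements(page):
    """Collect clickable and typable elements as a compact numbered text list."""
    elements = page.query_selector_all("a, button, input, select, textarea")
    lines = []
    for i, el in enumerate(elements):
        tag = el.evaluate("e => e.tagName.toLowerCase()")
        label = (el.inner_text() or el.get_attribute("aria-label") or
                 el.get_attribute("name") or "").strip()[:80]
        lines.append(f"[{i}] <{tag}> {label}")
    return elements, "\n".join(lines)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    elements, dom_summary = describe_interactive_elements(page)

    # This text, not a screenshot, is what the language model reasons over.
    prompt = (
        "You are controlling a browser. Interactive elements on the page:\n"
        f"{dom_summary}\n"
        "Reply with the index of the element to click next."
    )
    # Send `prompt` to whatever LLM you use, parse the returned index, then:
    # elements[chosen_index].click()
    browser.close()
```

The whole loop is text in, text out: describe the page, ask for an action, execute it, repeat.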
On the surface, this seems like a step backward. We've spent years building increasingly capable vision models, and now someone is ignoring the pixels entirely? But on the WebVoyager benchmark, which measures how well agents can complete real browser tasks, Browser Use actually outperforms Operator. The project has picked up over 75,000 GitHub stars, which suggests the developer community sees something here too.
Why the simpler approach is competitive
The advantage of text-based navigation comes down to three practical concerns that compound at scale.
The first is speed. Processing a screenshot through a vision model takes real time, often multiple seconds per action. For a task that requires dozens of interactions, that adds up quickly. Extracting DOM elements as text is nearly instant by comparison, and a language model can process that text faster than it can interpret an image.
The second is reliability. Web pages are visually messy. Elements overlap, layouts shift between screen sizes, shadow DOM components hide content, and what you see on screen doesn't always match what's actually in the page structure. Text extraction is more predictable because you get the elements that are actually there, regardless of how they happen to render visually.
The third is cost. Vision API calls are significantly more expensive than text-based ones. If you're running automation at any real scale (say, processing hundreds of tasks per day) the inference costs for screenshot-based approaches add up fast.
The tradeoff is real, though. Text-based approaches struggle with canvas-heavy interfaces, complex visual layouts where spatial positioning matters, and anything that doesn't expose clean DOM elements. But for the kind of structured web interactions that make up most enterprise automation (filling out forms, navigating dashboards, extracting data from tables) the DOM gives you everything you need.
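For exactly those structured tasks, the DOM is enough on its own. A minimal sketch, again with Playwright and a hypothetical form and results table, shows how form filling and table extraction reduce to selector queries rather than pixel coordinates:

```python
# Sketch of the structured interactions the DOM handles well: form filling and
# table extraction. The URL and selectors here are hypothetical placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://dashboard.example.com/reports")  # hypothetical page

    # Fill a form by targeting elements directly; no pixel coordinates needed.
    page.fill("input[name='start_date']", "2025-01-01")
    page.fill("input[name='end_date']", "2025-01-31")
    page.click("button[type='submit']")

    # Extract a results table as structured rows, straight from the DOM.
    rows = []
    for row in page.query_selector_all("table#results tbody tr"):
        cells = [cell.inner_text().strip() for cell in row.query_selector_all("td")]
        rows.append(cells)

    print(rows)
    browser.close()
```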
What this tells us more broadly
This dynamic isn't unique to browser automation. Across AI tooling right now, there's a recurring pattern where the more "advanced" or human-like approach turns out to be less practical than a simpler one for the majority of real use cases.
Voice AI is seeing a similar split, with on-device processing gaining ground over cloud-based approaches for latency-sensitive applications. Agent frameworks are trending toward composable, modular designs rather than monolithic systems. In each case, the boring-but-practical solution is proving more deployable than the impressive-but-complex one.
This doesn't mean vision-based automation is wrong or going away. For tasks that genuinely require visual understanding (CAPTCHAs, visual verification, interfaces that don't expose a clean DOM) it's the right tool. But the Browser Use benchmark results suggest that for the majority of real-world browser automation, you don't need the model to "see" anything. You just need it to understand the page structure and act on it.
If you're evaluating browser automation tools, the practical recommendation is probably to start text-based and add vision capabilities only where you actually need them. Measure speed, cost, and success rate across your specific workflows rather than assuming the more sophisticated approach will perform better. In this corner of AI tooling, at least, the less glamorous option is winning on the metrics that matter.
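If you want to run that comparison yourself, the harness doesn't need to be fancy: run the same tasks through each tool and record latency and success (cost you can pull from your provider's usage dashboard). A rough sketch, where the runner functions and task list are placeholders for whatever you're evaluating:

```python
# Hedged sketch of a side-by-side evaluation harness. The runner functions and
# task list are placeholders; wire in your own text-based and vision-based agents.
import time

def evaluate(runner, tasks):
    """Run each task through `runner` (returns True on success), timing each one."""
    results = []
    for task in tasks:
        start = time.monotonic()
        try:
            ok = bool(runner(task))
        except Exception:
            ok = False
        results.append({"task": task, "ok": ok, "seconds": time.monotonic() - start})
    return results

def summarize(name, results):
    n = len(results)
    success_rate = sum(r["ok"] for r in results) / n
    avg_seconds = sum(r["seconds"] for r in results) / n
    print(f"{name}: {success_rate:.0%} success, {avg_seconds:.1f}s avg per task")

# Usage (hypothetical runners for the tools you're comparing):
# tasks = ["fill out the intake form", "export last month's invoices"]
# summarize("text-based", evaluate(run_text_agent, tasks))
# summarize("vision-based", evaluate(run_vision_agent, tasks))
```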