Most AI agents today browse the web the same way you browse it with JavaScript disabled: they fetch HTML, parse it, and hope the structure hasn't changed. They can't click buttons that require JavaScript. They can't handle cookie banners. They definitely can't log into your bank account or fill out dynamic forms.
vscreen changes that. It's a tool that gives AI agents a real Chromium browser, streamed live over WebRTC.
The Problem: Agents Can't See Half the Internet
Traditional web scraping assumes the internet is a static document. But the modern web is an application platform. Login forms use JavaScript to validate input. Single-page apps render everything client-side. CAPTCHAs block bots. Cookie consent overlays cover the actual content.
When an agent tries to navigate to a SaaS dashboard or a news site, it's basically blindfolded. It gets HTML, but the interesting stuff (the interactive elements, the dynamic content, the actual functionality) lives in JavaScript execution.
vscreen solves this by giving agents an actual browser to control.
How It Works: WebRTC Browser Streaming
vscreen runs a headless Chromium browser and streams the visual output to the AI agent via WebRTC. The agent sees what a human would see: rendered pages, interactive elements, loading states, errors. It can click, scroll, type, and wait just like a user.
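The control loop this enables is simple: observe a frame, decide an action, send it, repeat. Here is a schematic version in Python, with a stub standing in for the real streamed-browser client (the class and method names are assumptions for illustration, not vscreen's API):

```python
# Schematic observe-decide-act loop over a streamed browser.
# StubBrowserClient is a stand-in for whatever client wraps the
# WebRTC stream; its method names are illustrative assumptions.
class StubBrowserClient:
    def __init__(self, frames):
        self.frames = list(frames)   # pretend these are rendered frames
        self.actions = []            # actions the agent sent back

    def next_frame(self):
        """Return the next rendered frame, or None when the stream ends."""
        return self.frames.pop(0) if self.frames else None

    def send(self, action):
        """Send an input action (click, type, scroll) to the browser."""
        self.actions.append(action)


def run_agent(client, decide):
    """Consume frames until none remain, sending one action per frame."""
    while (frame := client.next_frame()) is not None:
        client.send(decide(frame))
    return client.actions


# Example policy: dismiss a cookie banner, then fill the login form.
actions = run_agent(
    StubBrowserClient(["cookie_banner", "login_form"]),
    lambda frame: "click_accept" if frame == "cookie_banner" else "type_credentials",
)
```

The point is that the agent reasons over what it *sees*, not over raw HTML, so client-side rendering and overlays stop being invisible.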
But simply giving an agent a browser isn't enough. The first release had 63 individual tools: click, type, wait, scroll, screenshot, extract, navigate, and so on. Agents ended up chaining them inefficiently: three round-trips just to click a button and see what happened.
The 0.2.0 Architecture: Two Layers, One Fast Path
Version 0.2.0 consolidates those 63 tools into 47 using a two-layer architecture:
Layer 1: Workflow tools handle entire workflows in a single call:
- browse: navigates, waits, dismisses cookie banners, takes a screenshot, and returns page info
- interact: clicks by visible text and returns a screenshot afterward
- extract: pulls structured data in six modes: articles, table, kv, stats, links, or auto
Layer 2: Precision tools give exact control when needed:
click, type, find, wait, scroll, etc.
The insight here is that 80% of web tasks can be handled by Layer 1 tools. Click-and-wait becomes one call instead of four. Extracting an article becomes one call instead of parsing HTML for 20 minutes.
Drop to Layer 2 only when you need surgical precision.
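To make the consolidation concrete, here is a sketch of the call payloads involved (the argument names and shapes are assumptions for illustration, not vscreen's exact schema):

```python
# Hypothetical tool-call payloads illustrating the two layers.
# Argument names are illustrative assumptions, not vscreen's schema.

# Layer 2: three round-trips to click a button and see the result.
layer2_calls = [
    {"tool": "vscreen_click", "args": {"selector": "button.submit"}},
    {"tool": "vscreen_wait", "args": {"condition": "selector", "value": ".result"}},
    {"tool": "vscreen_screenshot", "args": {}},
]

# Layer 1: the same workflow as a single interact call, which clicks
# by visible text, waits, and returns a screenshot afterward.
layer1_call = {
    "tool": "vscreen_interact",
    "args": {"text": "Submit", "wait_for": ".result"},
}
```

Each entry in `layer2_calls` is a full round-trip between agent and server; collapsing them into one Layer 1 call is where the latency savings come from.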
The Live Advisor: Catching Mistakes in Real-Time
The most interesting feature is the advisor. The MCP server tracks every tool call in a sliding window and returns inline hints when it detects anti-patterns:
| Pattern Detected | Hint |
|-----------------|------|
| click → wait → screenshot | vscreen_interact does this in one call |
| scroll â screenshot loop | Use full_page=true instead |
| Repeated fixed waits | Use condition="text" or condition="selector" |
| 5+ calls without Layer 1 | Try vscreen_browse or vscreen_interact |
This is observability built into the tool itself. Instead of the agent failing silently, the tool tells it "you're doing this inefficiently, here's a better way."
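A minimal sketch of this sliding-window idea in Python (a hypothetical reimplementation of the patterns in the table above, not vscreen's actual code):

```python
from collections import deque

# Track recent tool calls in a sliding window and emit an inline hint
# when a known anti-pattern appears. The patterns and hint strings
# mirror the table above; the class itself is a hypothetical sketch.
class Advisor:
    def __init__(self, window_size=10):
        self.calls = deque(maxlen=window_size)

    def record(self, tool_name):
        """Record a tool call; return a hint string, or None."""
        self.calls.append(tool_name)
        recent = list(self.calls)
        # click -> wait -> screenshot chain: one interact call would do.
        if recent[-3:] == ["click", "wait", "screenshot"]:
            return "vscreen_interact does this in one call"
        # scroll -> screenshot loop: ask for a full-page capture instead.
        if recent[-4:] == ["scroll", "screenshot", "scroll", "screenshot"]:
            return "Use full_page=true instead"
        # Many calls without any Layer 1 tool: suggest one.
        layer1 = {"browse", "interact", "extract"}
        if len(recent) >= 5 and not layer1.intersection(recent):
            return "Try vscreen_browse or vscreen_interact"
        return None
```

Because the hint rides back inline with the tool result, the agent can course-correct mid-task instead of discovering the inefficiency after the fact.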
Synthesis: Building Websites from Scraped Data
The synthesis feature is wild. One call scrapes multiple URLs in parallel, extracts content, and builds a live web page:
```
vscreen_synthesis_scrape_and_create({
  "instance_id": "dev",
  "title": "Tech News Roundup",
  "urls": [
    {"url": "https://arstechnica.com", "limit": 8, "source_label": "Ars"},
    {"url": "https://techcrunch.com", "limit": 8, "source_label": "TC"},
    {"url": "https://theverge.com", "limit": 8, "source_label": "Verge"}
  ]
})
```
Three ephemeral tabs open in parallel. The page builds live via SSE as each source finishes. Component type auto-selects: 1–3 sources → hero, 4–12 → card grid, 13+ → content list.
It uses 31 Svelte 5 components: card grids, sortable tables, charts, timelines, code blocks, image galleries. The scraper runs five different strategies (JSON-LD, `<article>` detection, heading+link heuristics, card detection, OpenGraph) with ad filtering and quality scoring.
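The component auto-selection rule can be sketched as a simple threshold function (the function name is illustrative, not from the vscreen codebase):

```python
# Sketch of the component auto-selection rule described above:
# 1-3 sources -> hero, 4-12 -> card grid, 13+ -> content list.
def pick_component(source_count: int) -> str:
    if source_count <= 0:
        raise ValueError("need at least one source")
    if source_count <= 3:
        return "hero"
    if source_count <= 12:
        return "card_grid"
    return "content_list"
```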
Why This Matters for Agent Architecture
vscreen represents a shift in how we think about agent tooling:
- From APIs to browsers: Instead of building custom integrations for every service, give agents a browser and let them use it like a human would.
- Tool consolidation: 63 tools was too many; 47 with a two-layer hierarchy is manageable. The advisor pattern, where tools tell agents they're being inefficient, is a form of runtime guidance that could apply elsewhere.
- Synthesis as a first-class operation: Not just extracting data, but producing useful output. The agent doesn't just browse the web; it builds something from what it finds.
The practical implications are significant. An agent that can reliably navigate any website, dismiss cookie banners and other overlays, extract structured data, and produce synthesized output is fundamentally more capable than one limited to API calls and static HTML parsing.
This is what "agentic" actually looks like in practice: not just choosing tools, but having a tool that can interact with the world the way a human does.
Found via This Week in Rust issue 641. The vscreen project is on GitHub at github.com/jameswebb68/vscreen.