Web Agent · Mind2Web #1
The #1 web agent on the hardest public benchmark.
Mind2Web tests 300 tasks across 136 live websites at three difficulty levels with human evaluation — the most rigorous public eval for web agents. TinyFish scores 89.9%, beating OpenAI Operator (61.3%), Claude Computer Use (56.3%), and Browser Use (30.0%).
Performance

- Overall Score: 89.9% (Mind2Web benchmark, 300 tasks, 136 websites)
- Hard Tasks: 81.9% (vs Operator's 43.2%)
- Page Observation: 15ms (C++ ABox, zero JS)
- Stealth Mechanisms: 28 (C++-level anti-bot)
Mind2Web Benchmark
Difficulty Breakdown
| Agent | Easy | Medium | Hard | Drop (Easy→Hard) |
|---|---|---|---|---|
| ★ TinyFish | 97.5% | 89.9% | 81.9% | −15.6pt |
| Operator | 83.1% | 58.0% | 43.2% | −39.9pt |
| Claude CU | 90.4% | 49.0% | 32.4% | −58.0pt |
| BrowserUse | 55.4% | 26.6% | 8.1% | −47.3pt |
15.6pt drop from easy to hard — smallest degradation of any agent
Why TinyFish Wins
1. C++ Observation: Agent.extractABox runs in the rendering engine. 15ms to observe a page vs 200ms for Puppeteer.
2. Invisible to Sites: Custom CDP domain. No Runtime.evaluate, no Network.enable, no script injection. Nothing for anti-bot systems to detect.
3. Learns Over Time: CaR v2 extracts navigation patterns. Holdout A/B measures causal impact. Recipe replay skips LLM calls.
4. Cost Efficient: Only 20-30% of steps need LLM reasoning. Mechanical actions run in milliseconds via compiled recipes.
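The recipe-replay routing described in points 3 and 4 can be sketched as follows. This is a minimal illustration, not the actual TinyFish implementation; names like `execute_step` and the `pattern` key are assumptions.

```python
# Hypothetical sketch of recipe-replay routing. All names are
# illustrative; this is not the TinyFish API.

def execute_step(step, recipe_cache, llm):
    """Replay a compiled recipe when one exists for this step's
    pattern; fall back to LLM reasoning only for novel decisions."""
    recipe = recipe_cache.get(step["pattern"])
    if recipe is not None:
        # Mechanical action: milliseconds, zero LLM cost.
        return recipe["action"]
    # Complex decision: ask the LLM, then cache the outcome so
    # future runs of the same pattern skip the call entirely.
    action = llm(step)
    recipe_cache[step["pattern"]] = {"action": action}
    return action

cache = {}
# First run of a pattern pays for an LLM call; the second replays it.
first = execute_step({"pattern": "click-login"}, cache, lambda s: "ClickElement")
second = execute_step({"pattern": "click-login"}, cache, lambda s: "unused")
```

On the second call the LLM stub is never invoked, which is the mechanism behind the "skips LLM calls" claim above.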
Speed & Efficiency
Page Observation Latency
| Method | Latency | Mechanism |
|---|---|---|
| Puppeteer | 200ms | Runtime.evaluate + DOM |
| Playwright | 150ms | JS injection + selectors |
| Agent.extractABox | 15ms | Single C++ call |

10-13x faster
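The 10-13x figure follows directly from the latencies listed above:

```python
# Speedup implied by the page-observation latencies above.
latencies_ms = {"Puppeteer": 200, "Playwright": 150, "Agent.extractABox": 15}

abox = latencies_ms["Agent.extractABox"]
speedups = {name: ms / abox
            for name, ms in latencies_ms.items()
            if name != "Agent.extractABox"}
# Playwright: 150/15 = 10.0x, Puppeteer: 200/15 ≈ 13.3x → "10-13x faster"
```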
Cost Architecture
| Step Type | Share | Cost |
|---|---|---|
| LLM reasoning | 25% | Complex decisions only |
| Compiled recipes | 45% | Zero LLM cost |
| Mechanical actions | 30% | Millisecond execution |

Effective LLM cost reduction: ~75% vs pure LLM-per-step agents
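The ~75% reduction assumes a baseline agent that invokes the LLM on every step and that each LLM call costs roughly the same:

```python
# Effective LLM-cost reduction vs a pure LLM-per-step baseline,
# under the simplifying assumption of uniform per-call cost.
step_mix = {"llm_reasoning": 0.25, "compiled_recipes": 0.45, "mechanical": 0.30}

llm_share = step_mix["llm_reasoning"]  # only these steps hit the LLM
reduction = 1.0 - llm_share            # 0.75, i.e. the "~75%" figure
```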
Architecture
Tool Palette
VisitUrl
ScrollPage
SwitchTab
ClickElement
InputText
SelectOption
HoverElement
DragAndDrop
PressKey
SetRange
FillForm
Screenshot
ContentExtract
ListOptions
InspectImage
InspectPDF
FetchTool
SearchTool
WebSearch
Wait
Reconfigure
■ Navigation ■ Action ■ Batch ■ Observe ■ Data ■ Utility
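A palette like this is typically dispatched by name. A minimal registry sketch, assuming a name-to-handler mapping; the handler bodies and decorator are illustrative stand-ins, not TinyFish's implementations:

```python
# Hypothetical tool-dispatch sketch: a registry mapping palette names
# to handlers. Handler bodies are stand-ins for the real tools.

TOOLS = {}

def tool(name):
    """Register a handler under a palette name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("VisitUrl")
def visit_url(url: str) -> str:
    return f"navigated to {url}"

@tool("ClickElement")
def click_element(selector: str) -> str:
    return f"clicked {selector}"

def dispatch(name: str, **kwargs) -> str:
    """Look up a tool by its palette name and invoke it."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

result = dispatch("VisitUrl", url="https://example.com")
```

Keeping dispatch behind a single registry is what lets a framework like Google ADK hand the same palette to the LLM as a tool list.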
Competitive Landscape
| Agent | Mind2Web | Browser | Stealth | Learning |
|---|---|---|---|---|
| ★ TinyFish | 89.9% | Custom Chromium | 28 C++ mechanisms | CaR v2 + recipes |
| OpenAI Operator | 61.3% | Hosted Chrome | Standard | None |
| Claude CU | 56.3% | Screenshot-based | N/A | None |
| Browser Use | 30.0% | Playwright | None | None |
Technology Stack
| Layer | Technology | Role |
|---|---|---|
| Agent Framework | Google ADK v1.21+ | Agent lifecycle, session management, tool dispatch |
| LLM | Google Gemini Flash | Decision-making, tool selection, content understanding |
| Browser | Custom Chromium 147 | C++-level stealth, Agent.* CDP domain, 15ms ABox |
| Gateway | Python aiohttp | Session management, proxy pool, site model recording |
| Web Framework | FastAPI + uvicorn | SSE streaming, REST endpoints |
| Learning | CaR v2 (local JSON) | Pattern extraction, holdout A/B, recipe replay |
| Tracing | LangSmith + OpenTelemetry | Execution tracing, evaluation |
| Storage | SQLite (async) | Step snapshots, session history |
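The "CaR v2 (local JSON)" row suggests recipes persisted as plain JSON files. A stdlib-only sketch of what such a store could look like; the file layout and field names here are assumptions, not the actual CaR v2 format:

```python
# Hypothetical local-JSON recipe store, as suggested by the Learning
# row of the stack table. Layout and field names are assumptions.
import json
import tempfile
from pathlib import Path

def save_recipe(store: Path, pattern: str, steps: list) -> None:
    """Persist one navigation pattern as <store>/<pattern>.json."""
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"{pattern}.json"
    path.write_text(json.dumps({"pattern": pattern, "steps": steps}))

def load_recipe(store: Path, pattern: str):
    """Return the stored recipe, or None if the pattern is unseen."""
    path = store / f"{pattern}.json"
    if not path.exists():
        return None  # no recipe yet: caller falls back to LLM reasoning
    return json.loads(path.read_text())

store = Path(tempfile.mkdtemp())
save_recipe(store, "login-flow", ["VisitUrl", "InputText", "ClickElement"])
recipe = load_recipe(store, "login-flow")
```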
Active Project
In Progress · 6 Stages · ~8,600 lines
AgentBrowser Migration
Incrementally migrating ~8,600 lines of new capabilities into production. Covers AgentBrowser driver integration, 9 new tools, CaR v2 learning system, prompt rewrite, and recipe replay — each stage independently testable and reversible.