Web Agent · Mind2Web #1

The #1 web agent on the hardest public benchmark.

Mind2Web tests 300 tasks across 136 live websites at three difficulty levels with human evaluation — the most rigorous public eval for web agents. TinyFish scores 89.9%, beating OpenAI Operator (61.3%), Claude Computer Use (56.3%), and Browser Use (30.0%).

Performance

Mind2Web benchmark
Overall Score: 89.9% (300 tasks, 136 websites)
Hard Tasks: 81.9% (vs Operator 43.2%)
Page Observation: 15ms (C++ ABox, zero JS)
Stealth Mechanisms: 28 (C++ level anti-bot)

Mind2Web Benchmark

Mind2Web overall accuracy (%), higher is better
TinyFish: 89.9%
OpenAI Operator: 61.3%
Claude Computer Use: 56.3%
Browser Use: 30.0%
TinyFish: 46% more accurate than Operator

Difficulty Breakdown

Agent       Easy    Medium  Hard    Drop
TinyFish    97.5%   89.9%   81.9%   15.6pt
Operator    83.1%   58.0%   43.2%   39.9pt
Claude CU   90.4%   49.0%   32.4%   58.0pt
BrowserUse  55.4%   26.6%   8.1%    47.3pt
15.6pt drop from easy to hard — smallest degradation of any agent
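
The "Drop" column is the easy score minus the hard score, in percentage points. As a quick check, the following sketch recomputes it from the table values; the dictionary layout and function name are illustrative only.

```python
# Scores (percent) copied from the difficulty breakdown table above.
SCORES = {
    "TinyFish": {"easy": 97.5, "medium": 89.9, "hard": 81.9},
    "Operator": {"easy": 83.1, "medium": 58.0, "hard": 43.2},
    "Claude CU": {"easy": 90.4, "medium": 49.0, "hard": 32.4},
    "BrowserUse": {"easy": 55.4, "medium": 26.6, "hard": 8.1},
}


def easy_to_hard_drop(scores):
    """Percentage-point drop from easy to hard, rounded to one decimal."""
    return round(scores["easy"] - scores["hard"], 1)


# Drop per agent, and the agent that degrades the least.
drops = {agent: easy_to_hard_drop(s) for agent, s in SCORES.items()}
smallest = min(drops, key=drops.get)
```

Here `drops` reproduces the table's last column and `smallest` names the agent with the least degradation.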

Why TinyFish Wins

1. C++ Observation
Agent.extractABox runs in the rendering engine. 15ms to observe a page vs 200ms for Puppeteer.

2. Invisible to Sites
Custom CDP domain. No Runtime.evaluate, no Network.enable, no script injection. Nothing for anti-bot to detect.

3. Learns Over Time
CaR v2 extracts navigation patterns. Holdout A/B measures causal impact. Recipe replay skips LLM calls.

4. Cost Efficient
20-30% of steps need LLM reasoning. Mechanical actions run in milliseconds via compiled recipes.
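
To illustrate the idea of recipe replay skipping LLM calls, here is a minimal sketch. It is not the CaR v2 implementation: `RecipeStore`, `run_task`, the task key, and the step format are all hypothetical names chosen for this example, and the planner is passed in as a plain callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A step is (tool name, argument), e.g. ("ClickElement", "#submit").
Step = Tuple[str, str]


@dataclass
class RecipeStore:
    """Hypothetical in-memory store mapping a task key to recorded steps."""
    recipes: Dict[str, List[Step]] = field(default_factory=dict)

    def lookup(self, key: str) -> Optional[List[Step]]:
        return self.recipes.get(key)

    def record(self, key: str, steps: List[Step]) -> None:
        self.recipes[key] = list(steps)


def run_task(
    key: str,
    store: RecipeStore,
    plan_with_llm: Callable[[str], List[Step]],
    execute: Callable[[str, str], None],
) -> bool:
    """Replay a stored recipe if one exists; otherwise plan and record it.

    Returns True when the LLM planner had to be called.
    """
    steps = store.lookup(key)
    used_llm = steps is None
    if used_llm:
        steps = plan_with_llm(key)   # expensive path: ask the model
        store.record(key, steps)     # remember the plan for next time
    for tool, arg in steps:
        execute(tool, arg)           # mechanical path: no model involved
    return used_llm
```

On the first run for a given key the planner is called once and the result is stored; later runs with the same key execute the recorded steps directly.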

Speed & Efficiency

Page Observation Latency

Puppeteer: 200ms (Runtime.evaluate + DOM)
Playwright: 150ms (JS injection + selectors)
Agent.extractABox: 15ms (single C++ call)
10-13x faster
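
The "10-13x" range follows directly from the latencies listed above. A small sketch of the arithmetic, with illustrative variable names:

```python
# Page observation latencies in milliseconds, as listed above.
LATENCY_MS = {"Puppeteer": 200, "Playwright": 150, "Agent.extractABox": 15}


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


abox_ms = LATENCY_MS["Agent.extractABox"]
speedups = {
    name: round(speedup(ms, abox_ms), 1)
    for name, ms in LATENCY_MS.items()
    if name != "Agent.extractABox"
}
```

This gives roughly 13.3x against Puppeteer and 10.0x against Playwright.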

Cost Architecture

LLM reasoning: 25% (complex decisions only)
Compiled recipes: 45% (zero LLM cost)
Mechanical actions: 30% (millisecond execution)
Effective LLM cost reduction: ~75% vs pure LLM-per-step agents
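
The ~75% figure can be reproduced from the step mix under two simplifying assumptions that are made here for illustration only: every LLM step costs the same, and the non-LLM steps incur no LLM cost at all.

```python
# Share of steps by execution path, as listed above.
STEP_MIX = {
    "llm_reasoning": 0.25,
    "compiled_recipes": 0.45,
    "mechanical_actions": 0.30,
}


def llm_cost_reduction(step_mix):
    """LLM cost saved relative to an agent that calls the LLM on every step.

    Assumes uniform cost per LLM step and zero LLM cost elsewhere.
    """
    total = sum(step_mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("step shares must sum to 1.0")
    return 1.0 - step_mix["llm_reasoning"]


reduction = llm_cost_reduction(STEP_MIX)
```

With 25% of steps going to the LLM, `reduction` comes out at 0.75, matching the stated ~75%.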

Architecture

[Architecture diagram]
Flow label: SSE events + ABox JSON
EVB Frontend (React)
EVA Agent (ADK + Gemini)
Browser Driver (CDP WebSocket)
AgentBrowser (Gateway :9250)
Web
CaR v2 Learning (extract + replay)
25 Tools (click, type, hover, ...)
content_shell (Chromium + Agent.* CDP)
Legend: Request, Response
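
The browser driver speaks CDP over a WebSocket, and CDP commands are JSON objects with `id`, `method`, and `params` fields. The sketch below only builds and matches such messages; it opens no connection. The `Agent.extractABox` method name comes from this page, but its parameters and result shape are not documented here, so the empty `params` and the sample result in the usage note are assumptions.

```python
import json


def cdp_command(msg_id, method, params=None):
    """Serialize a CDP command frame (standard id/method/params envelope)."""
    return json.dumps({"id": msg_id, "method": method, "params": params or {}})


def match_response(raw, expected_id):
    """Return the result for the command with expected_id, or None.

    Frames with another id (or events, which carry no id) are ignored.
    """
    msg = json.loads(raw)
    if msg.get("id") != expected_id:
        return None
    if "error" in msg:
        raise RuntimeError(msg["error"].get("message", "CDP error"))
    return msg.get("result", {})


# Assumption: the custom method takes no required parameters.
frame = cdp_command(1, "Agent.extractABox")
```

A driver would send `frame` over the WebSocket and pass each incoming message to `match_response(raw, 1)` until it returns a non-`None` result.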

Tool Palette

VisitUrl
ScrollPage
SwitchTab
ClickElement
InputText
SelectOption
HoverElement
DragAndDrop
PressKey
SetRange
FillForm
Screenshot
ContentExtract
ListOptions
InspectImage
InspectPDF
FetchTool
SearchTool
WebSearch
Wait
Reconfigure
Categories: Navigation, Action, Batch, Observe, Data, Utility

Competitive Landscape

Agent            Mind2Web  Browser           Stealth            Learning
TinyFish         89.9%     Custom Chromium   28 C++ mechanisms  CaR v2 + recipes
OpenAI Operator  61.3%     Hosted Chrome     Standard           None
Claude CU        56.3%     Screenshot-based  N/A                None
Browser Use      30.0%     Playwright        None               None

Technology Stack

Layer            Technology                 Role
Agent Framework  Google ADK v1.21+          Agent lifecycle, session management, tool dispatch
LLM              Google Gemini Flash        Decision-making, tool selection, content understanding
Browser          Custom Chromium 147        C++-level stealth, Agent.* CDP domain, 15ms ABox
Gateway          Python aiohttp             Session management, proxy pool, site model recording
Web Framework    FastAPI + uvicorn          SSE streaming, REST endpoints
Learning         CaR v2 (local JSON)        Pattern extraction, holdout A/B, recipe replay
Tracing          LangSmith + OpenTelemetry  Execution tracing, evaluation
Storage          SQLite (async)             Step snapshots, session history
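
The web framework layer streams agent progress as Server-Sent Events. As a rough illustration of what one SSE frame looks like on the wire, here is a standard-library sketch. The event name `step` and the payload fields are made up for this example and are not taken from the TinyFish API.

```python
import json


def format_sse(event, payload, event_id=None):
    """Format one Server-Sent Events frame with a JSON payload.

    Fields follow the SSE wire format: optional "id:", then "event:",
    then one "data:" line per payload line, ended by a blank line.
    """
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    for chunk in json.dumps(payload).splitlines():
        lines.append(f"data: {chunk}")
    return "\n".join(lines) + "\n\n"


# Hypothetical event describing one tool call.
frame = format_sse("step", {"tool": "ClickElement"})
```

In a FastAPI app, frames like this would typically be yielded from a generator wrapped in a streaming response with the `text/event-stream` media type.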

Active Project

In Progress · 6 Stages · ~8,600 lines

AgentBrowser Migration

Incrementally migrating ~8,600 lines of new capabilities into production. Covers AgentBrowser driver integration, 9 new tools, CaR v2 learning system, prompt rewrite, and recipe replay — each stage independently testable and reversible.
