Lexmount · Open Benchmark

LexBench-Browser

A real-world browser-agent benchmark with long-tail, multilingual web tasks and a deterministic stepwise judge.

Lexmount Research · External collaborators welcome

Overview

LexBench is not just another dataset: it is a benchmark series and evaluation infrastructure for real-world agents. The platform is built around four orthogonal, pluggable axes, so any layer can be swapped out without touching the rest. Results stay reproducible, comparable, and ablation-friendly.

🤖
Agent
Standardized BaseAgent contract. browser-use, skyvern, Agent-TARS, deepbrowse, openai-cua, claude-code — same protocol, fair comparison.
🧠
Model
Any LLM the agent can call. GPT, Claude, Gemini, Doubao, Kimi, Qwen, DeepSeek, MiniMax — switch via config, no code change.
🌐
Browser
Local Chrome, Lexmount cloud browser, AgentBay, CDP — pluggable execution environment with per-task isolation and session-pool support.
⚖️
Eval
Pluggable judge strategies — stepwise rubric (LexJudge), final-answer judge, WebJudge, custom graders. Process-level scoring with structured failure attribution.
Why an evaluation platform, not just a dataset? Most benchmarks ship as "data + scripts" and decay as soon as the dataset is solved. LexBench treats evaluation as shared infrastructure: every layer is decoupled, every run is reproducible, and every result is comparable across Agent × Model × Browser × Eval combinations.
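
To make the four axes concrete, a single run can be described as one choice per axis. The sketch below is purely illustrative: the keys and values are hypothetical and do not reflect the runner's actual configuration schema.

# Hypothetical run spec: one entry per pluggable axis (not the real schema).
run_spec = {
    "agent":   "browser-use",             # any registered BaseAgent adapter
    "model":   "gemini-3.1-pro-preview",  # any LLM the agent can call
    "browser": "lexmount-cloud",          # or chrome-local, agentbay, cdp
    "eval":    {"judge": "stepwise", "model": "gpt-5.4"},
    "data":    "LexBench-Browser",
}

# Swapping one layer is a one-key change; the other axes stay fixed,
# which is what keeps ablations comparable.
ablation_run = {**run_spec, "browser": "chrome-local"}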

Tasks

The first dataset, LexBench-Browser, focuses on browser-agent tasks that resemble actual user workflows: search, e-commerce, video, social, academic, and tool-use tasks across both English and Chinese websites, spanning several difficulty tiers.

210
Tasks total
across 107 distinct websites
4
Reference agents
browser-use · deepbrowse · Agent-TARS · skyvern
15
Models evaluated
GPT · Claude · Gemini · Doubao · Kimi · Qwen · DeepSeek · MiniMax
3
Browser backends
Chrome-Local · Lexmount Cloud · AgentBay
Task distribution diagram showing the categorical and difficulty breakdown of LexBench-Browser tasks
Categorical and difficulty distribution across the LexBench-Browser task pool.
Login-gated and operation-tier tasks are coming soon. The next release adds tasks that require account context, multi-step transactions, and safety-sensitive flows, so agents can be evaluated on the parts of the web that English-only benchmarks miss.

Architecture

The system is organized as four layers — CLI, Execution, Data & Evaluation, and Output — with five pluggable modules concentrated in the middle two layers. Agents, models, browsers, benchmarks and judge strategies are all selected through configuration; the core flow is fixed.

LexBench system architecture: four horizontal layers (CLI, Execution, Data and Evaluation, Output) with pluggable modules for LLM Models, Agents, Browsers, Benchmarks, and Eval LLM Judges
Four-layer architecture: CLI → Execution (Models · Agents · Browsers) → Data & Evaluation (Benchmarks · LLM Judge) → Output (Leaderboard · API).
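
In code terms, the middle two layers reduce to a lookup-and-loop pattern: components are resolved from registries by config key, then the driver runs tasks and judges trajectories. The sketch below is a rough approximation under assumed names (AGENTS, BROWSERS, JUDGES, and run_benchmark are all hypothetical, not the runner's real interfaces).

from typing import Any, Dict, List

# Hypothetical registries; in the real runner these are populated by decorators.
AGENTS: Dict[str, Any] = {}    # Execution layer: agent factories
BROWSERS: Dict[str, Any] = {}  # Execution layer: browser backends
JUDGES: Dict[str, Any] = {}    # Data & Evaluation layer: judge strategies

def run_benchmark(cfg: dict, tasks: List[dict]) -> List[dict]:
    """CLI layer hands a config here; every pluggable module is looked up by key."""
    agent = AGENTS[cfg["agent"]](model=cfg["model"])
    browser = BROWSERS[cfg["browser"]]()
    judge = JUDGES[cfg["eval"]["judge"]](model=cfg["eval"]["model"])

    rows = []
    for task in tasks:
        trajectory = agent.run(task, browser)       # Execution
        rows.append(judge.score(task, trajectory))  # Data & Evaluation
    return rows                                     # Output: leaderboard / API rows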

Evaluation Methodology

Each trajectory is judged stepwise by a calibrated LLM judge. A task passes when its step-aligned score crosses the per-task threshold declared in the dataset. All judge prompts, step screenshots, and per-task scores are saved alongside the run for full reproducibility; a sketch of the threshold gating follows the summary below.

Evaluation pipeline diagram: trajectory captured, screenshots collected, stepwise judge model scores each step, final score gated against pass threshold of 60
Stepwise judge pipeline: trajectory → step alignment → judge model → threshold-gated final score.
Strategy
stepwise
Per-step screenshot + intent alignment
Judge
gpt-5.4
Calibrated LLM judge model
Pass threshold
per-task
Declared in the dataset, scaled to task difficulty
Reported
success / total
No silent task drops
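
The gating step itself is simple arithmetic over the judge's per-step scores. A minimal sketch, assuming uniform averaging of step scores on a 0-100 scale (the real aggregation and score scale may differ); StepVerdict and gate_task are illustrative names, not the runner's API.

from dataclasses import dataclass
from typing import List

@dataclass
class StepVerdict:
    step_index: int
    score: float      # judge score for this aligned step, assumed on a 0-100 scale
    rationale: str    # stored judge explanation, used for failure attribution

def gate_task(verdicts: List[StepVerdict], pass_threshold: float) -> dict:
    """Average per-step scores and gate against the task's declared threshold.

    Uniform averaging is an assumption; the real aggregation may weight steps.
    """
    if not verdicts:
        return {"score": 0.0, "passed": False}
    score = sum(v.score for v in verdicts) / len(verdicts)
    return {"score": round(score, 1), "passed": score >= pass_threshold}

# Example: two aligned steps, gated against a threshold of 60 as in the diagram above.
verdicts = [StepVerdict(0, 90.0, "opened the target site"),
            StepVerdict(1, 40.0, "applied the wrong filter")]
print(gate_task(verdicts, pass_threshold=60.0))  # {'score': 65.0, 'passed': True}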

Reference Agents

Four open agent integrations ship in the runner today, each registered with a single decorator on a BaseAgent subclass. Adding a new one is roughly ten lines.

browser-use
DOM-action LLM agent driving Playwright with built-in tool primitives.
repo →
deepbrowse
Long-horizon planning + execution split, optimized for multi-step navigation.
private preview
Agent-TARS
Vision-grounded action policy with screenshot-first reasoning.
repo →
skyvern
Workflow-style orchestrated agent with explicit step contracts.
repo →
Implementing a new agent: subclass BaseAgent, register with @register_agent("name"), and the runner will spawn it inside its own uv environment via --extra <agent>.
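
A sketch of what that looks like in practice: BaseAgent, @register_agent("name"), and the --extra mechanism come from the runner as described above, but the import path, method names, and browser calls below are illustrative assumptions rather than the exact contract.

from bubench.agents import BaseAgent, register_agent  # import path is an assumption

@register_agent("my-agent")
class MyAgent(BaseAgent):
    async def run(self, task, browser):
        """Drive the browser for one task and return the recorded trajectory."""
        trajectory = []
        while not self.is_done(task, trajectory):               # stopping rule: assumed helper
            observation = await browser.screenshot()            # browser API: assumed
            action = await self.next_action(task, observation)  # model call: assumed helper
            await browser.execute(action)
            trajectory.append(action)
        return trajectory

Once registered, the adapter is selected the same way as the reference agents, e.g. bubench run --agent my-agent --data LexBench-Browser.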

Community

LexBench-Browser is most useful when teams can reproduce each other's results and compare new agents under the same protocol. The fastest way to contribute is to add a run, an integration, or a benchmark proposal that others can verify.

1
Run a Baseline
Start with bubench run --agent browser-use --data LexBench-Browser --mode first_n --count 3 and share the run directory when reporting results.
2
Add an Agent
Implement a BaseAgent adapter, register it once, and compare it against the same browser backends and judge settings.
3
Propose Tasks
Open an issue with target sites, expected actions, login requirements, and why the task reveals a real browser-agent failure mode.
4
Reproduce Results
Attach config, model ID, browser backend, judge settings, and task-level outputs so leaderboard changes can be audited.

Leaderboard

Live results across all reference agents and models, sortable by any column; the top 15 entries by success rate are shown below.

benchmark: LexBench-Browser · judge: gpt-5.4 (per-task threshold) · snapshot: 2026-04-29
# Agent Model Browser Success % Avg steps Avg e2e (s)
1 browser-use claude-opus-4-7 Lexmount 58.0 14.2 205.8
2 browser-use kimi-k2.5 Lexmount 58.0 24.7 280.1
3 browser-use gemini-3.1-pro-preview Lexmount 56.0 13.9 149.0
4 browser-use gpt-5.5 Lexmount 54.0 14.1 273.0
5 browser-use gemini-3.1-pro-preview Chrome-Local 54.0 14.0 163.3
6 browser-use kimi-k2.6 Lexmount 52.0 30.3 447.6
7 browser-use bu-2-0 Chrome-Local 48.0 20.8 136.5
8 browser-use MiniMax-M2.7 Chrome-Local 42.0 27.5 413.1
9 Agent-TARS gemini-3.1-pro-preview Lexmount 40.0 18.4 121.1
10 browser-use bu-2-0 Lexmount 40.0 23.4 350.7
11 browser-use gemini-2.5-pro Lexmount 40.0 18.7 279.2
12 browser-use qwen3.5-plus Lexmount 40.0 23.7 326.7
13 browser-use MiniMax-M2.5 Lexmount 38.0 27.4 354.3
14 browser-use MiniMax-M2.7 Lexmount 36.0 22.4 408.9
15 browser-use doubao-seed-2-0-pro Lexmount 36.0 17.3 385.3

Live source: leaderboard server on the team intranet. Data is pulled automatically from experiments/{benchmark}/{split}/{agent}/{model_id}/{ts}/ run directories.
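
As an illustration of how those run directories could feed the leaderboard, the sketch below walks the experiments/ tree and computes a success rate per run. Only the directory layout comes from the note above; the results.json file name and its fields are assumptions.

import json
from pathlib import Path

def collect_rows(root: Path = Path("experiments")) -> list:
    """Walk experiments/{benchmark}/{split}/{agent}/{model_id}/{ts}/ and build rows."""
    rows = []
    for run_dir in root.glob("*/*/*/*/*"):
        if not run_dir.is_dir():
            continue
        results_file = run_dir / "results.json"   # file name and schema are assumptions
        if not results_file.exists():
            continue
        tasks = json.loads(results_file.read_text())
        benchmark, split, agent, model_id, ts = run_dir.parts[-5:]
        rows.append({
            "benchmark": benchmark,
            "split": split,
            "agent": agent,
            "model": model_id,
            "snapshot": ts,
            "success_pct": 100.0 * sum(t["passed"] for t in tasks) / max(len(tasks), 1),
        })
    return sorted(rows, key=lambda r: r["success_pct"], reverse=True)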

Cite

If you use LexBench-Browser in your work, please cite:

@misc{lexbench_browser_2026,
  title        = {LexBench-Browser: A Real-World Browser Agent Benchmark with Long-Tail and Multilingual Tasks},
  author       = {Lexmount Research and Collaborators},
  year         = {2026},
  howpublished = {\url{https://lexmount.github.io/browseruse-agent-bench/}},
  note         = {Open benchmark; v1.0 reference release}
}

Acknowledgements: the integration scaffolding adapts patterns from the browser-use, skyvern, Agent-TARS, and deepbrowse upstream codebases.