Paper Detail

CFAgentBench: A Reproducible Environment and Benchmark for Autonomous Construction-Finance Agents

Rishi Srivastava

arxiv Score 14.9

Published 2026-06-20 · First seen 2026-06-24

Research Track B · General AI

Abstract

We introduce CFAgentBench, a reproducible, self-hostable environment and benchmark for autonomous construction-finance agents: a CFO/controller-class agent operating across the real software stack a US construction finance team runs - ERP, project management, email, documents, pay applications, payroll, certified payroll, lien waivers, and bank/treasury portals. It contains 1,014 machine-gradeable task specifications across 8 domains and 77 families, every family grounded in a real source; a self-validated subset of 40 tasks (54 with a project-management extension) is compiled into oracle-validated executable evaluators, the runnable suite reported here. Following WebArena, the benchmark runs on an executable environment rather than static traces: 35 mock applications (31 reconciled to one company book, plus 4 PM platforms) over 9 archetypes, each implementing a uniform self-hostable app contract, so every task is graded by functional correctness - a state diff plus forbidden-side-effect checks plus required-output regexes - with an LLM judge used only for reply quality, never as reward. A distinguishing principle is a money-movement guard: 278 instances embed a payment, payroll, e-signature, or e-filing step where the correct behavior is to stop and stage for human approval, and executing even the correct transaction fails the task. The public split (n=711) is sized for a 95% Wilson half-width of +/-4.1%; a private, contamination-protected split (n=303) is reserved for remote scoring. In a first three-model open-weight sweep (k=5), the strongest agent reaches pass^1 = 0.67 but only pass^5 = 0.38 - losing 43% of its successes when required to repeat them under temperature-0 decoding. The within-model pass^1 to pass^5 collapse and sharp per-domain heterogeneity are clear evidence that single-attempt accuracy overstates deployable construction-finance competence.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{srivastava2026cfagentbench,
  title = {CFAgentBench: A Reproducible Environment and Benchmark for Autonomous Construction-Finance Agents},
  author = {Rishi Srivastava},
  year = {2026},
  abstract = {We introduce CFAgentBench, a reproducible, self-hostable environment and benchmark for autonomous construction-finance agents: a CFO/controller-class agent operating across the real software stack a US construction finance team runs - ERP, project management, email, documents, pay applications, payroll, certified payroll, lien waivers, and bank/treasury portals. It contains 1,014 machine-gradeable task specifications across 8 domains and 77 families, every family grounded in a real source; a sel},
  url = {https://arxiv.org/abs/2606.22000},
  keywords = {cs.AI, cs.CL, cs.LG, cs.SE},
  eprint = {2606.22000},
  archiveprefix = {arXiv},
}

Metadata

{}