Fixing LLM Benchmarking for Tax, Finance & Accounting
Why current LLM evaluation methods fail for professional services — and how we're building a better framework.
Work in Progress
The Problem
Standard LLM benchmarks like MMLU, HumanEval, and HellaSwag measure general capability — but they miss what matters for tax, finance, and accounting professionals:
Temporal accuracy — distinguishing 2023 vs 2024 tax code changes
Professional judgment — when to escalate vs automate
Current benchmarks don't test any of this. A model can score 90% on MMLU and still miscalculate depreciation or miss a Section 179 deduction.
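Temporal accuracy in particular is easy to state as a concrete test. Here is a minimal sketch of what such a check could look like — the `TemporalCase` type, the case data, and the scoring rule are all illustrative, not part of any published benchmark (the standard-deduction figures shown are the IRS amounts for single filers in 2023 and 2024):

```python
from dataclasses import dataclass

@dataclass
class TemporalCase:
    """A question whose correct answer depends on the tax year."""
    question: str
    tax_year: int
    expected_answer: str

# The same question has a different correct answer in each year.
CASES = [
    TemporalCase("What is the standard deduction for a single filer?", 2023, "$13,850"),
    TemporalCase("What is the standard deduction for a single filer?", 2024, "$14,600"),
]

def score_temporal(model_answer: str, case: TemporalCase) -> bool:
    # Pass only if the answer contains the year-correct figure.
    return case.expected_answer in model_answer

# A model that quotes the 2023 figure for a 2024 question fails.
print(score_temporal("For 2024 the standard deduction is $14,600.", CASES[1]))  # True
print(score_temporal("The standard deduction is $13,850.", CASES[1]))           # False
```

Substring matching is obviously crude; a real harness would normalize currency formatting and check the reasoning, but the year-keyed answer set is the essential idea.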
Our Approach
We're building TaxBench, FinBench, and AcctBench — domain-specific evaluation frameworks that test what actually matters. We benchmark frontier models (GPT-4, Claude 3.5, Gemini 1.5) and leading open-source models against rigorous professional standards:
1. Real-World Task Simulation
Instead of multiple-choice questions, we evaluate models on actual workflows:
"Calculate AMT for a taxpayer with ISO stock options and rental income"
"Identify missing Schedule C deductions from bank statements"
"Draft a response to IRS Notice CP2000 with supporting documentation"
2. Citation & Source Verification
Accuracy isn't enough; models must show their work. We test for:
Correct citation of specific IRC sections and subsections
Reference to relevant case law and revenue rulings
Ability to distinguish between primary authority and guidance
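The first of these checks — did the model cite the right IRC sections? — can be automated. Here is a minimal sketch, assuming a simple regex over common citation forms and a per-question set of required authorities; the pattern and scoring function are illustrative, not the framework's real implementation:

```python
import re

# Matches common citation forms: "IRC §1031(a)", "Section 121(a)", "Sec. 179".
CITE_RE = re.compile(
    r"(?:IRC\s*\u00a7|Section|Sec\.?)\s*(\d+[A-Z]?(?:\([0-9a-zA-Z]+\))*)",
    re.IGNORECASE,
)

def extract_citations(answer: str) -> set[str]:
    """Pull the section identifiers a model actually cited."""
    return {m.group(1) for m in CITE_RE.finditer(answer)}

def citation_score(answer: str, required: set[str]) -> float:
    """Fraction of the required authorities the answer cites."""
    found = extract_citations(answer)
    return len(found & required) / len(required)

answer = "The gain is excluded under Section 121(a); see also IRC §1031(a) for exchanges."
print(citation_score(answer, {"121(a)", "1031(a)"}))  # 1.0
```

Regex matching only verifies that the right sections were named, not that they were applied correctly — distinguishing primary authority from sub-regulatory guidance still needs a curated authority database or expert review.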
Without domain-specific benchmarks, we're flying blind. Companies buy "AI tax software" based on generic metrics that don't predict real-world performance.
This creates:
Wasted investment — firms buy tools that fail in production
Compliance risk — AI makes mistakes that trigger audits
Lost trust — professionals abandon AI after bad experiences
We're building TaxBench/FinBench/AcctBench to change this — giving practitioners a way to evaluate models on what actually matters before deploying them.
Current Status
We are compiling one of the most comprehensive datasets for LLM benchmarking in tax, finance, and accounting. Unlike academic exams, our dataset covers real-world fact patterns, edge cases, and regulatory nuance.
📈 AcctBench v1 — Deep GAAP compliance and audit support workflows
Expected Release: Q1 2026
We will publish results for major frontier models and leading open-source alternatives. In parallel, we are developing specialized infrastructure and domain-adapted models aimed at a higher standard of accuracy and reliability for the industry.
Want to Contribute?
We're looking for CPAs, tax attorneys, and accounting professionals to help validate our benchmark scenarios.