Flex Labs: Senior Underwriting Agent

Evaluating frontier models on finance workflows, starting with lending.

Real-world financial data by experts from Citi, JP Morgan, Barclays, US Bank, and more.
Flex's ongoing lending history includes ~86,000 loan applications representing ~$2B in requested credit limits.

Senior Underwriting Agent (150 Sample Tasks)

T1/T2 overall · T3 key risks · T4 narrative

Claude Opus 4.7 · 55.4% overall · T3 3.51 · T4 3.17
Gemini 3.5 Flash · 51.6% overall · T3 2.80 · T4 2.53
GPT-5.5 · 50.7% overall · T3 3.25 · T4 3.04
Command A+ · 46.7% overall · T3 2.51 · T4 2.35
Grok 4.3 · 41.5% overall · T3 2.82 · T4 2.55
DeepSeek V4 Pro · 37.6% overall · T3 2.93 · T4 2.42

Key Takeaways

Claude Opus 4.7 is the top performer on Senior Underwriting Agent, scoring 55.4% overall accuracy.

Frontier models systematically under-approve in US B2B loan underwriting.

The hardest task is not identifying bad credit: it is approving creditworthy edge cases that require judgment beyond policy compliance.

Every frontier model is biased toward declining. No model over-approves. The gap between human approvals (73) and model approvals ranges from 11 (Claude) to 59 (DeepSeek).
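
The approval gap can be turned into implied approval counts with simple arithmetic. The sketch below assumes the gap is a plain difference in approval counts and that the 73 human approvals are counted over the same 150 cases; neither is stated explicitly above. Only the two gaps quoted in the text are used, and the helper name `implied_approvals` is illustrative.

```python
HUMAN_APPROVALS = 73  # from the text
TOTAL_CASES = 150     # benchmark size; ASSUMPTION: same denominator as the 73 approvals

# Gaps quoted in the text; the other models' gaps are not given.
approval_gap = {"Claude": 11, "DeepSeek": 59}


def implied_approvals(gap: int, human: int = HUMAN_APPROVALS) -> int:
    """ASSUMPTION: gap = human approvals minus model approvals."""
    return human - gap


print(f"Human: {HUMAN_APPROVALS} approvals ({HUMAN_APPROVALS / TOTAL_CASES:.1%} of cases)")
for name, gap in approval_gap.items():
    approvals = implied_approvals(gap)
    print(f"{name}: {approvals} approvals ({approvals / TOTAL_CASES:.1%} of cases)")
```

Under these assumptions, Claude approves 62 cases and DeepSeek 14, against 73 human approvals.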

Background

Credit is at the foundation of the US financial system. Small and medium-sized companies contribute 43.5% of US GDP (~$13.8 trillion in aggregate), and much of that volume is financed via some form of credit rail: credit cards, SBA loans, merchant cash advances, etc.

We've partnered with one of the fastest-growing B2B underwriters in the US, Flex, to understand how their team of dozens of human experts decides how to lend (or not lend) from their $200M facility every day.

What's in the Benchmark?

Category (n) · Description · Avg accuracy

Policy decline: FICO (n=41) · FICO below policy threshold; automatic disqualifier · 97.6%
Distressed financials (n=33) · Negative equity, sub-1 current ratio, or persistent losses · 91.4%
Micro business (n=36) · Very small businesses with limited financial history · 82.8%
Young business (n=35) · Early-stage operations under 2 years old · 68.4%
Standard decision (n=22) · Clean applications where policy and financials align · 62.1%
High revenue low income (n=13) · Large revenue with thin or negative net income · 52.6%
High revenue declined (n=11) · Large revenue declined due to leverage or cash flow concerns · 54.5%
Distressed, approved (n=14) · Distressed business approved on compensating factors · 34.5%
FICO override (n=10) · Below-threshold FICO approved by human judgment · 8.5%

About Flex Labs

Flex Labs builds evals, expert-labeled datasets, and post-training data for financial reasoning.
We start with B2B credit underwriting because we have evergreen access to experts and measurable downstream outcomes (>$2B in transaction volume), which turn Flex's proprietary decision history into the industry's most grounded benchmark for AI underwriting.

Methodology

Each of the 150 cases was evaluated independently by 6 frontier models using the same structured application packet: the full Flex credit policy, a pre-computed FICO check, and signals across 6 data sections covering business profile, financials, banking, credit, fraud, and the credit request.

Models are scored on four tasks: T1 (approve/decline), T2 (credit limit within 25% of human decision), T3 (top 3 key risks, 1-5 rubric), and T4 (underwriting narrative, 1-5 rubric). T1/T2 are scored deterministically. T3/T4 are scored by a Claude Sonnet 4.6 judge calibrated against real Flex underwriter narratives. Overall score weights T1 at 60% and T2 at 40%.
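
The deterministic part of the scoring (T1, T2, and the 60/40 weighting) can be sketched as follows. This is a minimal illustration under stated assumptions, not Flex Labs' actual scorer: the `Decision` type and function names are hypothetical, and the handling of T2 on declined cases is a guess, since the methodology does not say how a credit limit is scored when the human underwriter declined.

```python
from dataclasses import dataclass
from typing import Optional

T1_WEIGHT = 0.60      # stated in the methodology
T2_WEIGHT = 0.40      # stated in the methodology
T2_TOLERANCE = 0.25   # "within 25% of human decision"


@dataclass
class Decision:
    approved: bool
    credit_limit: Optional[float] = None  # USD; None when declined


def t1_correct(model: Decision, human: Decision) -> bool:
    """T1: the approve/decline call matches the human underwriter."""
    return model.approved == human.approved


def t2_correct(model: Decision, human: Decision) -> bool:
    """T2: the model's limit is within 25% of the human limit."""
    if not human.approved:
        # ASSUMPTION (not in the source): a declined case has no limit,
        # so T2 is credited only when the model also declines.
        return not model.approved
    if not model.approved or model.credit_limit is None or human.credit_limit is None:
        return False
    return abs(model.credit_limit - human.credit_limit) <= T2_TOLERANCE * human.credit_limit


def overall_score(model: Decision, human: Decision) -> float:
    """Weighted per-case score: 60% T1 plus 40% T2."""
    return (T1_WEIGHT * float(t1_correct(model, human))
            + T2_WEIGHT * float(t2_correct(model, human)))


# Hypothetical case: the human approved $80,000 and the model approved $70,000.
human = Decision(approved=True, credit_limit=80_000)
model = Decision(approved=True, credit_limit=70_000)
print(overall_score(model, human))
```

Averaging `overall_score` over all cases would give a leaderboard-style percentage, assuming per-case scores are averaged with equal weight.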

From Benchmark to Environment

The current Senior Underwriting Agent benchmark uses clean, normalized packets to isolate underwriting judgment. For agentic post-training, Flex Labs can also support messy workflow environments where models inspect original-style source artifacts including financial statements, screenshots, spreadsheet exports, and internal rationale logs, using controlled tools inside a sandboxed environment.

Future versions extend into stateful credit workflows with downstream repayment, delinquency, and recovery outcomes as reward signals, drawn from Flex's live lending portfolio.

Sample task (1 out of 150)

FICO Override · FLEX_016

You are a senior credit underwriter at Flexbase. Base your decision on the information provided, applying sound underwriting judgment and general business knowledge where relevant. Treat null or missing values as data gaps that increase uncertainty, not as neutral or favorable signals.

The following application packet includes: the full Flex credit policy, a precomputed FICO policy check, and structured application signals across 6 data sections. Evaluate independently with no shared context from prior cases.

Based on the application below, provide:
T1: Your decision — "approved" or "declined"
T2: If approved, recommended credit limit (USD)
T3: Top 3 risk factors that most influenced your decision
T4: 2–3 sentence underwriting narrative

--- APPLICATION PACKET: FLEX_016 ---

S1 — Business Info:
  Industry: Online Retail / E-Commerce (Consumer Durables & Apparel)
  State: FL | Business age: 58 months | Stage: Small business
  Owner count: 1 (100% ownership) | KYB: Approved

S2 — Financials:
  TTM Revenue: $4,040,347
  TTM Net Income: $3,111,847
  TTM Gross Profit: $1,857,779
  Yearly revenue trend: $517K → $1.94M → $4.06M (strong growth)
  Yearly gross margin trend: 40.9% → 43.9% → 46.6% (expanding)
  Total assets: $644,798 | Total liabilities: $1,189,297
  Total equity: -$544,499 (negative)
  Cash at decision: $182,234
  Current ratio: 0.55 | Cash ratio: 0.20
  Working capital: -$407,481

S3 — Banking:
  Plaid avg 60d balance: $145,011
  Plaid current balance: $76,325
  Rutter avg 60d balance: $87,000
  Monthly revenue (last 6): $62K, $268K, $227K, $314K, $1,165K, $318K
  Monthly net income (last 6): $69K, $231K, $170K, $187K, $826K, $256K

S4 — Credit:
  FICO: 693 [POLICY CHECK: 650 minimum for $3M-$5M ARR tier — MEETS MINIMUM]
  Total hard pulls: 0 | Hard pulls last 12mo: 0
  Total tradelines: 16 | Open: 12 | Revolving: 11 | Installment: 1
  Issues count: 0
  Bureau reasons: High revolving utilization; number of accounts with delinquency;
    length of time revolving established; proportion of loan balances too high
  Frozen: No

S5 — Fraud / KYC:
  Sardine customer score: 1 | Customer level: Low
  Phone: Low | Bot: Low | Device: Low | Address: Low (valid) | IP: US
  Sardine rules fired: 9 | KYC persons verified: 1
  KYC documents uploaded: 0

S6 — Request:
  Requested limit: $80,000 | Tier: Tier 1 | Signed PG: No

This sample task is provided for illustration only. Domain scores represent the average across 150 held-out anonymized cases. Sample tasks are passed to models with the full Flex credit policy document.
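
Several signals in the sample packet are derived from others, so they can be cross-checked with basic accounting identities. The sketch below uses only figures from FLEX_016. The function names are illustrative, and the implied current liabilities are an estimate because the reported current ratio is rounded to two decimals.

```python
# Figures copied from the FLEX_016 sample packet.
total_assets = 644_798
total_liabilities = 1_189_297
reported_equity = -544_499
cash = 182_234
working_capital = -407_481
current_ratio = 0.55           # rounded in the packet
reported_cash_ratio = 0.20     # rounded in the packet
ttm_revenue = 4_040_347
requested_limit = 80_000


def equity(assets: float, liabilities: float) -> float:
    """Balance-sheet identity: equity = assets - liabilities."""
    return assets - liabilities


def implied_current_liabilities(wc: float, cr: float) -> float:
    """From WC = CA - CL and CR = CA / CL it follows that CL = WC / (CR - 1)."""
    return wc / (cr - 1.0)


current_liabilities = implied_current_liabilities(working_capital, current_ratio)
cash_ratio = cash / current_liabilities

print("equity matches packet:", equity(total_assets, total_liabilities) == reported_equity)
print(f"implied current liabilities: ~${current_liabilities:,.0f}")
print(f"implied cash ratio: {cash_ratio:.2f} (packet reports {reported_cash_ratio:.2f})")
print(f"requested limit as share of TTM revenue: {requested_limit / ttm_revenue:.1%}")
```

The reported equity, current ratio, working capital, and cash ratio are mutually consistent under these identities, and the requested limit is roughly 2% of TTM revenue.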