Flex Labs: Senior Underwriting Agent

Evaluating frontier models on core B2B loan underwriting tasks.

Real-world financial data by experts from Citi, JP Morgan, Barclays, US Bank, and more.
Flex's ongoing underwriting history includes ~86,000 loan applications representing ~$2B in requested credit limits.

Senior Underwriting Agent (150 Sample Tasks)

T1/T2 overall · T3 key risks · T4 narrative

1. Claude Opus 4.7: 55.4% · T3 3.51 · T4 3.17
2. Gemini 3.5 Flash: 51.6% · T3 2.8 · T4 2.53
3. GPT-5.5: 50.7% · T3 3.25 · T4 3.04
4. Command A+: 46.7% · T3 2.51 · T4 2.35
5. Grok 4.3: 41.5% · T3 2.82 · T4 2.55
6. DeepSeek V4 Pro: 37.6% · T3 2.93 · T4 2.42

Key Takeaways

01. Claude Opus 4.7 is the top performer on Senior Underwriting Agent, scoring 55.4% overall accuracy.

02. Frontier models systemically under-approve in US B2B loan underwriting.

03. The hardest task is not identifying bad credit: it is approving creditworthy edge cases that require judgment beyond policy compliance.

Accuracy by Case Type

(avg across all models)

Policy decline: FICO · n=41 · 97.1%
Distressed financials · n=33 · 89.1%
Micro business · n=36 · 82.2%
Young business · n=35 · 69.7%
Standard decision · n=22 · 65.5%
High revenue low income · n=13 · 53.8%
High revenue declined · n=11 · 45.5%
Distressed, approved · n=14 · 40%
FICO override · n=10 · 10%

About Flex Labs

Flex originates a real, continuously growing stream of US B2B credit decisions at scale (>$4B in transaction volume). Flex Labs turns that proprietary decision history into the industry's most grounded benchmark for AI underwriting.

Sample task (1 out of 150)

FICO Override · FLEX_016

You are a senior credit underwriter at Flexbase. Base your decision on the information provided, applying sound underwriting judgment and general business knowledge where relevant. Treat null or missing values as data gaps that increase uncertainty, not as neutral or favorable signals.

The following application packet includes: the full Flex credit policy, a precomputed FICO policy check, and structured application signals across 6 data sections. Evaluate independently with no shared context from prior cases.

Based on the application below, provide:
T1: Your decision — "approved" or "declined"
T2: If approved, recommended credit limit (USD)
T3: Top 3 risk factors that most influenced your decision
T4: 2–3 sentence underwriting narrative

--- APPLICATION PACKET: FLEX_016 ---

S1 — Business Info:
  Industry: Online Retail / E-Commerce (Consumer Durables & Apparel)
  State: FL | Business age: 58 months | Stage: Small business
  Owner count: 1 (100% ownership) | KYB: Approved

S2 — Financials:
  TTM Revenue: $4,040,347
  TTM Net Income: $3,111,847
  TTM Gross Profit: $1,857,779
  Yearly revenue trend: $517K → $1.94M → $4.06M (strong growth)
  Yearly gross margin trend: 40.9% → 43.9% → 46.6% (expanding)
  Total assets: $644,798 | Total liabilities: $1,189,297
  Total equity: -$544,499 (negative)
  Cash at decision: $182,234
  Current ratio: 0.55 | Cash ratio: 0.20
  Working capital: -$407,481

S3 — Banking:
  Plaid avg 60d balance: $145,011
  Plaid current balance: $76,325
  Rutter avg 60d balance: $87,000
  Monthly revenue (last 6): $62K, $268K, $227K, $314K, $1,165K, $318K
  Monthly net income (last 6): $69K, $231K, $170K, $187K, $826K, $256K

S4 — Credit:
  FICO: 693 [POLICY CHECK: 650 minimum for $3M-$5M ARR tier — MEETS MINIMUM]
  Total hard pulls: 0 | Hard pulls last 12mo: 0
  Total tradelines: 16 | Open: 12 | Revolving: 11 | Installment: 1
  Issues count: 0
  Bureau reasons: High revolving utilization; number of accounts with delinquency;
    length of time revolving established; proportion of loan balances too high
  Frozen: No

S5 — Fraud / KYC:
  Sardine customer score: 1 | Customer level: Low
  Phone: Low | Bot: Low | Device: Low | Address: Low (valid) | IP: US
  Sardine rules fired: 9 | KYC persons verified: 1
  KYC documents uploaded: 0

S6 — Request:
  Requested limit: $80,000 | Tier: Tier 1 | Signed PG: No
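
The balance-sheet figures in S2 above can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not part of the Flex evaluation harness. The variable names are hypothetical, and it assumes the standard definitions (equity = assets - liabilities, working capital = current assets - current liabilities, current ratio = current assets / current liabilities, cash ratio = cash / current liabilities). Current assets and current liabilities do not appear in the packet, so they are backed out from the reported figures and are approximate because the ratios are rounded to two decimals.

```python
# Figures copied from S2 of the FLEX_016 packet.
total_assets = 644_798
total_liabilities = 1_189_297
reported_equity = -544_499
working_capital = -407_481
current_ratio = 0.55
cash = 182_234
reported_cash_ratio = 0.20

# Accounting identity: equity = assets - liabilities.
equity = total_assets - total_liabilities

# Back out current liabilities (CL) and current assets (CA) from:
#   CA - CL = working_capital   and   CA / CL = current_ratio
#   => CL = working_capital / (current_ratio - 1)
implied_current_liabilities = working_capital / (current_ratio - 1)
implied_current_assets = current_ratio * implied_current_liabilities

# Assumed definition: cash ratio = cash / current liabilities.
implied_cash_ratio = cash / implied_current_liabilities

print("equity matches packet:", equity == reported_equity)
print("implied current liabilities:", round(implied_current_liabilities))
print("implied cash ratio:", round(implied_cash_ratio, 2))
```

Under these assumptions the reported equity, working capital, current ratio, and cash ratio are mutually consistent: liabilities exceed assets, and cash covers roughly a fifth of the implied current liabilities.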

This sample task is provided for illustration only. Domain scores represent the average across 150 held-out anonymized cases. Sample tasks are passed to models with the full Flex credit policy document.
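
The four outputs requested in the sample task (T1 to T4) map naturally onto a small structured record. The following is a minimal sketch of one way to check that a model response is well-formed before it is scored. The class name `UnderwritingResponse`, its field names, and the validation rules are hypothetical; they are not Flex's published response format or grading logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UnderwritingResponse:
    """Hypothetical container for the four requested outputs (T1 to T4)."""
    decision: str                      # T1: "approved" or "declined"
    credit_limit_usd: Optional[float]  # T2: recommended limit, if approved
    risk_factors: List[str]            # T3: top 3 risk factors
    narrative: str                     # T4: short underwriting narrative

    def problems(self) -> List[str]:
        """Return format problems; an empty list means well-formed."""
        found = []
        if self.decision not in ("approved", "declined"):
            found.append("T1 must be 'approved' or 'declined'")
        if self.decision == "approved" and (
            self.credit_limit_usd is None or self.credit_limit_usd <= 0
        ):
            found.append("T2 needs a positive limit when T1 is 'approved'")
        if len(self.risk_factors) != 3:
            found.append("T3 must list exactly 3 risk factors")
        if not self.narrative.strip():
            found.append("T4 narrative must not be empty")
        return found


# Placeholder values for illustration only; not a reference answer.
draft = UnderwritingResponse(
    decision="approved",
    credit_limit_usd=None,
    risk_factors=["factor A", "factor B"],
    narrative="Placeholder narrative.",
)
print(draft.problems())
```

Here the draft is flagged twice: it approves without a limit and lists only two risk factors. Checking the format separately from the content keeps malformed responses from being confused with wrong underwriting decisions.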