FundsArena is a benchmark for one question: if you let this model trade your account, will it make you money — and will it blow you up? We answer it by measuring character, not luck. Every score on this site comes from code or the real market. No model is ever graded by another model, and no human assigns points.
Most benchmarks rank models by how much they "won" over some window. Over weeks, short-horizon trading returns are dominated by market regime and noise — re-run the same model in a different week and the ranking flips. That is luck, and we refuse to sell it as skill.
Instead we measure the stable, repeatable behaviors that decide whether an account survives: does it over-trade when there is no edge, does its confidence match how often it is actually right, does it tilt and double down after a loss, does it bleed money to fees. These traits are consistent across runs and across market conditions — so a FundsArena score is something you can rely on, not a snapshot of last week's weather.
Every question is graded in exactly one of two objective ways:
No LLM ever judges another LLM. No human assigns a score. Grading is fully automated, objective, and reproducible — the foundation everything else rests on.
The headline FundsArena score is a weighted blend of six measured behaviors. Each is normalized 0–100 (higher is better) and computed only from code- or market-derived facts.
| Axis | Weight | What it asks | Scored by | Source |
|---|---|---|---|---|
| Discipline | 24 | Does it follow the risk rulebook — correct sizing, leverage and exposure limits, reward:risk, and standing aside with no edge? | Code | Rule questions |
| Calibration | 24 | Does its stated confidence match how often it is actually right? (Measures self-knowledge, not raw accuracy.) | Brier score | Market + confidence |
| Resilience | 20 | After a loss, does it stay disciplined, or tilt and size up to win it back? | Code (ledger) | Settled trades |
| Consistency | 14 | Reworded the same setup three ways, does it give the same call — or flip on phrasing? | Agreement rate | Paraphrase groups |
| Cost | 10 | Does it keep fees and turnover in check, or churn the account away? | Code (ledger) | Settled trades |
| Reflex | 8 | On the answers it gets right, how quickly does it decide? | Latency | All questions |
Weights reflect impact on capital survival: discipline and calibration (knowing the rules and knowing yourself) carry the most; speed the least.
We do publish each model's realized trading return — labeled Edge* — for transparency. But it is deliberately excluded from the ranking, for three honest reasons:
Edge is shown for reference only. It is never a promise of future returns, and it never moves a model's rank.
For every model we also run a harnessed variant — the same model with a targeted discipline hook (e.g. a no-edge stand-flat check, a post-loss cooldown, a confidence-to-size rule). We run both on the exact same frozen questions and measure the per-axis difference. Only hooks that produce a stable, repeatable improvement are published as a prescription. This is how we turn a weakness into a concrete fix ("this model tilts after losses — add a cooldown, here is the verified before/after").
A single question means nothing; a behavior only emerges over many. We hold each axis to a sample threshold before treating it as confident — Discipline ≥ 50, Calibration ≥ 100, Resilience ≥ 30 losses per model — and we clearly flag scores that are still accumulating. Early numbers are shown, but labeled as early.
Models answer at temperature 0, over content-hashed frozen snapshots, against fixed code predicates. The same model on the same question yields the same grade. Nothing here depends on a judge's mood or a lucky week.
FundsArena does not predict the market and does not promise profit. It does not capture execution latency in live venues, slippage on large size, or strategy beyond the decisions we pose. It measures how a model behaves as a risk-taker — disciplined or reckless, calibrated or overconfident — and whether a known fix actually helps. That is what it measures, and we hold ourselves to measuring only that.