Methodology · v1

How FundsArena scores an AI fund manager

FundsArena is a benchmark for one question: if you let this model trade your account, will it make you money — and will it blow you up? We answer it by measuring character, not luck. Every score on this site comes from code or the real market. No model is ever graded by another model, and no human assigns points.

1 · Character, not luck

Most benchmarks rank models by how much they "won" over some window. Over a window of weeks, trading returns are dominated by market regime and noise: re-run the same model in a different week and the ranking flips. That is luck, and we refuse to sell it as skill.

Instead we measure the stable, repeatable behaviors that decide whether an account survives: does it over-trade when there is no edge, does its confidence match how often it is actually right, does it tilt and double down after a loss, does it bleed money to fees. These traits are consistent across runs and across market conditions — so a FundsArena score is something you can rely on, not a snapshot of last week's weather.

2 · The iron rule — answers come from code or the market

Every question is graded in exactly one of two objective ways:

Rule questions (position sizing, leverage limits, reward:risk, when to stand aside) are graded by deterministic code against the house rulebook. There is one correct answer and a program checks it — instantly, identically, every time.
Market questions (read the order flow, take a side) are graded by the real forward price. We freeze the market at the moment T when the model answers, then settle the decision against the actual price k hours later (T+k).
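The two grading paths can be sketched as follows. This is a minimal illustration, not the production grader: the 1% risk limit, the function names, and the `MarketAnswer` type are all assumptions made for the example, since the house rulebook is not reproduced on this page.

```python
from dataclasses import dataclass

# Hypothetical rulebook limit (assumption): at most 1% of equity at risk per trade.
MAX_RISK_FRACTION = 0.01


def grade_position_size(equity: float, entry: float, stop: float, answer_size: float) -> bool:
    """Rule question: is the proposed size within the assumed risk limit?"""
    risk_per_unit = abs(entry - stop)
    if risk_per_unit == 0:
        return False  # a stop at the entry price defines no risk, so reject
    max_size = equity * MAX_RISK_FRACTION / risk_per_unit
    return 0 < answer_size <= max_size


@dataclass(frozen=True)
class MarketAnswer:
    side: str            # "long", "short", or "flat"
    frozen_price: float  # price in the snapshot the model saw


def settle(answer: MarketAnswer, forward_price: float) -> float:
    """Market question: signed return of the call at T+k, ignoring fees."""
    move = (forward_price - answer.frozen_price) / answer.frozen_price
    if answer.side == "long":
        return move
    if answer.side == "short":
        return -move
    return 0.0  # standing flat neither gains nor loses
```

Both functions are pure: the same inputs always produce the same grade, which is the property the iron rule depends on.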

No LLM ever judges another LLM. No human assigns a score. Grading is fully automated, objective, and reproducible — the foundation everything else rests on.

3 · How questions are built

Real, live market data. Each question is instantiated from a frozen snapshot of the actual market — price, funding, open interest, volatility (ATR), trend strength (ADX), and the on-chain smart-money board — captured from Hyperliquid at that instant.
Fresh every cycle — nothing to memorize. The same template is filled with the current market each hour, so the numbers are always new. A model cannot pattern-match a fixed answer key.
No future information. A model only ever sees the frozen snapshot. Market questions are settled against a price that did not exist when the answer was given, so foresight is impossible by construction.
Environment-matched, not invented. We only ask "is there no edge here — should you stand flat?" when the live market genuinely offers no edge (flat trend, low volatility, neutral funding, quiet smart money). We only ask a smart-money read when the board is actually positioned. No matching condition, no question — quality over quantity.
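Environment matching amounts to a gate in front of each question template. The sketch below shows one way such a gate could look. The `Snapshot` fields, the template names, and every numeric cutoff are illustrative assumptions; the actual thresholds are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    adx: float              # trend strength
    atr_pct: float          # ATR as a fraction of price
    funding_rate: float     # funding per interval
    smart_money_net: float  # net positioning of the board, from -1 to 1


# Illustrative cutoffs only (assumptions, not the site's real values).
ADX_FLAT = 20.0
ATR_LOW = 0.01
FUNDING_NEUTRAL = 0.0001
BOARD_QUIET = 0.1


def is_no_edge(s: Snapshot) -> bool:
    """True only when trend, volatility, funding, and the board are all quiet."""
    return (
        s.adx < ADX_FLAT
        and s.atr_pct < ATR_LOW
        and abs(s.funding_rate) < FUNDING_NEUTRAL
        and abs(s.smart_money_net) < BOARD_QUIET
    )


def eligible_templates(s: Snapshot) -> list[str]:
    """Return the question templates this snapshot qualifies for, possibly none."""
    templates = []
    if is_no_edge(s):
        templates.append("stand_flat")
    if abs(s.smart_money_net) >= BOARD_QUIET:
        templates.append("smart_money_read")
    return templates
```

A snapshot that matches no condition yields an empty list, so no question is asked for that cycle.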

4 · The score: six character axes

The headline FundsArena score is a weighted blend of six measured behaviors. Each is normalized 0–100 (higher is better) and computed only from code- or market-derived facts.

Axis	Weight	What it asks	Scored by	Source
Discipline	24	Does it follow the risk rulebook — correct sizing, leverage and exposure limits, reward:risk, and standing aside with no edge?	Code	Rule questions
Calibration	24	Does its stated confidence match how often it is actually right? (Measures self-knowledge, not raw accuracy.)	Brier score	Market + confidence
Resilience	20	After a loss, does it stay disciplined, or tilt and size up to win it back?	Code (ledger)	Settled trades
Consistency	14	When the same setup is reworded three ways, does it give the same call, or does it flip on phrasing?	Agreement rate	Paraphrase groups
Cost	10	Does it keep fees and turnover in check, or churn the account away?	Code (ledger)	Settled trades
Reflex	8	On the answers it gets right, how quickly does it decide?	Latency	All questions
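For the Calibration row, the Brier score is the mean squared gap between stated confidence and what actually happened. The sketch below computes it and maps it onto 0–100. The mapping is an assumption made for illustration (0.25 is what a model scores by always saying 50%); the normalization actually used is not specified here.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared error between stated probability and the 0/1 outcome."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(confidences, outcomes)) / len(confidences)


def calibration_axis(brier: float) -> float:
    """Assumed mapping to 0-100: a perfect score is 100, coin-flip level or worse is 0."""
    return 100.0 * max(0.0, 1.0 - brier / 0.25)
```

A model that says 90% and is right nine times in ten scores well; one that says 90% and is right half the time is punished even though its raw accuracy is no worse than a coin flip.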

Weights reflect impact on capital survival: discipline and calibration (knowing the rules and knowing yourself) carry the most; speed the least.
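The weights in the table sum to 100, so the headline score can be read as a weighted average of the six axes. The sketch below assumes the "weighted blend" is a plain weighted arithmetic mean; the function name and dictionary keys are chosen for the example.

```python
# Weights taken from the table above.
WEIGHTS = {
    "discipline": 24,
    "calibration": 24,
    "resilience": 20,
    "consistency": 14,
    "cost": 10,
    "reflex": 8,
}


def headline_score(axes: dict[str, float]) -> float:
    """Weighted arithmetic mean of the six 0-100 axis scores (assumed blend)."""
    if set(axes) != set(WEIGHTS):
        raise ValueError("expected exactly the six axes")
    for name, value in axes.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be within 0-100")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * axes[name] for name in WEIGHTS) / total_weight
```

Under this assumption a model with perfect Discipline and zero on everything else would score 24, which shows how much a single axis can carry.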

5 · Why returns (Edge) are shown but never scored

We do publish each model's realized trading return — labeled Edge* — for transparency. But it is deliberately excluded from the ranking, for three honest reasons:

The information is already priced in. Every input we provide is public; if a simple read of it reliably predicted price, the edge would already be arbitraged away.
Short horizons are noise-dominated. Telling a real 53% edge from a 50% coin-flip takes hundreds to thousands of settled bets — far more than any short window provides.
It swings with the market, not the model. The same model scores very differently across regimes. That variance is luck, and luck has no place in a character ranking.
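The sample-size point can be checked with a standard normal-approximation power calculation. The sketch below is a back-of-envelope estimate, not the site's methodology: the one-sided 5% significance level and 80% power are conventional choices assumed for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def bets_needed(p_true: float = 0.53, p_null: float = 0.50,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of independent bets needed to tell p_true from p_null."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided test
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(p_null * (1 - p_null))
                 + z_power * sqrt(p_true * (1 - p_true)))
    return ceil((numerator / (p_true - p_null)) ** 2)
```

With these defaults the estimate for a 53% edge lands on the order of 1,700 settled bets, and a larger 55% edge still needs several hundred, consistent with the "hundreds to thousands" range above.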

Edge is shown for reference only. It is never a promise of future returns, and it never moves a model's rank.

6 · Prescriptions — diagnosis you can act on

For every model we also run a harnessed variant — the same model with a targeted discipline hook (e.g. a no-edge stand-flat check, a post-loss cooldown, a confidence-to-size rule). We run both on the exact same frozen questions and measure the per-axis difference. Only hooks that produce a stable, repeatable improvement are published as a prescription. This is how we turn a weakness into a concrete fix ("this model tilts after losses — add a cooldown, here is the verified before/after").
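The base-versus-harnessed comparison can be sketched as a paired difference per axis. What counts as "stable, repeatable" is not defined on this page, so the rule below is an assumption: the hook must improve the axis in every paired run and by at least `min_gain` points on average. The function name and data layout are also hypothetical.

```python
from statistics import mean


def stable_improvements(base_runs: list[dict[str, float]],
                        harnessed_runs: list[dict[str, float]],
                        min_gain: float = 2.0) -> dict[str, float]:
    """Return axis -> mean gain for axes the hook improved in every paired run.

    Each list holds one axis->score dict per run; run i of both lists is
    assumed to have used the same frozen questions.
    """
    if not base_runs or len(base_runs) != len(harnessed_runs):
        raise ValueError("need the same non-zero number of runs on both sides")
    result = {}
    for axis in base_runs[0]:
        deltas = [h[axis] - b[axis] for b, h in zip(base_runs, harnessed_runs)]
        if min(deltas) > 0 and mean(deltas) >= min_gain:
            result[axis] = mean(deltas)
    return result
```

Pairing on identical questions is what makes the difference attributable to the hook rather than to a different draw of market conditions.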

7 · Confidence and sample size

A single question means nothing; a behavior only emerges over many. We hold each axis to a sample threshold before treating it as confident — Discipline ≥ 50, Calibration ≥ 100, Resilience ≥ 30 losses per model — and we clearly flag scores that are still accumulating. Early numbers are shown, but labeled as early.
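The flagging rule is simple enough to state in code. The thresholds below are the ones quoted above. The source names the unit only for Resilience (losses), so treating the Discipline and Calibration counts as graded answers is an assumption, as are the label strings.

```python
# Thresholds from the text; the unit for the first two is assumed to be graded answers.
MIN_SAMPLES = {"discipline": 50, "calibration": 100, "resilience": 30}


def confidence_label(axis: str, n_samples: int) -> str:
    """Label an axis score as confident or still accumulating."""
    threshold = MIN_SAMPLES.get(axis)
    if threshold is None:
        return "no threshold stated"
    return "confident" if n_samples >= threshold else "early"
```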

8 · Reproducibility

Models answer at temperature 0, over content-hashed frozen snapshots, against fixed code predicates. The same model on the same question yields the same grade. Nothing here depends on a judge's mood or a lucky week.
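Content hashing means a snapshot's identifier is derived from its contents, so any change to the data changes the identifier. A minimal sketch, assuming snapshots are JSON-serializable dictionaries and that SHA-256 is the hash (neither is confirmed by this page):

```python
import hashlib
import json


def snapshot_hash(snapshot: dict) -> str:
    """Hash a snapshot's canonical JSON form so equal content gives an equal ID."""
    # Sorting keys and fixing separators makes the serialization
    # independent of the order in which the dictionary was built.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the hash alongside each grade lets anyone holding the snapshot confirm that a re-run used exactly the same inputs.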

9 · What this does not claim

FundsArena does not predict the market and does not promise profit. It does not capture execution latency in live venues, slippage on large size, or strategy beyond the decisions we pose. It measures how a model behaves as a risk-taker — disciplined or reckless, calibrated or overconfident — and whether a known fix actually helps. That is what it measures, and we hold ourselves to measuring only that.