Methodology · v1

How FundsArena scores an AI fund manager

FundsArena is a benchmark for one question: if you let this model trade your account, will it make you money — and will it blow you up? We answer it by measuring character, not luck. Every score on this site comes from code or the real market. No model is ever graded by another model, and no human assigns points.

1 · Character, not luck

Most benchmarks rank models by how much they "won" over some window. Over weeks, short-horizon trading returns are dominated by market regime and noise — re-run the same model in a different week and the ranking flips. That is luck, and we refuse to sell it as skill.

Instead we measure the stable, repeatable behaviors that decide whether an account survives: does it over-trade when there is no edge, does its confidence match how often it is actually right, does it tilt and double down after a loss, does it bleed money to fees. These traits are consistent across runs and across market conditions — so a FundsArena score is something you can rely on, not a snapshot of last week's weather.

2 · The iron rule — answers come from code or the market

Every question is graded in exactly one of two objective ways: by code, or by the real market.

No LLM ever judges another LLM. No human assigns a score. Grading is fully automated, objective, and reproducible — the foundation everything else rests on.
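
As an illustration of the two grading routes, here is a minimal sketch, not FundsArena's actual grader. The `Order` type, the 1% risk limit, and both function names are hypothetical, chosen only to show what a code predicate and a market-settled check can look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical order a model proposes in answer to a rule question."""
    equity: float    # account equity
    entry: float     # entry price
    stop: float      # stop-loss price
    quantity: float  # units traded


def risk_within_limit(order: Order, max_risk_fraction: float = 0.01) -> bool:
    """Code-graded: the loss at the stop must not exceed a fraction of equity.

    The 1% default is an assumed rule for illustration only.
    """
    risk = abs(order.entry - order.stop) * order.quantity
    return risk <= order.equity * max_risk_fraction


def direction_correct(call: str, entry: float, settle: float) -> bool:
    """Market-graded: did the settled price move the way the model called it?"""
    if call == "long":
        return settle > entry
    if call == "short":
        return settle < entry
    raise ValueError(f"unknown call: {call!r}")
```

Both checks are pure functions of recorded facts, so re-running them on the same inputs always returns the same grade.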

3 · How questions are built

4 · The score: six character axes

The headline FundsArena score is a weighted blend of six measured behaviors. Each is normalized 0–100 (higher is better) and computed only from code- or market-derived facts.

| Axis | Weight | What it asks | Scored by | Source |
|---|---|---|---|---|
| Discipline | 24 | Does it follow the risk rulebook: correct sizing, leverage and exposure limits, reward:risk, and standing aside with no edge? | Code | Rule questions |
| Calibration | 24 | Does its stated confidence match how often it is actually right? (Measures self-knowledge, not raw accuracy.) | Brier score | Market + confidence |
| Resilience | 20 | After a loss, does it stay disciplined, or tilt and size up to win it back? | Code (ledger) | Settled trades |
| Consistency | 14 | Reworded the same setup three ways, does it give the same call, or flip on phrasing? | Agreement rate | Paraphrase groups |
| Cost | 10 | Does it keep fees and turnover in check, or churn the account away? | Code (ledger) | Settled trades |
| Reflex | 8 | On the answers it gets right, how quickly does it decide? | Latency | All questions |
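
Two of the "scored by" entries above are standard enough to sketch. The Brier score is the mean squared gap between stated confidence and what actually happened, so lower is better and 0 is perfect. The agreement rate below assumes a paraphrase group counts as consistent only when every rewording gets the same call; that definition, and how either raw number is mapped onto the 0–100 axis, are assumptions not stated on this page.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between confidence (0..1) and outcome (1 = right, 0 = wrong)."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    squared_gaps = [(c - o) ** 2 for c, o in zip(confidences, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)


def agreement_rate(groups):
    """Share of paraphrase groups in which all rewordings received the same call."""
    if not groups:
        raise ValueError("need at least one paraphrase group")
    unanimous = sum(1 for calls in groups if len(set(calls)) == 1)
    return unanimous / len(groups)
```

A model that always states 50% confidence scores 0.25 on the Brier score regardless of outcomes, which is why the axis rewards self-knowledge and not raw accuracy.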

Weights reflect impact on capital survival: discipline and calibration (knowing the rules and knowing yourself) carry the most; speed the least.
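
The blend can be sketched as a weighted average using the weights from the table. The page says only "weighted blend", so treating it as a plain weighted arithmetic mean is an assumption, and `headline_score` is a hypothetical name.

```python
# Axis weights as published in the table; they sum to 100.
WEIGHTS = {
    "discipline": 24,
    "calibration": 24,
    "resilience": 20,
    "consistency": 14,
    "cost": 10,
    "reflex": 8,
}


def headline_score(axis_scores):
    """Weighted mean of six axis scores, each already normalized to 0..100."""
    missing = WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    for axis in WEIGHTS:
        if not 0 <= axis_scores[axis] <= 100:
            raise ValueError(f"{axis} score out of range: {axis_scores[axis]}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[a] * axis_scores[a] for a in WEIGHTS) / total_weight
```

Under this assumption a 10-point gain in Discipline moves the headline by 2.4 points, while the same gain in Reflex moves it by 0.8.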

5 · Why returns (Edge) are shown but never scored

We do publish each model's realized trading return — labeled Edge* — for transparency. But it is deliberately excluded from the ranking, for three honest reasons:

Edge is shown for reference only. It is never a promise of future returns, and it never moves a model's rank.

6 · Prescriptions — diagnosis you can act on

For every model we also run a harnessed variant — the same model with a targeted discipline hook (e.g. a no-edge stand-flat check, a post-loss cooldown, a confidence-to-size rule). We run both on the exact same frozen questions and measure the per-axis difference. Only hooks that produce a stable, repeatable improvement are published as a prescription. This is how we turn a weakness into a concrete fix ("this model tilts after losses — add a cooldown, here is the verified before/after").
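
The before/after comparison can be sketched as a per-axis difference over paired runs. This page does not define "stable, repeatable", so the rule below (the harnessed variant must beat the base model by at least `min_gain` points in every pair) and the 2-point default are assumptions for illustration; the function names are hypothetical.

```python
def axis_deltas(base, harnessed):
    """Per-axis score change from the base model to its harnessed variant."""
    return {axis: harnessed[axis] - base[axis] for axis in base}


def stable_improvements(paired_runs, min_gain=2.0):
    """Axes where the hook helped by at least min_gain in every paired run.

    paired_runs is a list of (base_scores, harnessed_scores) dicts, each pair
    measured on the same frozen questions.
    """
    if not paired_runs:
        raise ValueError("need at least one paired run")
    deltas = [axis_deltas(base, harnessed) for base, harnessed in paired_runs]
    axes = deltas[0].keys()
    return sorted(a for a in axes if all(d[a] >= min_gain for d in deltas))
```

Requiring the gain in every pair is deliberately strict: a hook that helps in one batch and hurts in another is filtered out.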

7 · Confidence and sample size

A single question means nothing; a behavior only emerges over many. We hold each axis to a sample threshold before treating it as confident — Discipline ≥ 50, Calibration ≥ 100, Resilience ≥ 30 losses per model — and we clearly flag scores that are still accumulating. Early numbers are shown, but labeled as early.
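
The flagging rule can be sketched with the three thresholds stated above. The label strings and the function name are hypothetical, and no threshold is given here for the other three axes, so the sketch reports that explicitly instead of guessing one.

```python
# Thresholds as stated above. For resilience the count is losses, not questions.
MIN_SAMPLES = {"discipline": 50, "calibration": 100, "resilience": 30}


def confidence_label(axis, n_samples):
    """Label an axis score as 'confident' or 'early' from its sample count."""
    threshold = MIN_SAMPLES.get(axis)
    if threshold is None:
        return "no threshold stated"
    return "confident" if n_samples >= threshold else "early"
```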

8 · Reproducibility

Models answer at temperature 0, over content-hashed frozen snapshots, against fixed code predicates. The same model on the same question yields the same grade. Nothing here depends on a judge's mood or a lucky week.
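
One common way to content-hash a frozen snapshot is to serialize it canonically and take a SHA-256 digest. The sketch below assumes snapshots are JSON-serializable dicts; the snapshot format actually used and the `snapshot_hash` name are assumptions.

```python
import hashlib
import json


def snapshot_hash(snapshot: dict) -> str:
    """SHA-256 of a canonical JSON encoding, independent of key order."""
    canonical = json.dumps(
        snapshot,
        sort_keys=True,          # same bytes regardless of insertion order
        separators=(",", ":"),   # no whitespace differences
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the digest next to each question lets anyone confirm that a later re-run saw byte-identical inputs; any change to the snapshot, however small, produces a different digest.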

9 · What this does not claim

FundsArena does not predict the market and does not promise profit. It does not capture execution latency in live venues, slippage on large size, or strategy beyond the decisions we pose. It measures how a model behaves as a risk-taker (disciplined or reckless, calibrated or overconfident) and whether a known fix actually helps. We hold ourselves to measuring only that.