T³ Atlas · benchmark library

Aggregated lm-eval-harness results across the T³ architectural lineage and canonical baselines. Each row shows the best checkpoint per (model, task); hover any model name for the lineage / methodology context note.

Headline — T³ vs same-data baseline at 124M params
computing…
All comparisons are against gpt2 / 124M_vanilla_5b_same_data — the only baseline in the library trained on the identical 5B-token data mix as the T³ rows. Cross-corpus comparisons (Qwen, SmolLM) shown in the panels but excluded from headline ratios. Compute = params × training_tokens.
methodology — what we are and aren't claiming

Training data mix (T³ runs): "Ultimate Mix 5B" — fineweb_edu 40%, dclm 20%, stack_edu 10%, finemath 10%, cosmopedia 10%, wikipedia 10% (Qwen substrate adds wiki). Total budget: 5B tokens unless otherwise tagged.

Validation perplexity (val_ppl column below) is on WikiText-103, which is held out of the training mix. It's an OOD generalization measure, surfaced as provenance context — to identify the checkpoint and confirm the run completed — not as a headline.

We are not claiming best-in-class perplexity at any scale. PPL competitiveness on a single OOD corpus is not what T³ is for, and there is direct empirical evidence in the library that PPL and downstream capability decouple: in the v3.7+ Phase 1A σ-MLP width sweep (sh16 / sh32 / sh64), val PPL improved monotonically with σ width while downstream benchmark accuracy did not track — wider σ produced lower PPL but the same or worse benchmark scores. (Cortex memory 634d24dfb0822b05, 2026-04-30.)

The headline claim is parameter-and-compute efficiency on capability benchmarks against a same-data baseline. PPL is shown for reproducibility, not for ranking.

epistemic tiers — what the badges mean
verified ckpt re-evaluated 2026-05-01+, training log parsed, no known bugs, attribution independently confirmed
probable provenance sound, no known bugs, no independent re-eval performed
caveated provenance sound BUT a known bug or limitation affects interpretation; cite with caveat
suspect provenance unclear or attribution mismatch; do not cite without re-verification
archival known broken or pre-release; lineage continuity only, not for benchmark interpretation
Hover any badge to see the row's specific caveat tags.

Parameter efficiency — one panel per task

log-x: parameter count · y: accuracy · larger dots = more recent T³ versions. Cells where T³ at smaller scale matches or beats baselines at larger scale are the headline result.

Compute frontier — score vs params × tokens

log-x: total training compute (params × tokens, proxy for FLOPs) · y: accuracy. Dashed line + shaded region = Pareto frontier across all models in the library. Points on the frontier (white outline) define state-of-the-art per task at that compute level. Each panel reports the cleanest same-data apples-to-apples comparison: a T³ row paired with the vanilla baseline trained on the identical data mix. We avoid quoting compute-equivalence ratios against cross-corpus baselines (e.g. Qwen, SmolLM), which are trained on different data at very different scales and would produce misleading absolute ratios. Models without confirmed token counts are omitted.

Headline table

Best score per model × task. Click any column header to sort. Hover any model name for full lineage notes.

Known gaps + re-run targets