Aggregated lm-eval-harness results across the T³ architectural lineage and canonical baselines. Each row shows the best checkpoint per (model, task); hover any model name for the lineage / methodology context note.
Training data mix (T³ runs): "Ultimate Mix 5B" — fineweb_edu 40%, dclm 20%, stack_edu 10%, finemath 10%, cosmopedia 10%, wikipedia 10% (Qwen substrate adds wiki). Total budget: 5B tokens unless otherwise tagged.
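As a rough illustration of what the mix means in absolute terms, the sketch below splits the stated 5B-token budget across the six sources. The function name `tokens_per_source` and the `MIX` dictionary are illustrative names, not part of the T³ codebase; only the percentages and the 5B total come from the text above.

```python
# Mix fractions as stated in the text; names of the dict and function are hypothetical.
MIX = {
    "fineweb_edu": 0.40,
    "dclm": 0.20,
    "stack_edu": 0.10,
    "finemath": 0.10,
    "cosmopedia": 0.10,
    "wikipedia": 0.10,
}


def tokens_per_source(total_tokens: int, mix: dict) -> dict:
    """Split a total token budget across sources according to the mix fractions."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return {name: round(total_tokens * frac) for name, frac in mix.items()}


# 5B tokens is the default budget stated above ("unless otherwise tagged").
budget = tokens_per_source(5_000_000_000, MIX)
for name, n_tokens in budget.items():
    print(f"{name:12s} {n_tokens / 1e9:.1f}B tokens")
```

Runs tagged with a different total budget would pass a different `total_tokens`; the fractions are assumed unchanged in that case.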
Validation perplexity (the val_ppl column below) is measured on WikiText-103, which is held out of the training mix. It is an out-of-distribution (OOD) generalization measure, shown as provenance context (to identify the checkpoint and confirm the run completed), not as a headline result.
We do not claim best-in-class perplexity at any scale. Competitive PPL on a single OOD corpus is not the goal of T³, and the library contains direct empirical evidence that PPL and downstream capability decouple: in the v3.7+ Phase 1A σ-MLP width sweep (sh16 / sh32 / sh64), val PPL improved monotonically with σ width, while downstream benchmark scores stayed the same or got worse. (Cortex memory 634d24dfb0822b05, 2026-04-30.)
The headline claim is parameter-and-compute efficiency on capability benchmarks against a same-data baseline. PPL is shown for reproducibility, not for ranking.
log-x: parameter count · y: accuracy · larger dots = more recent T³ versions. Cells where T³ at smaller scale matches or beats baselines at larger scale are the headline result.
log-x: total training compute (params × tokens, a proxy for FLOPs) · y: accuracy. Dashed line + shaded region = Pareto frontier across all models in the library. Points on the frontier (white outline) are the best accuracy in the library for that task at that compute level. Each panel reports the cleanest same-data comparison available: a T³ row paired with the vanilla baseline trained on the identical data mix. We avoid quoting compute-equivalence ratios against cross-corpus baselines (e.g. Qwen, SmolLM), which are trained on different data at very different scales and would produce misleading absolute ratios. Models without confirmed token counts are omitted.
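The frontier described above can be sketched as follows, assuming the usual definition: a model is on the frontier if no other model reaches strictly higher accuracy at equal or lower compute. The `Run` type, the function names, and every number in `RUNS` are made-up placeholders for illustration, not results from the library, and the plot's actual implementation may differ.

```python
from typing import NamedTuple, Optional


class Run(NamedTuple):
    name: str
    params: float            # parameter count
    tokens: Optional[float]  # training tokens; None if not confirmed
    accuracy: float          # task accuracy in [0, 1]


def compute_proxy(run: Run) -> float:
    """Params x tokens, the FLOPs proxy used on the x-axis."""
    return run.params * run.tokens


def pareto_frontier(runs: list) -> list:
    """Return the runs not dominated in (lower compute, higher accuracy)."""
    # Models without confirmed token counts are omitted, as in the plot.
    known = [r for r in runs if r.tokens is not None]
    # Sort by compute ascending; break ties by accuracy descending.
    ordered = sorted(known, key=lambda r: (compute_proxy(r), -r.accuracy))
    frontier = []
    best_acc = float("-inf")
    for r in ordered:
        # Keep a run only if it beats every cheaper-or-equal run seen so far.
        if r.accuracy > best_acc:
            frontier.append(r)
            best_acc = r.accuracy
    return frontier


# Placeholder values only; these are NOT measured results.
RUNS = [
    Run("model_a", 1e8, 5e9, 0.40),
    Run("model_b", 2e8, 5e9, 0.38),   # more compute than model_a, lower accuracy
    Run("model_c", 4e8, 5e9, 0.45),
    Run("model_d", 1e9, 1e12, 0.44),  # far more compute than model_c, lower accuracy
    Run("model_e", 3e8, None, 0.50),  # token count unknown, so excluded
]

frontier_names = [r.name for r in pareto_frontier(RUNS)]
print(frontier_names)
```

Because the frontier is computed per task, a sketch like this would be run once per panel on that task's accuracies.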
Best score per model × task. Click any column header to sort. Hover any model name for full lineage notes.
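The "best score per model × task" reduction amounts to a grouped maximum. A minimal sketch follows, assuming each evaluation record carries a model, task, checkpoint, and score, and that a higher score is better. The field names and the example rows are hypothetical placeholders, not the library's schema or data.

```python
def best_per_model_task(rows: list) -> dict:
    """Keep the highest-scoring record for each (model, task) pair."""
    best = {}
    for row in rows:
        key = (row["model"], row["task"])
        # Assumes higher score is better; ties keep the first record seen.
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = row
    return best


# Placeholder records for illustration only; not real evaluation results.
ROWS = [
    {"model": "model_a", "task": "task_x", "checkpoint": "step_1000", "score": 0.31},
    {"model": "model_a", "task": "task_x", "checkpoint": "step_2000", "score": 0.35},
    {"model": "model_a", "task": "task_y", "checkpoint": "step_1000", "score": 0.52},
    {"model": "model_a", "task": "task_y", "checkpoint": "step_2000", "score": 0.50},
]

best = best_per_model_task(ROWS)
for (model, task), row in sorted(best.items()):
    print(model, task, row["checkpoint"], row["score"])
```

Note that under this reduction the best checkpoint can differ from task to task for the same model, as it does in the placeholder rows.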