Introduction
The transformer architecture, introduced in 2017 with the landmark paper Attention Is All You Need, has become the backbone of virtually every large language model (LLM) that has followed. From OpenAI’s GPT series to Anthropic’s Claude, Google’s Gemini, and Meta’s Llama, the attention mechanism, which lets a model weigh every token against every other token, has been the engine behind contextual understanding and generation. Yet as these models have scaled, the quadratic cost of attention in both compute and memory has emerged as a hard bottleneck: when a model must process documents spanning thousands or millions of tokens, attention becomes a computational Achilles’ heel.
In late October 2025, a small startup called Manifest AI announced a radical departure from this paradigm. Their new model, Brumby‑14B‑Base, is a retrained variant of the open‑source Qwen3‑14B‑Base in which every attention layer has been replaced with a novel mechanism called Power Retention. The result is an architecture that retains the expressive power of attention while achieving a constant per‑token cost that does not grow with sequence length. Remarkably, the model was retrained for just $4,000 on 32 Nvidia H100 GPUs, less than 2% of the cost of training a comparable transformer from scratch. This post explores the technical innovations, performance implications, and broader economic impact of Brumby‑14B, and asks whether the transformer’s dominance is finally being challenged.
From Attention to Retention
In a conventional transformer, each token generates a set of queries (Q), keys (K), and values (V). The attention operation then computes a similarity matrix between every pair of tokens, producing a weighted sum of values that captures global context. While this mechanism endows transformers with unparalleled flexibility, it also forces the model to perform O(n²) operations for a sequence of length n, and to store an O(n²) similarity matrix in memory.

Power Retention sidesteps this quadratic cost by replacing the global similarity operation with a recurrent state update. Each layer maintains a fixed‑size memory matrix S that is updated at every time step using the incoming key, value, and a learned gating signal. The update rule resembles that of a recurrent neural network (RNN), compressing past information into a latent state that can be accessed in constant time. Because the state update involves only local matrix operations, the per‑token cost remains constant regardless of whether the model processes a thousand or a million tokens. At the same time, the recurrence involves tensor powers of the input—hence the name “power retention”—which allows the model to capture higher‑order dependencies between past and present tokens. In effect, Power Retention offers a theoretically unbounded capacity for long‑term dependencies while preserving the computational efficiency of an RNN.
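To make the shape of this computation concrete, here is a minimal sketch in PyTorch of a gated, fixed‑state recurrence of the kind described above. It illustrates the general idea only, not Manifest AI’s implementation: the actual Power Retention update applies tensor powers of the keys and runs in fused kernels, and the function names, gating, and dimensions here are assumptions made for the example.

```python
import torch

def retention_step(S, k, v, g):
    """One recurrent update of a fixed-size state (illustrative, not the real update rule).

    S : (d_k, d_v) state matrix carrying compressed history
    k : (d_k,)     key for the current token
    v : (d_v,)     value for the current token
    g : scalar in (0, 1), stand-in for a learned gating signal
    """
    # Decay the old memory and write the new key/value association.
    return g * S + torch.outer(k, v)

def retention_read(S, q):
    """Read from the state with the current query: O(d_k * d_v) work, independent of sequence length."""
    return q @ S  # (d_v,)

# Toy usage: the per-token cost is the same at token 10 and token 10,000,
# because only the fixed-size state S is ever touched.
d_k, d_v, seq_len = 64, 64, 10_000
S = torch.zeros(d_k, d_v)
for _ in range(seq_len):
    k, v, q = torch.randn(d_k), torch.randn(d_v), torch.randn(d_k)
    g = torch.sigmoid(torch.tensor(0.5))  # placeholder for a learned gate
    S = retention_step(S, k, v, g)
    out = retention_read(S, q)
```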
Retraining Efficiency
One of the most striking aspects of Brumby‑14B is how quickly it recovers performance after the architectural swap. Manifest AI began with the weights of Qwen3‑14B‑Base, a transformer whose parameters had been trained to exploit attention dynamics. When the attention layers were removed and replaced with Power Retention, the existing weights no longer aligned with the new computation graph, and the model initially “forgot” how to apply its knowledge. The team addressed this mismatch by continuing training for only about 3,000 steps, roughly 60 hours on 32 H100 GPUs, allowing the weights to readjust to the retention‑based dynamics. The loss curves released by Manifest AI show that within those few thousand steps, Brumby’s training loss converged to that of the Qwen3 baseline, and the model recovered performance comparable to the original on downstream benchmarks. This rapid convergence suggests that attention‑free systems can inherit transformer knowledge at a fraction of the training time and cost, which could democratize large‑scale experimentation for smaller research groups.
Benchmark Performance
Across a suite of standard evaluation tasks, Brumby‑14B‑Base matches or surpasses transformer baselines of comparable scale on most benchmarks. On reasoning‑heavy benchmarks such as GSM8K and MATH, Brumby achieves scores on par with or better than Qwen3‑14B and GLM‑4.5‑Air. Its performance is particularly strong on long‑context reasoning tasks, reflecting the advantage of a constant per‑token cost. While it lags slightly behind transformers on knowledge‑dense evaluations like MMLU‑Pro, the gap is modest and may close as the architecture matures. The pattern suggests that recurrent or retention‑based systems may hold a structural advantage for tasks that require reasoning over extended temporal or logical dependencies, a regime where the quadratic cost of attention becomes the limiting factor.
Hardware and Inference
Power Retention’s design translates directly into hardware efficiency. Because the state update requires only local matrix operations, the per‑token cost stays constant and total inference cost grows linearly with sequence length. Manifest AI’s in‑house CUDA framework, Vidrial, reportedly sustains 80–85% hardware utilization, compared with roughly 70–75% for FlashAttention2 and 50–60% for Mamba. The combination of higher utilization and far fewer FLOPs yields a reported 100× speedup over attention on very long sequences, although the team notes that large‑scale production workloads have yet to be fully stress‑tested. The kernels are written in Triton, making them compatible with both NVIDIA and AMD accelerators, and integration with inference engines such as vLLM remains an active area of development.
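To see why the gap widens with context length, the back‑of‑the‑envelope sketch below compares the total multiply‑accumulate count of a single attention head (quadratic in sequence length) with that of a fixed‑state recurrence (linear in sequence length). The head dimension, state size, and constant factors are illustrative assumptions, not measurements of Vidrial, FlashAttention2, or Mamba.

```python
def attention_macs(n, d):
    """Approximate multiply-accumulates for one attention head over a length-n sequence:
    the QK^T score matrix and the weighted sum over values each cost ~n^2 * d."""
    return 2 * n * n * d

def retention_macs(n, d_k, d_v):
    """Approximate multiply-accumulates for a fixed-state recurrence:
    each token updates and reads a d_k x d_v state, so cost grows linearly in n."""
    return 2 * n * d_k * d_v

d = 128  # assumed head / state dimension
for n in (1_000, 100_000, 1_000_000):
    ratio = attention_macs(n, d) / retention_macs(n, d, d)
    print(f"n={n:>9,}  attention/retention cost ratio ~ {ratio:,.0f}x")
# Prints roughly 8x, 781x, and 7,812x: the ratio grows as n / d,
# which is why constant-state methods pull ahead on very long contexts.
```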
Economic Implications
The most eye‑catching statistic from Brumby‑14B’s release is the training cost: a 14‑billion‑parameter model retrained for roughly $4,000. This two‑order‑of‑magnitude reduction relative to training from scratch challenges the prevailing assumption that large foundation models are prohibitively expensive to build on. Manifest AI’s founder Jacob Buckman notes that retraining becomes easier as models scale; the number of steps required to retrain a model successfully decreases with its parameter count. While the company has not yet validated the cost of retraining a 700‑billion‑parameter model, Buckman projects a range of $10,000–$20,000, still far below the budgets required for transformer training at that scale. If these projections hold, the barrier to entry for large‑scale experimentation could drop dramatically, enabling a broader swath of researchers and companies to develop and fine‑tune powerful models.
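The headline figure is consistent with simple cloud‑pricing arithmetic. Assuming an on‑demand H100 rate of roughly $2 per GPU‑hour (an assumed rate; Manifest AI has not published its pricing), the 60‑hour, 32‑GPU run described above works out to about the reported cost:

```python
gpus = 32                 # H100s used for the retraining run
hours = 60                # approximate wall-clock time for ~3,000 steps
usd_per_gpu_hour = 2.0    # assumed on-demand H100 rate; actual rates vary by provider

cost = gpus * hours * usd_per_gpu_hour
print(f"~${cost:,.0f}")   # ~$3,840, in line with the reported $4,000
```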
Deployment and Integration
Converting an existing transformer checkpoint into a Power Retention model is designed to be straightforward. Manifest AI claims that, after installing the retention package, a single line of code swaps the attention layers and lets the model resume training from its previous checkpoint. After a modest number of GPU hours, the model typically recovers its original performance and gains the efficiency benefits of the new architecture. The team has also released specialized CUDA kernels and plans to integrate Power Retention into popular inference engines. Distributed inference is also expected to be simpler, because a fixed‑size recurrent state avoids the ever‑growing key‑value caches that complicate serving attention‑based models over long contexts.
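Manifest AI’s exact API is not shown in its announcement, so the sketch below is a hypothetical illustration of what such a swap could look like for a Hugging Face checkpoint. RetentionLayer is a stand‑in name, not the retention package’s real class, and the module paths (model.model.layers, self_attn) follow the usual Qwen‑style layout but should be verified against the model you load.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class RetentionLayer(nn.Module):
    """Hypothetical drop-in replacement for an attention block.
    The real retention package ships its own layer; this is only a placeholder."""
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)  # placeholder parameters

    def forward(self, hidden_states, **kwargs):
        # A real implementation would run the recurrent power-retention update here,
        # and must mirror the attention module's output signature (which varies
        # across transformers versions).
        return self.proj(hidden_states), None

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B-Base")

# Swap every attention module for the retention layer, keeping all other weights.
for layer in model.model.layers:
    layer.self_attn = RetentionLayer(model.config.hidden_size)

# From here, continue standard causal-LM training for a few thousand steps
# so the surrounding weights adapt to the new layer dynamics.
```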
Industry Reception
The announcement sparked immediate debate on X (formerly Twitter). Some researchers, including Meta’s Ariel @redtachyon, criticized the $4,000 claim as misleading, pointing out that the model was retrained from a pre‑existing transformer checkpoint rather than trained from scratch. Manifest AI’s response clarified that the $4,000 figure refers to the incremental cost of retraining, not the total cost of developing a transformer from the ground up. While the controversy highlighted the importance of transparency in reporting training costs, it also underscored the broader impact of Brumby‑14B’s approach: the possibility of building high‑performance LLMs at a fraction of the cost.
Conclusion
Brumby‑14B‑Base represents more than an engineering curiosity; it is a concrete demonstration that the transformer’s dominance may be vulnerable to a well‑engineered alternative. By replacing attention with Power Retention, Manifest AI has shown that near‑parity with comparable transformers is achievable at a fraction of the computational cost, while also easing the long‑context bottleneck without exotic hardware. The implications are twofold. First, the economics of training and serving large models could shift dramatically, lowering the barrier to entry for open research and smaller organizations. Second, the architectural diversity of AI models may expand again, reigniting theoretical and empirical exploration after nearly a decade of transformer monoculture. As Manifest AI’s founder Jacob Buckman puts it, “The end of the transformer era is not yet here. Our release is just one step forward in a long march toward the future.”
Call to Action
If you’re a researcher, engineer, or product manager interested in cutting‑edge LLM architectures, now is the time to explore Power Retention. Manifest AI has released the Brumby‑14B checkpoint, the retention package, and the Vidrial kernels, all of which can be integrated into existing workflows with minimal friction. By retraining a transformer checkpoint with Power Retention, you can achieve near‑state‑of‑the‑art performance while slashing training and inference costs. Join the conversation, experiment with the new architecture, and help shape the next chapter of large‑scale language modeling. The future of AI may well hinge on how efficiently we can model long‑term dependencies, and Power Retention offers a promising path forward.