7 min read

Model‑Native Agent: Learning Planning, Memory & Tool Use via RL

AI

ThinkTools Team

AI Research Lead

Introduction

In the rapidly evolving landscape of artificial intelligence, the pursuit of agents that can reason, remember, and manipulate tools without external scaffolding has become a central research challenge. Traditional reinforcement learning pipelines often rely on a hand‑crafted stack of modules—planning engines, memory buffers, and tool‑specific interfaces—each engineered separately and then orchestrated by a higher‑level controller. While this modularity offers clarity, it also introduces brittleness: the agent must learn to coordinate across heterogeneous components, and any failure in one module can cascade into catastrophic performance drops.

The work we explore in this post takes a different stance. Instead of treating planning, memory, and tool use as distinct services, it embeds them directly into a single neural architecture. By training the entire system end‑to‑end with reinforcement learning, the agent learns to internalize the logic of planning, the persistence of memory, and the procedural steps required to employ multiple tools. The result is a compact, model‑native agent that can tackle arithmetic reasoning tasks—an archetypal benchmark for symbolic manipulation—while simultaneously developing a reusable internal representation of the problem space.

What makes this approach compelling is its scalability. The same architecture can, in principle, be extended to more complex domains such as natural language instruction following, robotic manipulation, or multi‑step scientific hypothesis generation. Moreover, the end‑to‑end training paradigm eliminates the need for hand‑crafted reward shaping or curriculum design beyond a simple progression of task difficulty. In the sections that follow, we unpack the key components of this architecture, illustrate how they interact during training, and discuss the empirical results that demonstrate its effectiveness.

Main Content

The Stage‑Aware Actor‑Critic Backbone

At the heart of the model‑native agent lies a stage‑aware actor‑critic network. Unlike conventional actor‑critic designs that treat the entire episode as a single homogeneous sequence, this architecture partitions the episode into discrete stages—each corresponding to a conceptual sub‑task such as "read the problem," "compute intermediate values," or "verify the final answer." The network maintains a hidden state that evolves across stages, allowing it to retain context and reason about future steps.

The actor component outputs a probability distribution over a vocabulary of primitive actions: arithmetic operations (add, subtract, multiply, divide), memory operations (store, retrieve), and tool invocations (e.g., a symbolic algebra engine). The critic estimates the expected return from each hidden state, providing a baseline that reduces variance in policy gradients. Crucially, both actor and critic share the same underlying transformer encoder, ensuring that the representation learned for planning also informs value estimation.
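To make the shared‑encoder design concrete, here is a minimal sketch of how such a stage‑aware actor‑critic could be wired up in PyTorch. The class name, the stage embedding, and all hyperparameters are illustrative assumptions rather than details taken from the original implementation.

```python
# Minimal sketch of a stage-aware actor-critic with a shared transformer encoder.
# Names and hyperparameters are illustrative, not the authors' exact architecture.
import torch
import torch.nn as nn

class StageAwareActorCritic(nn.Module):
    def __init__(self, vocab_size: int, num_actions: int, num_stages: int,
                 d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.stage_emb = nn.Embedding(num_stages, d_model)   # tags each step with its conceptual stage
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.actor_head = nn.Linear(d_model, num_actions)    # distribution over primitive actions
        self.critic_head = nn.Linear(d_model, 1)             # state-value baseline

    def forward(self, tokens: torch.Tensor, stage_ids: torch.Tensor):
        # tokens: (batch, seq) token ids; stage_ids: (batch, seq) stage index of each step
        h = self.encoder(self.token_emb(tokens) + self.stage_emb(stage_ids))
        last = h[:, -1]                                       # summary of the current state
        policy = torch.distributions.Categorical(logits=self.actor_head(last))
        value = self.critic_head(last).squeeze(-1)
        return policy, value
```

Because both heads read the same encoder output, gradients from the value estimate and the policy shape a single representation, which is the property the shared‑backbone design is meant to exploit.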

During training, the agent receives a sparse reward only when the final answer matches the ground truth. This sparse signal forces the network to discover internal heuristics for intermediate reasoning steps. The stage‑aware design mitigates the credit‑assignment problem by aligning the hidden state transitions with logical milestones, thereby making it easier for the critic to predict returns.
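A minimal sketch of what this sparse‑reward setup could look like in code, reusing the network above. The Monte‑Carlo return and the loss weighting are assumptions for illustration, not the authors' exact training recipe.

```python
# Sketch of the sparse terminal reward and an advantage-based actor-critic loss.
# Reward shaping is intentionally absent: only the final answer is rewarded.
import torch

def terminal_reward(predicted: int, ground_truth: int) -> float:
    return 1.0 if predicted == ground_truth else 0.0

def actor_critic_loss(log_probs, values, reward, gamma: float = 1.0):
    # Monte-Carlo return: the single terminal reward discounted back to every step.
    T = len(values)
    returns = torch.tensor([reward * gamma ** (T - 1 - t) for t in range(T)])
    values = torch.stack(values)
    advantages = returns - values.detach()            # critic baseline reduces gradient variance
    policy_loss = -(torch.stack(log_probs) * advantages).sum()
    value_loss = (returns - values).pow(2).sum()
    return policy_loss + 0.5 * value_loss
```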

Curriculum‑Driven Complexity Scaling

A major obstacle in training agents for symbolic reasoning is the combinatorial explosion of possible problem structures. To address this, the authors employ a curriculum that gradually increases the difficulty of arithmetic puzzles. Early stages involve single‑digit addition or subtraction, while later stages introduce multi‑digit multiplication, division with remainders, and nested expressions.

The curriculum is not hand‑tuned; instead, it is driven by a simple success‑rate threshold. When the agent consistently solves a given difficulty level with high accuracy, the environment generator introduces more complex expressions. This adaptive approach ensures that the agent is never confronted with problems that are too hard to learn from scratch, while also preventing it from overfitting to trivial patterns.
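The promotion rule might be implemented along the following lines; the 90 % threshold, window size, and level names are placeholders for whatever values the authors actually used.

```python
# Sketch of a success-rate-driven curriculum: promote to the next difficulty
# level once recent accuracy clears a threshold. All constants are assumptions.
from collections import deque

class AdaptiveCurriculum:
    def __init__(self, levels, threshold: float = 0.9, window: int = 200):
        self.levels = levels                  # e.g. ["1-digit add/sub", ..., "nested expressions"]
        self.level_idx = 0
        self.threshold = threshold
        self.recent = deque(maxlen=window)    # rolling record of solved / not solved

    def record(self, solved: bool):
        self.recent.append(solved)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.threshold \
                and self.level_idx < len(self.levels) - 1:
            self.level_idx += 1               # introduce harder expressions
            self.recent.clear()

    def current_level(self):
        return self.levels[self.level_idx]
```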

Internal Planning Through Learned Attention

One of the most striking observations in the experiments is the emergence of attention patterns that resemble human‑like planning. When the agent processes a multi‑step expression, the transformer’s self‑attention layers focus on the sub‑expression that will be evaluated next, effectively simulating a left‑to‑right evaluation order. This emergent behavior is not explicitly programmed; it arises because the agent learns to maximize reward by correctly sequencing operations.

To illustrate, consider the problem "(3 + 5) × 2." The agent first attends to the addition sub‑expression, stores the intermediate result in memory, then retrieves it to perform the multiplication. The hidden state at each stage captures the current sub‑problem and the plan for the next operation. Visualizing the attention weights reveals a clear hierarchy: the first attention peak corresponds to the innermost parentheses, followed by a broader focus on the multiplication operator.
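As a rough illustration, the agent's emitted action sequence for this example might look like the trace below. The action names (ADD, STORE, RETRIEVE, MUL, ANSWER) are stand‑ins for the paper's primitive‑action vocabulary, not its actual token set.

```python
# Hypothetical action trace for "(3 + 5) * 2" under the primitive-action vocabulary
# described above; names and memory slots are illustrative.
trace = [
    ("ADD", 3, 5),           # evaluate the innermost parentheses first
    ("STORE", "slot_0"),     # persist the intermediate result 8 in memory
    ("RETRIEVE", "slot_0"),  # bring it back for the outer operation
    ("MUL", 8, 2),           # final multiplication
    ("ANSWER", 16),          # emit the answer; the sparse reward is granted only here
]
```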

Multi‑Tool Reasoning in a Unified Framework

Beyond arithmetic, the architecture supports the invocation of external tools—such as a symbolic algebra solver or a lookup table—through a unified action space. Each tool is represented as a token in the action vocabulary, and the network learns when to call it based on the context. For example, when encountering a division that would produce a non‑integer result, the agent may choose to call a tool that performs exact rational arithmetic.

Because the tool calls are treated as actions, the agent learns to balance the cost of invoking a tool against the benefit of obtaining a precise intermediate value. In practice, this leads to a hybrid strategy: the agent performs simple arithmetic locally, but defers to the tool for more complex sub‑problems. This dynamic allocation of computational resources mirrors how humans approach problem solving, using mental calculation for easy steps and consulting external references for harder ones.
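One plausible way to realize tool calls as ordinary actions is sketched below, with Python's `fractions.Fraction` standing in for the exact‑arithmetic tool. The dispatch function, the environment hooks, and the lookup table are hypothetical names introduced for illustration.

```python
# Sketch of routing tool tokens to external calls while keeping local arithmetic
# inside the policy. The specific tools and environment hooks are assumptions.
from fractions import Fraction

LOOKUP_TABLE = {}                                  # hypothetical precomputed value table

TOOLS = {
    "EXACT_DIV": lambda a, b: Fraction(a, b),      # exact rational division tool
    "LOOKUP": lambda key: LOOKUP_TABLE[key],
}

def dispatch(action: str, args: tuple, env):
    """Tool tokens trigger external calls; add/sub/mul stay with the model itself."""
    if action in TOOLS:
        result = TOOLS[action](*args)              # invoking a tool carries an action cost
        return env.observe(result)                 # hypothetical hook returning the observation
    return env.apply_local(action, args)           # local arithmetic executed by the policy
```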

Empirical Results and Ablation Studies

The authors evaluate the model‑native agent on a benchmark suite of arithmetic reasoning tasks ranging from 1‑digit to 5‑digit expressions. The agent achieves an overall accuracy of 94 % on the hardest test set, outperforming baseline modular pipelines that rely on separate planners and memory modules. Ablation studies reveal that removing the stage‑aware design reduces accuracy by 7 %, underscoring its importance for credit assignment.

Another key finding is the robustness of the learned internal memory. When the agent is presented with a sequence of problems that share sub‑expressions, it efficiently reuses stored intermediate results, reducing the number of actions required per problem. This reuse is quantified by a 15 % reduction in average episode length compared to a baseline that recomputes every sub‑expression.

Conclusion

The exploration of a model‑native agent that learns planning, memory, and multi‑tool reasoning end‑to‑end marks a significant step toward more autonomous and efficient AI systems. By embedding these capabilities into a single neural architecture and training it with reinforcement learning, the agent sidesteps the brittleness of modular pipelines and discovers human‑like strategies for problem solving. The stage‑aware actor‑critic backbone, curriculum‑driven training, and unified action space collectively enable the agent to tackle complex arithmetic reasoning tasks with high accuracy.

Beyond arithmetic, the principles demonstrated here have broad applicability. Any domain that requires sequential decision making, intermediate computation, and selective tool use—such as code synthesis, scientific discovery, or interactive dialogue—could benefit from a model‑native approach. Future work may investigate scaling the architecture to larger transformer models, integrating richer tool ecosystems, or applying the same training paradigm to real‑world robotics scenarios.

Call to Action

If you are a researcher or practitioner interested in pushing the boundaries of autonomous reasoning, I encourage you to experiment with the model‑native agent framework described here. Start by reproducing the arithmetic benchmark, then extend the action vocabulary to include domain‑specific tools. Share your findings on GitHub or in a preprint, and let the community help refine this promising direction. Together, we can move closer to AI agents that reason, remember, and act with the fluidity of human cognition.
