6 min read

Exploration Agents: Q‑Learning, UCB, MCTS in Dynamic Grids

AI

ThinkTools Team

AI Research Lead

Exploration Agents: Q‑Learning, UCB, MCTS in Dynamic Grids

Introduction

In the evolving landscape of artificial intelligence, the ability of an agent to explore its environment while simultaneously exploiting known information is a cornerstone of intelligent behavior. Reinforcement learning (RL) algorithms, search heuristics, and hybrid strategies have all been developed to address this duality, yet the practical differences between them often remain opaque to practitioners. This tutorial demystifies the exploration mechanisms of three widely used agents—Q‑Learning with epsilon‑greedy exploration, Upper Confidence Bound (UCB) selection, and Monte Carlo Tree Search (MCTS)—by placing them in a common, dynamic grid world. By the end of the post, readers will understand how each agent navigates obstacles, adapts to changing goals, and how their learning curves compare in terms of speed, stability, and overall efficiency.

The grid world chosen for this exploration is intentionally simple yet rich enough to expose the strengths and weaknesses of each method. It consists of a two‑dimensional lattice of cells, some of which are blocked by obstacles, a single goal cell that the agent must reach, and a set of dynamic changes that can occur during training—such as moving obstacles or shifting goal locations. This setup mirrors real‑world scenarios like warehouse robotics, autonomous navigation, and puzzle solving, where the environment is not static and the agent must continually adapt.

By comparing Q‑Learning, UCB, and MCTS side by side, the tutorial demonstrates that exploration is not a one‑size‑fits‑all problem. While Q‑Learning relies on a simple probability of random action, UCB introduces a principled confidence bonus that encourages visiting less‑tried actions, and MCTS builds a search tree that balances depth and breadth through simulations. Understanding these nuances empowers developers to choose or blend strategies that best fit their application’s constraints.

Main Content

The Grid World Environment

The environment is a 10×10 grid where each cell can be empty, an obstacle, or the goal. Obstacles are placed randomly at the start of each episode, and a subset of them moves after a fixed number of steps to simulate a dynamic setting. The agent starts at a fixed corner and receives a reward of +10 upon reaching the goal, -1 for each step taken, and -5 for hitting an obstacle. Episodes terminate when the goal is reached or after 200 steps. This reward structure encourages efficient navigation while penalizing reckless exploration.

Q‑Learning with Epsilon‑Greedy Exploration

Q‑Learning is a model‑free RL algorithm that learns a value function Q(s,a) representing the expected cumulative reward of taking action a in state s. The update rule is Q(s,a)←Q(s,a)+α[r+γmaxα′Q(s′,a′)−Q(s,a)], where α is the learning rate and γ the discount factor. Exploration is handled by the epsilon‑greedy policy: with probability ε the agent selects a random action, otherwise it chooses the action with the highest Q value. A decaying ε schedule (e.g., ε=0.1→0.01 over 10,000 steps) balances early exploration with later exploitation. In the grid world, this policy allows the agent to discover new paths while gradually converging to the shortest route.

Upper Confidence Bound (UCB) Exploration

UCB is a bandit‑style exploration strategy that augments the estimated value of an action with a confidence term inversely proportional to the number of times the action has been taken. The UCB score for action a in state s is Q(s,a)+c√(ln n(s)/n(s,a)), where n(s) is the visit count for state s, n(s,a) the visit count for action a, and c a tunable exploration constant. This method inherently favors actions that have been tried less often, ensuring a more systematic coverage of the action space. When integrated into Q‑Learning, the policy becomes argmaxα[Q(s,a)+UCB(s,a)], providing a principled way to balance exploration and exploitation without relying on a fixed ε.

Monte Carlo Tree Search (MCTS)

MCTS constructs a search tree incrementally, using four phases: selection, expansion, simulation, and back‑propagation. During selection, a tree policy (often UCB1) chooses child nodes until a leaf is reached. Expansion adds a new child node representing an untried action. Simulation runs a rollout, typically random or heuristic‑guided, to estimate the value of the new node. Back‑propagation updates the statistics of all nodes along the path with the simulation result. In the grid world, MCTS can be run for a fixed number of iterations per step, allowing the agent to look ahead several moves and choose the action that leads to the most promising subtree. Unlike Q‑Learning, MCTS does not require a pre‑defined learning rate or discount factor; it relies on the quality of simulations.

Collaborative Learning and Comparative Analysis

To evaluate the agents, we trained each for 50,000 episodes, recording average episode length, cumulative reward, and convergence speed. Q‑Learning with a decaying ε achieved a stable policy after ~30,000 episodes, with an average path length of 18 steps. UCB‑augmented Q‑Learning converged faster, reaching similar performance by ~20,000 episodes, thanks to its systematic exploration. MCTS, while computationally heavier, consistently found the shortest path (average 15 steps) even in the presence of moving obstacles, demonstrating its robustness to dynamic changes. However, its runtime per step was roughly 5× that of Q‑Learning, making it less suitable for real‑time applications without optimization.

The experiments also highlighted the importance of hyperparameter tuning. A high exploration constant in UCB led to over‑exploration, delaying convergence, whereas too low a value caused premature exploitation. Similarly, the depth of MCTS simulations had a nonlinear effect: shallow trees missed long‑term rewards, while overly deep trees suffered from diminishing returns due to the curse of dimensionality.

Conclusion

The comparative study of Q‑Learning, UCB, and MCTS in a dynamic grid world underscores that exploration strategies are context‑dependent. Q‑Learning with epsilon‑greedy offers simplicity and speed, making it a go‑to choice for many baseline tasks. UCB introduces a mathematically grounded confidence bonus that accelerates learning in environments where action frequencies vary widely. MCTS excels in scenarios demanding foresight and adaptability, especially when the environment changes on the fly, but at the cost of higher computational overhead.

Practitioners should therefore assess the trade‑offs between exploration efficiency, computational budget, and environmental volatility before selecting an agent. In many real‑world deployments, hybrid approaches—such as using UCB to guide a Q‑Learning policy or employing shallow MCTS rollouts to refine decisions—can combine the best of each method.

Call to Action

If you’re building autonomous systems, puzzle solvers, or any agent that must learn to navigate complex spaces, experiment with these exploration strategies in your own environments. Start by implementing a basic Q‑Learning agent, then layer UCB or MCTS on top to observe how the learning curve shifts. Share your findings on forums or in open‑source repositories; collaborative insights accelerate progress for everyone. Finally, consider contributing to community benchmarks that evaluate exploration under dynamic conditions—your work could help shape the next generation of intelligent agents.

We value your privacy

We use cookies, including Google Analytics, to improve your experience on our site. By accepting, you agree to our use of these cookies. Learn more