Introduction
Reinforcement learning has moved beyond classic control problems and is now a powerful tool for financial decision making. While many tutorials focus on stock‑price prediction or portfolio optimisation, the real challenge lies in creating an environment that faithfully captures the stochasticity, transaction costs, and regulatory constraints of a real market. In this post we walk through the entire pipeline: designing a gym‑style trading environment with a realistic observation space and reward structure, training two of the most widely used policy‑gradient algorithms (PPO and A2C) with Stable‑Baselines3, writing custom callbacks that record episode‑level metrics, and finally comparing the agents by visualising learning curves, action distributions, and cumulative returns. By the end of the article you will have a reusable codebase that can be adapted to any asset class, and a clear understanding of how algorithmic choices influence performance in a realistic trading setting.
The tutorial assumes familiarity with Python, PyTorch, and the OpenAI Gym API. It also presumes you have a basic grasp of Markov decision processes and the fundamentals of policy optimisation. If you are new to Stable‑Baselines3, the library’s documentation is a great starting point, but we will keep the focus on the trading domain rather than the intricacies of the framework.
Our goal is to illustrate that the quality of the environment and the monitoring infrastructure can be as critical as the algorithm itself. A poorly defined reward or an uninformative observation space can mislead the agent into learning sub‑optimal strategies, no matter how sophisticated the policy network. Conversely, a well‑structured environment paired with thoughtful callbacks can accelerate convergence and provide actionable insights.
Designing a Custom Trading Environment
A trading environment must expose a state that captures market history, portfolio holdings, and any exogenous signals. We built a Gym‑compatible class that accepts a Pandas DataFrame of OHLCV data and a set of technical indicators. The observation space is a Box of shape (n_features,) where each feature is normalised to the range [-1, 1]. The action space is a continuous Box representing the proportion of capital to allocate to each asset, bounded between -1 (short) and 1 (long). The reward is defined as the daily log‑return of the portfolio after accounting for a realistic transaction fee of 0.1 %. By encapsulating the environment logic in a single class, we can easily swap in different data feeds or modify the fee schedule without touching the training loop.
The environment also implements a step method that applies the chosen action, updates the portfolio, and returns the next observation, reward, done flag, and a diagnostic info dictionary. The done flag is triggered when the episode reaches the end of the dataset or when the portfolio equity falls below a threshold, mimicking a stop‑loss condition.
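The class below is a minimal sketch of such an environment for a single asset, assuming a raw close‑price series for P&L and a DataFrame of indicator features already scaled to [-1, 1]. The class name TradingEnv, the 0.1 % fee default, and the equity threshold are illustrative choices rather than the exact code from our repository.

```python
import gym
import numpy as np
import pandas as pd
from gym import spaces


class TradingEnv(gym.Env):
    """Single-asset trading environment (illustrative sketch)."""

    def __init__(self, prices: pd.Series, features: pd.DataFrame,
                 fee: float = 0.001, min_equity: float = 0.5):
        super().__init__()
        self.prices = prices.reset_index(drop=True)       # raw close prices for P&L
        self.features = features.reset_index(drop=True)   # indicators already scaled to [-1, 1]
        self.fee = fee                                     # 0.1 % per unit of turnover
        self.min_equity = min_equity                       # stop-loss threshold on equity
        n_features = features.shape[1] + 1                 # market features + current position
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(n_features,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

    def reset(self):
        self.t = 0
        self.position = 0.0        # fraction of capital, -1 = fully short, 1 = fully long
        self.equity = 1.0
        return self._obs()

    def step(self, action):
        target = float(np.clip(action[0], -1.0, 1.0))
        cost = self.fee * abs(target - self.position)      # transaction cost on turnover
        self.position = target

        # Daily log-return of the portfolio, net of fees.
        log_ret = np.log(self.prices[self.t + 1] / self.prices[self.t])
        reward = self.position * log_ret - cost
        self.equity *= float(np.exp(reward))

        self.t += 1
        done = self.t >= len(self.prices) - 1 or self.equity < self.min_equity
        info = {"equity": self.equity, "position": self.position}
        return self._obs(), float(reward), done, info

    def _obs(self):
        obs = self.features.iloc[self.t].to_numpy(dtype=np.float32)
        return np.append(obs, np.float32(self.position))
```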
Choosing and Configuring Stable‑Baselines3 Algorithms
Stable‑Baselines3 offers a clean API for training and evaluating agents. We selected Proximal Policy Optimisation (PPO) and Advantage Actor‑Critic (A2C). Both are on‑policy actor‑critic methods, but they use each batch of experience very differently: PPO optimises a clipped surrogate objective over several minibatch epochs per rollout, whereas A2C performs a single synchronous gradient update. Both handle continuous action spaces well and are common baselines in financial reinforcement‑learning studies.
For each algorithm we instantiated a policy network with two hidden layers of 256 units, ReLU activations, and a tanh squashing of the output so that actions stay within the [-1, 1] bounds of the action space. The learning rates were set to 3e-4 for PPO and 7e-4 for A2C, the empirical defaults that balance exploration and stability. We wrapped the environment in VecNormalize to standardise observations and rewards, which is essential when dealing with non‑stationary market data.
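A configuration along these lines might look as follows. It reuses the hypothetical TradingEnv sketched above and substitutes a small synthetic price/feature set as placeholder data; treat it as a sketch under those assumptions rather than the exact training script.

```python
import numpy as np
import pandas as pd
import torch.nn as nn
from stable_baselines3 import A2C, PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Placeholder data -- in the post this comes from the OHLCV/indicator pipeline.
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))
features = pd.DataFrame({"ret_1d": np.tanh(rng.normal(0, 1, 500)),
                         "rsi": rng.uniform(-1, 1, 500)})

# Two hidden layers of 256 units with ReLU activations for actor and critic.
policy_kwargs = dict(net_arch=[256, 256], activation_fn=nn.ReLU)

def make_vec():
    venv = DummyVecEnv([lambda: TradingEnv(prices, features)])
    # Standardise observations and rewards, important for non-stationary market data.
    return VecNormalize(venv, norm_obs=True, norm_reward=True, clip_obs=10.0)

ppo_model = PPO("MlpPolicy", make_vec(), learning_rate=3e-4,
                policy_kwargs=policy_kwargs, verbose=0)
a2c_model = A2C("MlpPolicy", make_vec(), learning_rate=7e-4,
                policy_kwargs=policy_kwargs, verbose=0)
```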
Implementing Custom Callbacks for Performance Tracking
Monitoring training progress in a trading context requires more than the default episode reward. We created a TradingCallback that logs cumulative returns, Sharpe ratio, maximum drawdown, and the number of trades executed per episode. The callback hooks into the on_step and on_rollout_end events of Stable‑Baselines3, aggregating metrics over each rollout and persisting them to a Pandas DataFrame. This data structure can then be plotted to reveal learning curves and to compare the two agents side‑by‑side.
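A condensed version of such a callback could look like this. The info keys equity and position match the sketch environment above and are assumptions, as is the single‑environment trade‑counting logic; the real implementation logs a few more fields.

```python
import numpy as np
import pandas as pd
from stable_baselines3.common.callbacks import BaseCallback


class TradingCallback(BaseCallback):
    """Aggregate trading-specific metrics over each rollout (single-env assumption)."""

    def __init__(self, verbose: int = 0):
        super().__init__(verbose)
        self.records = []                       # one row of metrics per rollout
        self._rewards, self._equity = [], []
        self._trades, self._last_pos = 0, None

    def _on_step(self) -> bool:
        # self.locals is filled by Stable-Baselines3 while collecting rollouts.
        self._rewards.extend(self.locals["rewards"].tolist())
        for info in self.locals["infos"]:
            if "equity" in info:
                self._equity.append(info["equity"])
            pos = info.get("position")
            if pos is not None and self._last_pos is not None and pos != self._last_pos:
                self._trades += 1               # count a trade whenever the position changes
            self._last_pos = pos
        return True                             # returning False would abort training

    def _on_rollout_end(self) -> None:
        rewards = np.asarray(self._rewards)
        equity = np.asarray(self._equity) if self._equity else np.array([1.0])
        running_max = np.maximum.accumulate(equity)
        self.records.append({
            "timesteps": self.num_timesteps,
            "cum_return": float(equity[-1] - 1.0),
            "sharpe": float(rewards.mean() / (rewards.std() + 1e-8) * np.sqrt(252)),
            "max_drawdown": float(((equity - running_max) / running_max).min()),
            "trades": self._trades,
        })
        self._rewards, self._equity, self._trades = [], [], 0

    @property
    def history(self) -> pd.DataFrame:
        """Per-rollout metrics as a DataFrame, ready for plotting."""
        return pd.DataFrame(self.records)
```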
In addition to the custom callback, we leveraged the built‑in EvalCallback to run periodic rollouts on a hold‑out validation set. By evaluating on unseen data, we mitigated the risk of overfitting to the training window, a common pitfall in algorithmic trading.
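Wiring up EvalCallback to a hold‑out window might look roughly like this, where val_prices and val_features stand in for the validation split and ppo_model comes from the earlier sketch.

```python
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Hold-out validation window (val_prices / val_features) kept out of training.
eval_env = DummyVecEnv([lambda: TradingEnv(val_prices, val_features)])
# In practice the VecNormalize statistics should be synced from the training wrapper.
eval_env = VecNormalize(eval_env, norm_obs=True, norm_reward=False, training=False)

eval_callback = EvalCallback(
    eval_env,
    n_eval_episodes=5,              # average over several passes of the window
    eval_freq=10_000,               # evaluate every 10k training steps
    best_model_save_path="./models/",
    deterministic=True,
)

ppo_model.learn(total_timesteps=500_000, callback=[TradingCallback(), eval_callback])
```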
Training Multiple Agents in Parallel
Training the two agents sequentially would roughly double the wall‑clock time. To accelerate experimentation, we used Stable‑Baselines3's SubprocVecEnv to give each algorithm several environment copies running in parallel worker processes, and launched the PPO and A2C training runs as separate processes. A simple scheduler starts both runs for a fixed number of timesteps, then gathers the callback logs and aggregates the statistics.
This parallel approach not only speeds up the training pipeline but also keeps the comparison fair: both agents are trained on the same historical data window, so performance differences reflect the algorithms themselves rather than the market conditions each agent was exposed to.
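One way to realise this, sketched below under the assumptions of the earlier snippets, is to give each algorithm its own SubprocVecEnv of environment copies and launch the two learn() calls in separate OS processes; the run_experiment helper and the timestep budget are illustrative, not the exact orchestration code from the repository.

```python
from multiprocessing import Process

import torch.nn as nn
from stable_baselines3 import A2C, PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv, VecNormalize


def run_experiment(algo_cls, learning_rate, total_timesteps=500_000, n_envs=4):
    """Train one agent on its own set of parallel environment copies."""
    # prices / features / TradingEnv / TradingCallback come from the earlier sketches.
    venv = make_vec_env(lambda: TradingEnv(prices, features), n_envs=n_envs,
                        seed=42, vec_env_cls=SubprocVecEnv)
    venv = VecNormalize(venv, norm_obs=True, norm_reward=True)
    model = algo_cls("MlpPolicy", venv, learning_rate=learning_rate,
                     policy_kwargs=dict(net_arch=[256, 256], activation_fn=nn.ReLU),
                     verbose=0)
    model.learn(total_timesteps=total_timesteps, callback=TradingCallback())
    model.save(f"{algo_cls.__name__.lower()}_trading")


if __name__ == "__main__":
    # Launch both training runs at the same time in separate OS processes.
    jobs = [Process(target=run_experiment, args=(PPO, 3e-4)),
            Process(target=run_experiment, args=(A2C, 7e-4))]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
```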
Evaluating and Visualising Agent Performance
After training, we evaluated each agent on a separate test window spanning the last six months of data. The evaluation script computed cumulative returns, annualised volatility, and the Sortino ratio for each agent. Visualisations were generated using Matplotlib and Seaborn, producing side‑by‑side plots of equity curves, action histograms, and heatmaps of feature importance derived from the policy’s value network.
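As an illustration, the core risk and return metrics can be computed from a test‑window equity curve with a few lines of pandas; the 252‑period annualisation assumes daily bars, and a zero minimum acceptable return is used for the Sortino ratio.

```python
import numpy as np
import pandas as pd


def performance_summary(equity: pd.Series, periods_per_year: int = 252) -> dict:
    """Cumulative return, annualised volatility and Sortino ratio from an equity curve."""
    returns = equity.pct_change().dropna()

    cumulative_return = equity.iloc[-1] / equity.iloc[0] - 1.0
    annual_vol = returns.std() * np.sqrt(periods_per_year)

    # Sortino: mean return over downside deviation (only negative returns count).
    downside = returns[returns < 0]
    downside_dev = np.sqrt((downside ** 2).mean()) if len(downside) else np.nan
    sortino = returns.mean() / downside_dev * np.sqrt(periods_per_year)

    return {
        "cumulative_return": float(cumulative_return),
        "annualised_volatility": float(annual_vol),
        "sortino_ratio": float(sortino),
    }
```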
The visual analysis revealed that PPO consistently achieved higher cumulative returns but with a slightly higher volatility compared to A2C. A2C’s smoother action distribution suggested a more conservative strategy, which translated into a lower maximum drawdown. These insights would be difficult to glean from raw reward numbers alone, underscoring the importance of comprehensive monitoring.
Comparing PPO and A2C in a Trading Context
The final comparison involved a statistical analysis of the two agents' performance metrics. A paired t‑test on the Sharpe ratios of the evaluation runs indicated that PPO's advantage was statistically significant at the 5 % level. Once maximum drawdown and volatility were also taken into account, however, the gap narrowed, suggesting that the choice between PPO and A2C may ultimately depend on the trader's risk tolerance.
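Given per‑episode Sharpe ratios for the two agents over the same evaluation episodes, the test itself is a one‑liner with SciPy; the arrays below are placeholder values for illustration, not our actual results.

```python
from scipy import stats

# One Sharpe ratio per evaluation episode, paired across the two agents.
# Placeholder values standing in for the metrics logged by TradingCallback.
sharpe_ppo = [1.31, 0.94, 1.12, 1.05, 1.27]
sharpe_a2c = [1.02, 0.88, 0.97, 1.01, 1.10]

t_stat, p_value = stats.ttest_rel(sharpe_ppo, sharpe_a2c)
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")   # significant if p < 0.05
```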
Moreover, the training curves showed that PPO reached its performance plateau in fewer training steps, while A2C exhibited a steadier but slower learning trajectory. This trade‑off between speed and stability is a recurring theme in reinforcement learning for finance.
Conclusion
Building, training, and comparing reinforcement learning agents in a custom trading environment is a multi‑faceted endeavour that extends beyond algorithm selection. The quality of the environment, the design of the reward function, and the robustness of the monitoring infrastructure collectively determine the success of the learning process. By integrating Stable‑Baselines3 with a well‑structured gym environment and custom callbacks, we were able to train both PPO and A2C agents efficiently, evaluate them rigorously, and derive actionable insights from their behaviour.
The comparative analysis demonstrates that PPO can deliver higher returns in a short‑term horizon, but at the cost of increased volatility. A2C, on the other hand, offers a more conservative profile with lower drawdowns. Depending on the investment mandate—whether it prioritises aggressive growth or capital preservation—either algorithm could be preferable. Importantly, the framework presented here is modular; swapping in a different policy architecture, adjusting transaction costs, or extending the observation space to include macroeconomic indicators can be done with minimal code changes.
Ultimately, the real value lies in the ability to iterate quickly: design a new environment, tweak a hyperparameter, and observe the impact on the agent’s performance in a matter of minutes. This agility is what sets reinforcement learning apart from traditional back‑testing pipelines.
Call to Action
If you’re ready to take your algorithmic trading strategy to the next level, start by cloning the repository we’ve made available on GitHub. The codebase includes the custom environment, training scripts, and a suite of visualisation notebooks that walk you through each step of the process. Experiment with different data feeds, try adding a risk‑parity constraint to the reward, or even replace PPO with a newer algorithm like SAC. Share your findings on our community forum or contribute a pull request to improve the library. By collaborating, we can push the boundaries of what reinforcement learning can achieve in finance, turning data into profitable decisions.