Episodic Policy Gradient

The episodic policy gradient is a modified version of the Policy Gradient for Portfolio Optimization in which the training sequence is merged with the execution of the environment’s episode. In the original policy gradient, the agent executes a full episode and then, it not only interacts with it (except if performing a logging operation in the training environment). This is not a problem in deterministic environments such as PortfolioOptimizationEnv, but it can be if the environment implements methods that simulate dynamic markets (such as state noise).

Therefore, to address this limitation, the episodic policy gradient algorithm was developed, changing the training sequence so that the the 6 steps of the training sequence of the Policy Gradient for Portfolio Optimization are performed after the execution of every simulation step. Thus, the following training loop is implemented:

A simulation step is run, updating the replay buffer experiences.
The 6 training steps listed in Policy Gradient for Portfolio Optimization are executed gradient_steps times (this parameter can be set in the train method).

This algorithm is very similar to the test sequence of Policy Gradient for Portfolio Optimization. Additionally, The testing sequence, validation sequence is identical to the original Policy Gradient approach.

Note

The algorithm is called episodic because it needs to run for episodes number of episodes.

Logging

A difference between this algorithm and Policy Gradient for Portfolio Optimization is that there is no need to set a validation period, since several episodes are run and the logging is performed automatically.