rlportfolio.algorithm.policy_gradient module

class PolicyGradient

Bases: object

Class implementing a policy gradient algorithm to train portfolio optimization agents. It implements the work introduced in the following article: https://doi.org/10.48550/arXiv.1706.10059.

Note

During testing, the agent is optimized through online learning: the parameters of the policy are updated repeatedly after a constant period of time. To disable this, set the learning rate to 0.

train_env: Environment used to train the agent

train_policy: Policy used in training.

test_env: Environment used to test the agent.

test_policy: Policy after test online learning.

__init__(env: ~gymnasium.core.Env, policy: type[~torch.nn.modules.module.Module] = <class 'rlportfolio.policy.eiie.EIIE'>, policy_kwargs: dict[str, ~typing.Any] = None, replay_buffer: type[~rlportfolio.algorithm.buffers.replay_buffers.SequentialReplayBuffer] = <class 'rlportfolio.algorithm.buffers.replay_buffers.GeometricReplayBuffer'>, batch_size: int = 100, sample_bias: float = 1.0, sample_from_start: bool = False, lr: float = 0.001, polyak_avg_tau: float = 1, action_noise: str | None = None, action_epsilon: float | ~typing.Callable[[int], float] = 0, action_alpha: float | ~typing.Callable[[int], float] = 1.0, parameter_noise: float | ~typing.Callable[[int], float] = 0, optimizer: type[~torch.optim.optimizer.Optimizer] = <class 'torch.optim.adamw.AdamW'>, use_tensorboard: bool = False, summary_writer_kwargs: dict[str, ~typing.Any] = None, device: str = 'cpu') → PolicyGradient

Initializes Policy Gradient for portfolio optimization.

Parameters:

env – Training environment.
policy – Policy architecture to be used.
policy_kwargs – Arguments to be used in the policy network.
validation_env – Validation environment.
validation_kwargs – Arguments to be used in the validation step.
replay_buffer – Class of replay buffer to be used to sample experiences in training.
batch_size – Batch size to train neural network.
sample_bias – Probability of success of a trial in a geometric distribution. Only used if buffer is GeometricReplayBuffer.
sample_from_start – If True, will choose a sequence starting from the start of the buffer. Otherwise, it will start from the end. Only used if buffer is GeometricReplayBuffer.
lr – Policy neural network learning rate.
polyak_avg_tau – Tau parameter to be used in Polyak averaging (between 0 and 1, inclusive). The larger the value, the more strongly new training steps influence the target policy.
action_noise – Name of the model to be used in the action noise. The options are “logarithmic”, “logarithmic_const”, “dirichlet” or None. If None, no action noise is applied.
action_epsilon – Noise logarithmic parameter (bigger than or equal to 0) to be applied to performed actions during training. It can be a value or a function whose argument is the number of training episodes/steps and that outputs the noise value.
action_alpha – Alpha parameter (bigger than 1) to be used to create a Dirichlet distribution in the “dirichlet” noise model. It can be a value or a function whose argument is the number of training episodes/steps and that outputs the noise value.
parameter_noise – Noise parameter (bigger than or equal to 0) to be applied to the parameters of the policy network during training. It can be a value or a function whose argument is the number of training episodes/steps and that outputs the noise value. Currently not implemented.
optimizer – Optimizer of neural network.
use_tensorboard – If True, training logs will be added to TensorBoard.
summary_writer_kwargs – Arguments to be used in PyTorch’s tensorboard summary writer.
device – Device where neural network is run.
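As an illustration of the polyak_avg_tau parameter, here is a minimal sketch of a Polyak (soft) update on plain lists of numbers. It assumes the common rule target = tau * trained + (1 - tau) * target; the function name polyak_update is hypothetical, and the library's own update on network parameters may differ in detail.

```python
def polyak_update(target, trained, tau):
    """Blend trained parameters into target parameters.

    Assumed rule (not taken from the library source):
    target <- tau * trained + (1 - tau) * target
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [
        tau * w_new + (1.0 - tau) * w_old
        for w_old, w_new in zip(target, trained)
    ]


target = [0.0, 1.0]   # current target policy parameters
trained = [1.0, 3.0]  # parameters after a training step

full_copy = polyak_update(target, trained, tau=1.0)  # target follows training exactly
half_way = polyak_update(target, trained, tau=0.5)   # target moves halfway
frozen = polyak_update(target, trained, tau=0.0)     # target never changes
```

Under this rule, the default polyak_avg_tau of 1 makes the target policy identical to the trained policy after every step, while smaller values make it change more slowly.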

test(env: Env, gradient_steps: int = 1, use_train_buffer: bool = False, update_buffer: bool = True, policy: Module | None = None, replay_buffer: SequentialReplayBuffer | None = None, batch_size: int | None = None, sample_bias: float | None = None, sample_from_start: bool | None = None, lr: float = None, optimizer: type[Optimizer] | None = None, plot_index: int | None = None) → dict[str, float]

Tests the policy with online learning. The test sequence runs an episode of the environment and performs gradient_steps training steps after each simulation step in order to perform online learning. To disable online learning, set gradient_steps or the learning rate to 0, or set a very big batch size.

Parameters:

env – Environment to be used in testing.
gradient_steps – Number of gradient ascent steps to perform after each simulation step.
use_train_buffer – If True, the test period also makes use of experiences in the training replay buffer to perform online training. Set this option to True if the test period is immediately after the training period.
update_buffer – If True, replay buffers will be updated after gradient ascent.
policy – Policy architecture to be used. If None, it will use the training architecture.
replay_buffer – Class of replay buffer to be used. If None, it will use the training replay buffer.
batch_size – Batch size to train neural network. If None, it will use the training batch size.
sample_bias – Probability of success of a trial in a geometric distribution. Only used if buffer is GeometricReplayBuffer. If None, it will use the training sample bias.
sample_from_start – If True, will choose a sequence starting from the start of the buffer. Otherwise, it will start from the end. Only used if buffer is GeometricReplayBuffer. If None, it will use the training sample_from_start.
lr – Policy neural network learning rate. If None, it will use the training learning rate.
optimizer – Optimizer of neural network. If None, it will use the training optimizer.
plot_index – Index (x-axis) to be used to plot metrics. If None, no plotting is performed.
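The roles of sample_bias and sample_from_start can be sketched with a small standard-library function that picks the first index of a contiguous batch. This is an assumption-laden illustration, not the code of GeometricReplayBuffer: the function name sample_batch_start is hypothetical, and the offset is assumed to follow a geometric distribution with success probability sample_bias, counted from the end of the buffer unless sample_from_start is True.

```python
import math
import random


def sample_batch_start(buffer_len, batch_size, sample_bias,
                       sample_from_start=False, rng=random):
    """Return the first index of a contiguous batch of experiences.

    Hypothetical helper: the offset k is drawn with probability
    sample_bias * (1 - sample_bias) ** k and clipped to the valid range.
    """
    if not 0.0 < sample_bias <= 1.0:
        raise ValueError("sample_bias must lie in (0, 1]")
    if batch_size > buffer_len:
        raise ValueError("batch_size cannot exceed buffer_len")
    max_offset = buffer_len - batch_size
    if sample_bias == 1.0:
        offset = 0  # every trial succeeds, so the offset is always zero
    else:
        u = rng.random()  # uniform in [0, 1)
        # Inverse-CDF sampling of a geometric distribution on {0, 1, 2, ...}.
        offset = int(math.log1p(-u) / math.log1p(-sample_bias))
    offset = min(offset, max_offset)
    # Count the offset from the start or from the end of the buffer.
    return offset if sample_from_start else max_offset - offset


latest = sample_batch_start(1000, 100, sample_bias=1.0)
earliest = sample_batch_start(1000, 100, sample_bias=1.0, sample_from_start=True)
```

In this sketch, a sample_bias of 1.0 always selects the most recent batch (or the oldest one when sample_from_start is True), while smaller values spread the selection further across the buffer.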

Note

To disable online learning, set the learning rate to 0 or use a very big batch size.

Returns:

Dictionary with episode metrics.

Training sequence. Initially, the algorithm runs a full episode without any training in order to fill the replay buffers. Then, several training steps are executed using data from the replay buffer in order to maximize the objective function. At the end of each training step, the buffer is updated with new outputs of the policy network.

Note

The validation step is run after every val_period training steps. This step simply runs an episode of the testing environment performing val_gradient_steps training steps after each simulation step, in order to perform online learning. To disable online learning, set gradient steps or learning rate to 0, or set a very big batch size.

Parameters:

steps – Number of training steps.
logging_period – Number of training steps of gradient ascent to perform before running a full episode and logging the agent’s metrics. If None, logging is performed at the end of the whole training procedure.
val_period – Number of training steps to perform before running a full episode in the validation environment and logging metrics. If None, validation happens at the end of the whole training procedure.
val_env – Validation environment. If None, no validation is performed.
val_gradient_steps – Number of gradient ascent steps to perform after each simulation step in the validation period.
val_use_train_buffer – If True, the validation period also makes use of experiences in the training replay buffer to perform online training. Set this option to True if the validation period is immediately after the training period.
val_replay_buffer – Type of replay buffer to use in validation. If None, it will be equal to the training replay buffer.
val_batch_size – Batch size to use in validation. If None, the training batch size is used.
val_sample_bias – Sample bias to be used if replay buffer is GeometricReplayBuffer. If None, the training sample bias is used.
val_sample_from_start – If True, the GeometricReplayBuffer will perform geometric distribution sampling from the beginning of the ordered experiences. If None, the training sample_from_start is used.
val_lr – Learning rate to perform gradient ascent in validation. If None, the training learning rate is used instead.
val_optimizer – Type of optimizer to use in the validation. If None, the same type used in training is set.
progress_bar – If “permanent”, a progress bar is displayed and is kept when completed. If “temporary”, a progress bar is displayed but is deleted when completed. If None (or any other value), no progress bar is displayed.
name – Name of the training sequence (it is displayed in the progress bar).
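The interplay of steps, logging_period and val_period can be pictured with a short sketch that lists when logging and validation would occur. It assumes that an event fires whenever the step count is a multiple of its period and that a period of None means "once, at the final step"; the function plan_events and its has_val_env flag are hypothetical and only mirror the parameter descriptions above, not the library's internals.

```python
def plan_events(steps, logging_period=None, val_period=None, has_val_env=True):
    """List (step, event) pairs for a training run.

    Assumption: None means the event happens once, after the last step.
    """
    log_every = steps if logging_period is None else logging_period
    val_every = steps if val_period is None else val_period
    events = []
    for step in range(1, steps + 1):
        if step % log_every == 0:
            events.append((step, "log"))
        # Without a validation environment, no validation is planned.
        if has_val_env and step % val_every == 0:
            events.append((step, "validate"))
    return events


events = plan_events(steps=10, logging_period=5, val_period=None)
no_val = plan_events(steps=10, logging_period=5, has_val_env=False)
```

Here events contains logging at steps 5 and 10 and a single validation at step 10, while no_val contains only the two logging entries.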

Returns:

The following tuple is returned: (metrics, val_metrics).

metrics: Dictionary with metrics of the agent performance in the training environment. If None, no training was performed.
val_metrics: Dictionary with metrics of the agent performance in the validation environment. If None, no validation was performed.