hive.agents.ddpg module

class hive.agents.ddpg.DDPG(observation_space, action_space, representation_net=None, actor_net=None, critic_net=None, init_fn=None, actor_optimizer_fn=None, critic_optimizer_fn=None, critic_loss_fn=None, stack_size=1, replay_buffer=None, discount_rate=0.99, n_step=1, grad_clip=None, reward_clip=None, soft_update_fraction=0.005, batch_size=64, logger=None, log_frequency=100, update_frequency=1, action_noise=0, min_replay_history=1000, device='cpu', id=0)[source]

Bases: TD3

An agent implementing the DDPG algorithm. It is implemented by fixing the n_critics, policy_update_frequency, target_noise, and target_noise_clip parameters of the TD3 agent.
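As a rough sketch of that relationship (assuming the TD3 base class is importable from hive.agents.td3, and using the standard DDPG choices of a single critic, an actor update on every critic update, and no target-policy smoothing noise; consult the library source for the exact fixed values):

    from hive.agents.td3 import TD3

    class DDPGSketch(TD3):
        """Illustrative only: DDPG expressed as a TD3 agent with pinned hyperparameters."""

        def __init__(self, observation_space, action_space, **kwargs):
            super().__init__(
                observation_space,
                action_space,
                n_critics=1,                # single critic: no clipped double-Q estimation
                policy_update_frequency=1,  # actor updated on every critic update
                target_noise=0.0,           # no smoothing noise added to target actions
                target_noise_clip=0.0,
                **kwargs,
            )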

Parameters
  • observation_space (gym.spaces.Box) – Observation space for the agent.

  • action_space (gym.spaces.Box) – Action space for the agent.

  • representation_net (FunctionApproximator) – The network that encodes the observations that are then fed into the actor_net and critic_net. If None, defaults to Identity.

  • actor_net (FunctionApproximator) – The network that takes the encoded observations from representation_net and outputs the representations used to compute the actions (i.e. everything except the final layer).

  • critic_net (FunctionApproximator) – The network that takes two inputs: the encoded observations from representation_net and actions. It outputs the representations used to compute the values of the actions (i.e. everything except the final layer).

  • init_fn (InitializationFn) – Initializes the weights of agent networks using create_init_weights_fn.

  • actor_optimizer_fn (OptimizerFn) – A function that takes in the list of parameters of the actor and returns the optimizer for the actor. If None, defaults to Adam.

  • critic_optimizer_fn (OptimizerFn) – A function that takes in the list of parameters of the critic and returns the optimizer for the critic. If None, defaults to Adam.

  • critic_loss_fn (LossFn) – The loss function used to optimize the critic. If None, defaults to MSELoss.

  • stack_size (int) – Number of observations stacked to create the state fed to the agent.

  • replay_buffer (BaseReplayBuffer) – The replay buffer that the agent will push observations to and sample from during learning. If None, defaults to CircularReplayBuffer.

  • discount_rate (float) – A number between 0 and 1 specifying how much future rewards are discounted by the agent.

  • n_step (int) – The horizon used in n-step returns to compute TD(n) targets.

  • grad_clip (float) – Gradients will be clipped to the range [-grad_clip, grad_clip].

  • reward_clip (float) – Rewards will be clipped to the range [-reward_clip, reward_clip].

  • soft_update_fraction (float) – The fraction of the online network parameters mixed into the target network parameters in a soft (polyak) update. Also known as tau.

  • batch_size (int) – The size of the batch sampled from the replay buffer during learning.

  • logger (Logger) – Logger used to log agent’s metrics.

  • log_frequency (int) – How often to log the agent’s metrics.

  • update_frequency (int) – How frequently to update the agent. A value of 1 means the agent will be updated every time update is called.

  • action_noise (float) – The standard deviation for the noise added to the action taken by the agent during training.

  • min_replay_history (int) – How many observations to fill the replay buffer with before starting to learn.

  • device – Device on which all computations should be run.

  • id – Agent identifier.
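Example

A minimal construction sketch for a continuous-control task. The Box spaces below are hypothetical (a Pendulum-like environment), and whether the default networks (representation_net, actor_net, and critic_net left as None) are adequate depends on the task; they are left at their defaults here only to keep the example self-contained.

    import gym
    import numpy as np

    from hive.agents.ddpg import DDPG

    # Hypothetical continuous observation and action spaces.
    observation_space = gym.spaces.Box(
        low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32
    )
    action_space = gym.spaces.Box(low=-2.0, high=2.0, shape=(1,), dtype=np.float32)

    agent = DDPG(
        observation_space=observation_space,
        action_space=action_space,
        discount_rate=0.99,          # gamma applied to future rewards
        soft_update_fraction=0.005,  # tau: target <- (1 - tau) * target + tau * online
        batch_size=64,
        action_noise=0.1,            # std of exploration noise added to actions during training
        min_replay_history=1000,     # transitions collected before learning starts
        device="cpu",
    )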