holocron.optim#

To use holocron.optim, you have to construct an optimizer object that will hold the current state and update the parameters based on the computed gradients.
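
As a quick illustration of this pattern (a minimal sketch, not taken from the library documentation, assuming these optimizers follow the standard torch.optim.Optimizer interface of step() and zero_grad()):

>>> import torch
>>> from torch import nn
>>> from holocron.optim import Lamb
>>> model = nn.Linear(8, 2)
>>> optimizer = Lamb(model.parameters(), lr=1e-3)
>>> loss = model(torch.rand(4, 8)).sum()
>>> loss.backward()
>>> optimizer.step()
>>> optimizer.zero_grad()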

Optimizers#

Implementations of recent parameter optimizers for PyTorch modules.

class holocron.optim.Lamb(params: Iterable[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, scale_clip: Tuple[float, float] | None = None)[source]#

Implements the Lamb optimizer from “Large batch optimization for deep learning: training BERT in 76 minutes”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – beta coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
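
For illustration only (not part of the original docstring), a construction sketch matching the signature above; the scale_clip bounds are arbitrary placeholder values:

>>> from torch import nn
>>> from holocron.optim import Lamb
>>> optimizer = Lamb(nn.Linear(8, 2).parameters(), lr=1e-3, scale_clip=(0.0, 10.0))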

class holocron.optim.Lars(params: Iterable[Parameter], lr: float = 0.001, momentum: float = 0.0, dampening: float = 0.0, weight_decay: float = 0.0, nesterov: bool = False, scale_clip: Tuple[float, float] | None = None)[source]#

Implements the LARS optimizer from “Large batch training of convolutional networks”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

  • scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
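
A construction sketch for illustration, with hyperparameter values chosen arbitrarily rather than taken from the paper:

>>> from torch import nn
>>> from holocron.optim import Lars
>>> optimizer = Lars(nn.Linear(8, 2).parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4, nesterov=True)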

class holocron.optim.RaLars(params: Iterable[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, force_adaptive_momentum: bool = False, scale_clip: Tuple[float, float] | None = None)[source]#

Implements the RAdam optimizer from “On the Variance of the Adaptive Learning Rate and Beyond” with optional layer-wise adaptive rate scaling (LARS) from “Large Batch Training of Convolutional Networks”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • force_adaptive_momentum (bool, optional) – use adaptive momentum if variance is not tractable (default: False)

  • scale_clip (tuple, optional) – the lower and upper bounds for the scale factor of LARS
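
A construction sketch for illustration (the values shown are placeholders, not recommendations):

>>> from torch import nn
>>> from holocron.optim import RaLars
>>> optimizer = RaLars(nn.Linear(8, 2).parameters(), lr=1e-3, force_adaptive_momentum=True)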

class holocron.optim.TAdam(params: Iterable[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, amsgrad: bool = False, dof: float | None = None)[source]#

Implements the TAdam optimizer from “TAdam: A Robust Stochastic Gradient Optimizer”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant (default: False)

  • dof (float, optional) – degrees of freedom (default: None)
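
A construction sketch for illustration; the dof value here is an arbitrary placeholder:

>>> from torch import nn
>>> from holocron.optim import TAdam
>>> optimizer = TAdam(nn.Linear(8, 2).parameters(), lr=1e-3, dof=4.0)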

class holocron.optim.AdaBelief(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)[source]#

Implements the AdaBelief optimizer from “AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant (default: False)
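
A construction sketch for illustration, enabling the AMSGrad variant documented above:

>>> from torch import nn
>>> from holocron.optim import AdaBelief
>>> optimizer = AdaBelief(nn.Linear(8, 2).parameters(), lr=1e-3, amsgrad=True)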

class holocron.optim.AdamP(params: Iterable[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, amsgrad: bool = False, delta: float = 0.1)[source]#

Implements the AdamP optimizer from “AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (bool, optional) – whether to use the AMSGrad variant (default: False)

  • delta (float, optional) – delta threshold for projection (default: 0.1)
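
A construction sketch for illustration; the delta value shown is simply the default from the signature above:

>>> from torch import nn
>>> from holocron.optim import AdamP
>>> optimizer = AdamP(nn.Linear(8, 2).parameters(), lr=1e-3, delta=0.1)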

Optimizer wrappers#

holocron.optim also implements optimizer wrappers.

A base optimizer should always be passed to the wrapper; e.g., you should write your code this way:

>>> optimizer = ...
>>> optimizer = wrapper(optimizer)
class holocron.optim.wrapper.Lookahead(base_optimizer: Optimizer, sync_rate=0.5, sync_period=6)[source]#

Implements the Lookahead optimizer wrapper from “Lookahead Optimizer: k steps forward, 1 step back”.

>>> from torch.optim import AdamW
>>> from holocron.optim.wrapper import Lookahead
>>> model = ...
>>> opt = AdamW(model.parameters(), lr=3e-4)
>>> opt_wrapper = Lookahead(opt)
Parameters:
  • base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer

  • sync_rate (float, optional) – rate of weight synchronization (default: 0.5)

  • sync_period (int, optional) – number of steps performed on fast weights before weight synchronization (default: 6)
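
Continuing the snippet above, a hedged sketch of one optimization step, assuming the wrapper forwards step() and zero_grad() to the base optimizer (criterion, inputs and targets are hypothetical placeholders):

>>> loss = criterion(model(inputs), targets)
>>> loss.backward()
>>> opt_wrapper.step()
>>> opt_wrapper.zero_grad()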

class holocron.optim.wrapper.Scout(base_optimizer: Optimizer, sync_rate=0.5, sync_period=6)[source]#

Implements a new optimizer wrapper based on “Lookahead Optimizer: k steps forward, 1 step back”.

>>> from torch.optim import AdamW
>>> from holocron.optim.wrapper import Scout
>>> model = ...
>>> opt = AdamW(model.parameters(), lr=3e-4)
>>> opt_wrapper = Scout(opt)
Parameters:
  • base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer

  • sync_rate (float, optional) – rate of weight synchronization (default: 0.5)

  • sync_period (int, optional) – number of steps performed on fast weights before weight synchronization (default: 6)
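
For illustration, the synchronization settings can also be passed explicitly; the values below are simply the defaults from the signature above:

>>> opt_wrapper = Scout(opt, sync_rate=0.5, sync_period=6)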