holocron.optim
To use holocron.optim, you have to construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.
Optimizers
Implementations of recent parameter optimizers for PyTorch modules.
- class holocron.optim.Lamb(params: Iterable[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, scale_clip: Tuple[float, float] | None = None)
Implements the Lamb optimizer from “Large batch optimization for deep learning: training BERT in 76 minutes”.
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – beta coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
- class holocron.optim.Lars(params: Iterable[Parameter], lr: float = 0.001, momentum: float = 0.0, dampening: float = 0.0, weight_decay: float = 0.0, nesterov: bool = False, scale_clip: Tuple[float, float] | None = None)
Implements the LARS optimizer from “Large batch training of convolutional networks”.
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
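As a rough illustration of the layer-wise scaling that Lars and Lamb build on, the sketch below computes a local learning rate as the ratio of the weight norm to the update norm, with the weight norm clamped to the scale_clip bounds. The helper names (`l2_norm`, `local_lr`), the plain-list representation, and the zero-norm fallback are assumptions made for illustration; this is not holocron's implementation.

```python
import math
from typing import Optional, Sequence, Tuple


def l2_norm(values: Sequence[float]) -> float:
    """Euclidean norm of a flat sequence of floats."""
    return math.sqrt(sum(v * v for v in values))


def local_lr(
    weights: Sequence[float],
    update: Sequence[float],
    scale_clip: Optional[Tuple[float, float]] = None,
) -> float:
    """Layer-wise ratio ||w|| / ||update|| (hypothetical helper)."""
    w_norm = l2_norm(weights)
    if scale_clip is not None:
        # Clamp the weight norm to [lower, upper] before forming the ratio
        w_norm = max(scale_clip[0], min(w_norm, scale_clip[1]))
    u_norm = l2_norm(update)
    # Assumed fallback: use a neutral scale when either norm is zero
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0
    return w_norm / u_norm


weights = [3.0, 4.0]  # norm 5
update = [0.6, 0.8]   # norm 1
unclipped = local_lr(weights, update)
clipped = local_lr(weights, update, scale_clip=(0.0, 2.0))
```

With these toy values, clipping the weight norm to an upper bound of 2 lowers the local learning rate compared with the unclipped ratio.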
- class holocron.optim.RAdam(params: Iterable[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)
Implements the RAdam optimizer from “On the variance of the Adaptive Learning Rate and Beyond”.
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
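The variance rectification at the heart of RAdam can be sketched in a few lines. The function below follows the formula from the RAdam paper, using the paper's tractability threshold of 4; the function name is hypothetical, and the threshold that holocron actually applies is not stated here, so treat this as an illustration rather than the library's code.

```python
import math
from typing import Optional


def radam_rectification(step: int, beta2: float = 0.999) -> Optional[float]:
    """Return the rectification factor r_t, or None if the variance is not tractable."""
    # Maximum length of the approximated simple moving average
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated SMA at the current step (step starts at 1)
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Variance not tractable: the paper falls back to an unadapted momentum update
        return None
    return math.sqrt(
        ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


early = radam_rectification(1)      # not tractable on the first step
late = radam_rectification(10_000)  # approaches 1 as training progresses
```

The factor grows towards 1 over time, so the adaptive step is damped early in training and left almost untouched later on.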
- class holocron.optim.RaLars(params: Iterable[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, force_adaptive_momentum: bool = False, scale_clip: Tuple[float, float] | None = None)
Implements the RAdam optimizer from “On the variance of the Adaptive Learning Rate and Beyond” with optional layer-wise adaptive scaling from “Large Batch Training of Convolutional Networks”.
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
force_adaptive_momentum (bool, optional) – use adaptive momentum if variance is not tractable (default: False)
scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
- class holocron.optim.TAdam(params: Iterable[Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, amsgrad: bool = False, dof: float | None = None)
Implements the TAdam optimizer from “TAdam: A Robust Stochastic Gradient Optimizer”.
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
amsgrad (bool, optional) – whether to use the AMSGrad variant (default: False)
dof (float, optional) – degrees of freedom
- class holocron.optim.AdaBelief(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
Implements the AdaBelief optimizer from “AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients”.
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
amsgrad (bool, optional) – whether to use the AMSGrad variant (default: False)
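To make the difference from Adam concrete, here is a minimal scalar sketch of one AdaBelief step as described in the paper: the second-moment estimate tracks the squared deviation of the gradient from its running mean rather than the squared gradient itself. The function name and the scalar state are hypothetical, and weight decay and the AMSGrad variant are left out; this is not holocron's implementation.

```python
import math
from typing import Tuple


def adabelief_step(
    param: float,
    grad: float,
    exp_avg: float,
    exp_avg_var: float,
    step: int,
    lr: float = 1e-3,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
) -> Tuple[float, float, float]:
    """One scalar AdaBelief update (no weight decay, no AMSGrad)."""
    beta1, beta2 = betas
    # Running mean of the gradient, as in Adam
    exp_avg = beta1 * exp_avg + (1.0 - beta1) * grad
    # Running mean of the squared deviation from that mean (the "belief")
    residual = grad - exp_avg
    exp_avg_var = beta2 * exp_avg_var + (1.0 - beta2) * residual**2 + eps
    # Bias corrections (step starts at 1)
    m_hat = exp_avg / (1.0 - beta1**step)
    s_hat = exp_avg_var / (1.0 - beta2**step)
    param = param - lr * m_hat / (math.sqrt(s_hat) + eps)
    return param, exp_avg, exp_avg_var


new_param, m, s = adabelief_step(param=1.0, grad=1.0, exp_avg=0.0, exp_avg_var=0.0, step=1)
```

When the observed gradient stays close to its running mean, the variance term is small and the step is larger; a gradient that deviates strongly from the mean yields a smaller step.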
Optimizer wrappers
holocron.optim also implements optimizer wrappers.
A base optimizer should always be passed to the wrapper; e.g., you should write your code this way:
>>> optimizer = ...
>>> optimizer = wrapper(optimizer)
- class holocron.optim.wrapper.Lookahead(base_optimizer: Optimizer, sync_rate=0.5, sync_period=6)
Implements the Lookahead optimizer wrapper from “Lookahead Optimizer: k steps forward, 1 step back”.
- Parameters:
base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer
sync_rate (float, optional) – rate of weight synchronization
sync_period (int, optional) – number of steps performed on fast weights before weight synchronization
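The roles of sync_rate and sync_period can be illustrated with a small pure-Python sketch of the Lookahead idea: fast weights are updated at every step, and every sync_period steps the slow weights move a fraction sync_rate of the way towards them before the fast weights are reset. The class name `LookaheadSketch` and the callable `base_update` are hypothetical stand-ins for the wrapper and the base optimizer; this is not holocron's implementation.

```python
from typing import Callable, List


class LookaheadSketch:
    """Toy Lookahead on a flat list of floats (illustrative only)."""

    def __init__(self, weights: List[float], sync_rate: float = 0.5, sync_period: int = 6) -> None:
        self.fast = list(weights)
        self.slow = list(weights)
        self.sync_rate = sync_rate
        self.sync_period = sync_period
        self.step_count = 0

    def step(self, base_update: Callable[[List[float]], List[float]]) -> None:
        # The base optimizer always acts on the fast weights
        self.fast = base_update(self.fast)
        self.step_count += 1
        if self.step_count % self.sync_period == 0:
            # Move the slow weights part of the way towards the fast weights
            self.slow = [s + self.sync_rate * (f - s) for s, f in zip(self.slow, self.fast)]
            # Restart the fast weights from the synchronized slow weights
            self.fast = list(self.slow)


wrapper = LookaheadSketch([6.0], sync_rate=0.5, sync_period=6)
for _ in range(6):
    # Stand-in for a base optimizer step: subtract 1 from each weight
    wrapper.step(lambda ws: [w - 1.0 for w in ws])
```

After six steps the fast weight has gone from 6 to 0, so the slow weight lands halfway at 3 and the fast weight restarts from there.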
- class holocron.optim.wrapper.Scout(base_optimizer: Optimizer, sync_rate=0.5, sync_period=6)
Implements a new optimizer wrapper based on “Lookahead Optimizer: k steps forward, 1 step back”.
- Parameters:
base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer
sync_rate (float, optional) – rate of weight synchronization
sync_period (int, optional) – number of steps performed on fast weights before weight synchronization
Learning rate schedulers
holocron.optim.lr_scheduler provides methods to adjust the learning rate over the course of training. holocron.optim.lr_scheduler.OneCycleScheduler varies the learning rate, and optionally the momentum, over a single cycle of training iterations.
Learning rate scheduling should be applied after optimizer’s update; e.g., you should write your code this way:
>>> scheduler = ...
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
- class holocron.optim.lr_scheduler.OneCycleScheduler(optimizer, total_size, max_lr=None, warmup_ratio=0.3, phases=None, base_ratio=0.2, final_ratio=None, cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)
Implements the One Cycle scheduler from “A disciplined approach to neural network hyper-parameters”. Please note that this implementation predates PyTorch's own support for this scheduler; using the official PyTorch implementation is advised.
- Parameters:
optimizer (Optimizer) – Wrapped optimizer.
total_size (int) – Number of training iterations to be performed
max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on scaling function.
warmup_ratio (float) – ratio of iterations used to reach max_lr
phases (tuple) – specify the scaling mode of both phases (possible values: ‘linear’, ‘cosine’)
base_ratio (float) – ratio between base_lr and max_lr during warmup phase
final_ratio (float) – ratio between base_lr and max_lr during last phase
cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True
base_momentum (float or list) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’. Default: 0.8
max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any cycle is the difference of max_momentum and some scaling of the amplitude; therefore base_momentum may not actually be reached depending on scaling function. Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’. Default: 0.9
last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1
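The interplay of warmup_ratio, base_ratio, final_ratio, and the momentum bounds can be sketched as a pure function of the iteration index. The code below is an illustration under stated assumptions, not holocron's implementation: the function names are hypothetical, the default phases of ('linear', 'cosine') are assumed, and a final_ratio of None is assumed to fall back to base_ratio.

```python
import math
from typing import Optional, Tuple


def _interpolate(start: float, end: float, pct: float, mode: str = "linear") -> float:
    """Move from start (pct=0) to end (pct=1) with linear or cosine scaling."""
    if mode == "cosine":
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))
    return start + (end - start) * pct


def one_cycle_lr(
    iteration: int,
    total_size: int,
    max_lr: float,
    warmup_ratio: float = 0.3,
    base_ratio: float = 0.2,
    final_ratio: Optional[float] = None,
    phases: Tuple[str, str] = ("linear", "cosine"),  # assumed default
) -> float:
    base_lr = base_ratio * max_lr
    # Assumption: without final_ratio, anneal back down to base_lr
    final_lr = base_lr if final_ratio is None else final_ratio * max_lr
    warmup_iters = int(warmup_ratio * total_size)
    if iteration < warmup_iters:
        # Warmup phase: base_lr -> max_lr
        return _interpolate(base_lr, max_lr, iteration / warmup_iters, phases[0])
    pct = (iteration - warmup_iters) / max(1, total_size - warmup_iters)
    # Annealing phase: max_lr -> final_lr
    return _interpolate(max_lr, final_lr, min(pct, 1.0), phases[1])


def one_cycle_momentum(
    iteration: int,
    total_size: int,
    warmup_ratio: float = 0.3,
    base_momentum: float = 0.8,
    max_momentum: float = 0.9,
    phases: Tuple[str, str] = ("linear", "cosine"),  # assumed default
) -> float:
    warmup_iters = int(warmup_ratio * total_size)
    if iteration < warmup_iters:
        # Momentum moves inversely to the learning rate: max -> base during warmup
        return _interpolate(max_momentum, base_momentum, iteration / warmup_iters, phases[0])
    pct = (iteration - warmup_iters) / max(1, total_size - warmup_iters)
    return _interpolate(base_momentum, max_momentum, min(pct, 1.0), phases[1])


lrs = [one_cycle_lr(i, total_size=100, max_lr=1.0) for i in range(101)]
momenta = [one_cycle_momentum(i, total_size=100) for i in range(101)]
```

In this sketch the learning rate peaks at the end of warmup (iteration 30 of 100), which is exactly where the momentum reaches its lower bound.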