holocron.optim
To use holocron.optim, you have to construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.
Optimizers
- class holocron.optim.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, scale_clip=None)
Implements the Lamb optimizer from https://arxiv.org/pdf/1904.00962v3.pdf.
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – beta coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
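To make the roles of betas, eps, weight_decay and scale_clip concrete, here is a minimal sketch of a single LAMB-style update for one layer, written with plain Python lists. It follows the general recipe of the paper (Adam-style direction, then a layer-wise trust ratio). The function name lamb_step is hypothetical, and clamping the weight norm with scale_clip is an assumption based on the parameter description above, not a statement about holocron's actual implementation.

```python
import math


def lamb_step(w, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, scale_clip=None):
    """One LAMB-style update for a single layer (hypothetical helper)."""
    beta1, beta2 = betas
    # Running averages of the gradient and of its square
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction (t is the 1-based step count)
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Adam-style direction plus the weight decay term
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Layer-wise trust ratio; the weight norm is clamped (assumed use of scale_clip)
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    if scale_clip is not None:
        w_norm = min(max(w_norm, scale_clip[0]), scale_clip[1])
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return w, m, v


new_w, new_m, new_v = lamb_step([3.0, 4.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

With the trust ratio, the size of the step is tied to the norm of the layer's weights rather than to the raw gradient magnitude.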
- class holocron.optim.Lars(params, lr=0.001, momentum=0, dampening=0, weight_decay=0, nesterov=False, scale_clip=None)
Implements the LARS optimizer from https://arxiv.org/pdf/1708.03888.pdf
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
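The "local LR" mentioned for scale_clip can be sketched as follows. This is a minimal illustration of the layer-wise ratio described in the LARS paper (weight norm divided by gradient norm plus weight decay times weight norm), without the paper's trust coefficient. The helper name lars_local_lr is hypothetical, and applying scale_clip as a clamp on the weight norm is an assumption drawn from the parameter description.

```python
import math


def lars_local_lr(w, g, weight_decay=0.0, scale_clip=None):
    """Layer-wise learning rate multiplier in the spirit of LARS (hypothetical helper)."""
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    if scale_clip is not None:
        # Assumed behaviour: clamp the weight norm to [lower, upper]
        w_norm = min(max(w_norm, scale_clip[0]), scale_clip[1])
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    denom = g_norm + weight_decay * w_norm
    # Fall back to a neutral multiplier when either norm vanishes
    if w_norm == 0 or denom == 0:
        return 1.0
    return w_norm / denom


ratio = lars_local_lr([3.0, 4.0], [0.6, 0.8])
clipped = lars_local_lr([3.0, 4.0], [0.6, 0.8], scale_clip=(0.0, 2.0))
```

The global lr is then multiplied by this ratio for each layer, so layers with large weights and small gradients take proportionally larger steps.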
- class holocron.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Implements the RAdam optimizer from https://arxiv.org/pdf/1908.03265.pdf
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
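What distinguishes RAdam from Adam is a variance rectification term that depends only on the step count and the second beta coefficient. The sketch below computes that term from the formulas in the RAdam paper; the function name radam_rectification is hypothetical, and the threshold of 4 is the one used in the paper (implementations sometimes pick a slightly different value).

```python
import math


def radam_rectification(step, beta2=0.999):
    """Return the RAdam rectification factor, or None while the variance is not tractable."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Early steps: the adaptive term is skipped in favour of a momentum-only update
        return None
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


early = radam_rectification(1)
later = radam_rectification(1000)
```

During the first few steps the factor is undefined and the update falls back to plain momentum; afterwards the factor grows towards 1, at which point the update matches Adam.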
- class holocron.optim.RaLars(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, force_adaptive_momentum=False, scale_clip=None)
Implements the RAdam optimizer from https://arxiv.org/pdf/1908.03265.pdf with optional Layer-wise adaptive Scaling from https://arxiv.org/pdf/1708.03888.pdf
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
force_adaptive_momentum (bool, optional) – use adaptive momentum if variance is not tractable (default: False)
scale_clip (float, optional) – the maximal upper bound for the scale factor of LARS
Optimizer wrappers
holocron.optim implements optimizer wrappers.
A base optimizer should always be passed to the wrapper; e.g., you should write your code this way:
>>> optimizer = ...
>>> optimizer = wrapper(optimizer)
- class holocron.optim.wrapper.Lookahead(base_optimizer, sync_rate=0.5, sync_period=6)
Implements the Lookahead optimizer wrapper from https://arxiv.org/pdf/1907.08610.pdf
- Parameters:
base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer
sync_rate (float, optional) – rate of weight synchronization
sync_period (int, optional) – number of steps performed on fast weights before weight synchronization
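The interplay of sync_rate and sync_period can be illustrated with a small self-contained sketch of the Lookahead idea: the base optimizer updates the fast weights at every step, and every sync_period steps the slow weights move a fraction sync_rate towards the fast weights, which are then reset to the slow weights. The class LookaheadSketch and the base_step callback are hypothetical names used only for illustration; this is not the holocron wrapper itself.

```python
class LookaheadSketch:
    """Toy Lookahead wrapper operating on a plain list of floats (hypothetical)."""

    def __init__(self, params, sync_rate=0.5, sync_period=6):
        self.fast = params          # updated in place by the base optimizer
        self.slow = list(params)    # slow copy, updated only at synchronization
        self.sync_rate = sync_rate
        self.sync_period = sync_period
        self.counter = 0

    def step(self, base_step):
        # base_step stands in for the base optimizer: it mutates the fast weights
        base_step(self.fast)
        self.counter += 1
        if self.counter % self.sync_period == 0:
            for i in range(len(self.fast)):
                # Interpolate slow weights towards fast weights, then reset fast weights
                self.slow[i] += self.sync_rate * (self.fast[i] - self.slow[i])
                self.fast[i] = self.slow[i]


def add_one(params):
    # Dummy "optimizer step" that increases every weight by 1
    for i in range(len(params)):
        params[i] += 1.0


weights = [0.0]
lookahead = LookaheadSketch(weights, sync_rate=0.5, sync_period=2)
lookahead.step(add_one)
lookahead.step(add_one)
```

In this toy run the fast weight reaches 2.0 after two steps, and the synchronization pulls both copies back to the midpoint between the slow and fast values.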
- class holocron.optim.wrapper.Scout(base_optimizer, sync_rate=0.5, sync_period=6)
Implements a new optimizer wrapper based on the initial Lookahead paper https://arxiv.org/pdf/1907.08610.pdf
- Parameters:
base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer
sync_rate (float, optional) – rate of weight synchronization
sync_period (int, optional) – number of steps performed on fast weights before weight synchronization
Learning rate schedulers
holocron.optim.lr_scheduler provides methods to adjust the learning rate over the course of training. holocron.optim.lr_scheduler.OneCycleScheduler cycles the learning rate, and optionally the momentum, over a fixed number of training iterations.
Learning rate scheduling should be applied after the optimizer’s update; e.g., you should write your code this way:
>>> scheduler = ...
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
- class holocron.optim.lr_scheduler.OneCycleScheduler(optimizer, total_size, max_lr=None, warmup_ratio=0.3, phases=None, base_ratio=0.2, final_ratio=None, cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)
Implements the One Cycle scheduler from https://arxiv.org/pdf/1803.09820.pdf
- Parameters:
optimizer (Optimizer) – Wrapped optimizer.
total_size (int) – Number of training iterations to be performed
max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on scaling function.
warmup_ratio (float) – ratio of iterations used to reach max_lr
phases (tuple) – specify the scaling mode of both phases (possible values: ‘linear’, ‘cosine’)
base_ratio (float) – ratio between base_lr and max_lr during warmup phase
final_ratio (float) – ratio between base_lr and max_lr during last phase
cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True
base_momentum (float or list) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’. Default: 0.8
max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any cycle is the difference of max_momentum and some scaling of the amplitude; therefore base_momentum may not actually be reached depending on scaling function. Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’. Default: 0.9
last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1
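As an illustration of how total_size, warmup_ratio, phases, base_ratio and final_ratio could combine, here is a minimal sketch of a two-phase one-cycle learning rate curve. It is not the holocron implementation: the function name one_cycle_lr is hypothetical, the default phases of ('linear', 'cosine') is an assumption (the class defaults to None), and falling back to base_ratio when final_ratio is None is also an assumption.

```python
import math


def one_cycle_lr(step, total_size, max_lr, warmup_ratio=0.3,
                 phases=("linear", "cosine"), base_ratio=0.2, final_ratio=None):
    """Learning rate at a given iteration of a two-phase one-cycle schedule (sketch)."""
    base_lr = base_ratio * max_lr
    # Assumption: without final_ratio, the schedule ends back at base_lr
    final_lr = base_lr if final_ratio is None else final_ratio * max_lr
    warmup_steps = int(warmup_ratio * total_size)
    if step < warmup_steps:
        # Phase 1: climb from base_lr to max_lr
        start, end, mode = base_lr, max_lr, phases[0]
        pct = step / warmup_steps
    else:
        # Phase 2: anneal from max_lr to final_lr
        start, end, mode = max_lr, final_lr, phases[1]
        pct = (step - warmup_steps) / max(1, total_size - warmup_steps)
    if mode == "cosine":
        # Map linear progress onto a half cosine wave
        pct = 0.5 * (1.0 - math.cos(math.pi * pct))
    return start + pct * (end - start)


schedule = [one_cycle_lr(i, total_size=100, max_lr=1.0) for i in range(101)]
```

Plotting schedule against the iteration index shows the warmup ramp over the first 30 iterations followed by a cosine decay; when cycle_momentum is enabled, the momentum would follow the mirrored curve between max_momentum and base_momentum.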