holocron.optim

To use holocron.optim, construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.

Optimizers

class holocron.optim.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, scale_clip=None)

Implements the Lamb optimizer from https://arxiv.org/pdf/1904.00962v3.pdf.

Parameters:

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – beta coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS

class holocron.optim.Lars(params, lr=0.001, momentum=0, dampening=0, weight_decay=0, nesterov=False, scale_clip=None)

Implements the LARS optimizer from https://arxiv.org/pdf/1708.03888.pdf

Parameters:

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
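
The layer-wise scaling that LARS adds on top of SGD can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than holocron's code: `lars_local_lr` is a hypothetical helper, the ratio follows the form given in the LARS paper, and `scale_clip` is assumed to clamp the weight norm as described above.

```python
import math


def lars_local_lr(weights, grads, weight_decay=0.0, scale_clip=None):
    """Layer-wise ratio ||w|| / (||g|| + weight_decay * ||w||) (hypothetical helper)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if scale_clip is not None:
        # Assumption: scale_clip = (lower, upper) bounds on the weight norm
        lower, upper = scale_clip
        w_norm = min(max(w_norm, lower), upper)
    denom = g_norm + weight_decay * w_norm
    # Fall back to a neutral ratio when either norm vanishes
    if w_norm == 0.0 or denom == 0.0:
        return 1.0
    return w_norm / denom


layer_w = [3.0, 4.0]  # weight norm 5
layer_g = [0.6, 0.8]  # gradient norm 1
print(lars_local_lr(layer_w, layer_g))
print(lars_local_lr(layer_w, layer_g, scale_clip=(0.0, 2.0)))
```

The global learning rate is then multiplied by this ratio for each layer, so layers whose gradients are small relative to their weights take proportionally larger steps.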

class holocron.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

Implements the RAdam optimizer from https://arxiv.org/pdf/1908.03265.pdf

Parameters:

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
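
What distinguishes RAdam from Adam is the variance rectification term defined in the paper. The sketch below computes it in plain Python from the second beta coefficient. `radam_rectification` is a hypothetical helper name, and the tractability threshold of 4 is the one used in the paper; holocron's implementation may use a different one.

```python
import math


def radam_rectification(step, beta2=0.999):
    """Return the RAdam rectification factor, or None while the variance is not tractable."""
    # Maximum length of the approximated simple moving average
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated simple moving average at this step (step counts from 1)
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Variance not tractable: the paper falls back to an unadapted momentum update
        return None
    return math.sqrt(
        ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


for step in (1, 10, 100, 100000):
    print(step, radam_rectification(step))
```

The factor is undefined during the first few steps, then grows towards 1, which is what gives RAdam its built-in warmup behaviour.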

class holocron.optim.RaLars(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, force_adaptive_momentum=False, scale_clip=None)

Implements the RAdam optimizer from https://arxiv.org/pdf/1908.03265.pdf with optional layer-wise adaptive scaling (LARS) from https://arxiv.org/pdf/1708.03888.pdf

Parameters:

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate
betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
force_adaptive_momentum (bool, optional) – use adaptive momentum if variance is not tractable (default: False)
scale_clip (float, optional) – the maximal upper bound for the scale factor of LARS

Optimizer wrappers

holocron.optim implements optimizer wrappers.

A base optimizer should always be passed to the wrapper; e.g., you should write your code this way:

>>> optimizer = ...
>>> optimizer = wrapper(optimizer)

class holocron.optim.wrapper.Lookahead(base_optimizer, sync_rate=0.5, sync_period=6)

Implements the Lookahead optimizer wrapper from https://arxiv.org/pdf/1907.08610.pdf

Parameters:

base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer
sync_rate (float, optional) – rate of weight synchronization
sync_period (int, optional) – number of steps performed on fast weights before weight synchronization
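
The interplay between fast and slow weights can be sketched in plain Python. This is an illustration of the rule from the Lookahead paper, not holocron's wrapper: `lookahead_train` and `sgd_on_quadratic` are hypothetical helpers, and `sync_rate` and `sync_period` are assumed to correspond to the paper's slow step size and synchronization interval.

```python
def lookahead_train(fast_step, weights, num_steps, sync_rate=0.5, sync_period=6):
    """Run num_steps fast updates, synchronizing with slow weights every sync_period steps."""
    fast = list(weights)
    slow = list(weights)
    for step in range(1, num_steps + 1):
        # The base optimizer only ever updates the fast weights
        fast = fast_step(fast)
        if step % sync_period == 0:
            # Move the slow weights part of the way towards the fast weights
            slow = [s + sync_rate * (f - s) for s, f in zip(slow, fast)]
            # Restart the fast weights from the synchronized position
            fast = list(slow)
    return fast, slow


def sgd_on_quadratic(w, lr=0.1):
    """Plain SGD step on f(w) = 0.5 * w**2, whose gradient is w."""
    return [wi - lr * wi for wi in w]


fast, slow = lookahead_train(sgd_on_quadratic, [1.0], num_steps=6)
print(fast, slow)
```

With sync_rate=1 the slow weights simply copy the fast ones, so the wrapper has no effect; smaller values interpolate more conservatively.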

class holocron.optim.wrapper.Scout(base_optimizer, sync_rate=0.5, sync_period=6)

Implements a new optimizer wrapper based on the initial Lookahead paper https://arxiv.org/pdf/1907.08610.pdf

Parameters:

base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer
sync_rate (float, optional) – rate of weight synchronization
sync_period (int, optional) – number of steps performed on fast weights before weight synchronization

Learning rate schedulers

holocron.optim.lr_scheduler provides methods to adjust the learning rate over the course of training. holocron.optim.lr_scheduler.OneCycleScheduler follows the One Cycle policy: it adjusts the learning rate, and optionally the momentum, at each training iteration, so its step() should be called after each batch.

Learning rate scheduling should be applied after optimizer’s update; e.g., you should write your code this way:

>>> scheduler = ...
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

class holocron.optim.lr_scheduler.OneCycleScheduler(optimizer, total_size, max_lr=None, warmup_ratio=0.3, phases=None, base_ratio=0.2, final_ratio=None, cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)

Implements the One Cycle scheduler from https://arxiv.org/pdf/1803.09820.pdf

Parameters:

optimizer (Optimizer) – Wrapped optimizer.
total_size (int) – Number of training iterations to be performed
max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on scaling function.
warmup_ratio (float) – ratio of iterations used to reach max_lr
phases (tuple) – specify the scaling mode of both phases (possible values: ‘linear’, ‘cosine’)
base_ratio (float) – ratio between base_lr and max_lr during warmup phase
final_ratio (float) – ratio between base_lr and max_lr during last phase
cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True
base_momentum (float or list) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’. Default: 0.8
max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any cycle is the difference of max_momentum and some scaling of the amplitude; therefore base_momentum may not actually be reached depending on scaling function. Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’. Default: 0.9
last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1
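
To visualize how these parameters shape the schedule, the following pure-Python sketch computes a learning rate for a given iteration. It rests on several assumptions that the parameter list does not confirm: the learning rate starts at base_ratio * max_lr, peaks at max_lr once the warmup ends, and finishes at final_ratio * max_lr (or back at the base value when final_ratio is None), and the default phases are taken to be linear warmup followed by cosine annealing. `one_cycle_lr` and `_interpolate` are hypothetical helpers, not part of holocron.

```python
import math


def _interpolate(start, end, pct, mode="linear"):
    """Move from start (pct=0) to end (pct=1) with linear or cosine scaling."""
    if mode == "cosine":
        return end + (start - end) * (1.0 + math.cos(math.pi * pct)) / 2.0
    return start + (end - start) * pct


def one_cycle_lr(iteration, total_size, max_lr, warmup_ratio=0.3,
                 phases=("linear", "cosine"), base_ratio=0.2, final_ratio=None):
    """Learning rate at a given iteration (hypothetical helper, assumed semantics)."""
    base_lr = base_ratio * max_lr
    # Assumption: without final_ratio, the schedule returns to base_lr
    final_lr = base_lr if final_ratio is None else final_ratio * max_lr
    warmup_iters = round(warmup_ratio * total_size)
    if iteration < warmup_iters:
        # Phase 1: warm up from base_lr to max_lr
        return _interpolate(base_lr, max_lr, iteration / warmup_iters, phases[0])
    # Phase 2: anneal from max_lr to final_lr
    remaining = max(1, total_size - warmup_iters)
    pct = min((iteration - warmup_iters) / remaining, 1.0)
    return _interpolate(max_lr, final_lr, pct, phases[1])


for it in (0, 15, 30, 65, 100):
    print(it, one_cycle_lr(it, total_size=100, max_lr=1.0))
```

In a real training loop the scheduler itself would be stepped once per batch, so `iteration` here plays the role of the batch counter.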