holocron.optim

To use holocron.optim, you have to construct an optimizer object that will hold the current state and update the parameters based on the computed gradients.

Optimizers

class holocron.optim.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, scale_clip=None)[source]

Implements the Lamb optimizer from https://arxiv.org/pdf/1904.00962v3.pdf.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – beta coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
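To make the role of these parameters more concrete, here is a minimal pure-Python sketch of one LAMB-style step as described in the referenced paper: Adam-like running averages followed by a layer-wise rescaling of the update. The function name `lamb_step` and the list-based layout are illustrative assumptions, not the holocron implementation, and the `scale_clip` handling is left out.

```python
import math

def lamb_step(w, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One illustrative LAMB-style step for a single layer (hypothetical helper)."""
    beta1, beta2 = betas
    # Running averages of the gradient and of its square
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction for step t (t starts at 1)
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Adam-style direction plus the weight decay term
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Layer-wise ratio: weight norm over update norm (neutral if either is zero)
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return new_w, m, v
```

With this rescaling, the length of the step taken on a layer is `lr` times the norm of that layer's weights, regardless of the gradient magnitude.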

class holocron.optim.Lars(params, lr=0.001, momentum=0, dampening=0, weight_decay=0, nesterov=False, scale_clip=None)[source]

Implements the LARS optimizer from https://arxiv.org/pdf/1708.03888.pdf

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

  • scale_clip (tuple, optional) – the lower and upper bounds for the weight norm in local LR of LARS
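The local learning rate mentioned for `scale_clip` can be sketched as follows. This is a simplified pure-Python illustration, assuming that the bounds are applied to the weight norm before it is divided by the update norm; `lars_local_lr` is a hypothetical helper name and the actual implementation may differ in its details.

```python
import math

def lars_local_lr(weights, grads, weight_decay=0.0, scale_clip=None):
    """Return a layer-wise learning-rate scale in the spirit of LARS."""
    # Fold the L2 penalty into the gradient
    update = [g + weight_decay * w for w, g in zip(weights, grads)]
    w_norm = math.sqrt(sum(w * w for w in weights))
    u_norm = math.sqrt(sum(u * u for u in update))
    # Assumption: scale_clip = (low, high) bounds the weight norm
    if scale_clip is not None:
        low, high = scale_clip
        w_norm = min(max(w_norm, low), high)
    # Fall back to a neutral scale when either norm is zero
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0
    return w_norm / u_norm
```

The returned scale would then multiply the global `lr` for that layer, so layers with large weights and small gradients take proportionally larger steps.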

class holocron.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]

Implements the RAdam optimizer from https://arxiv.org/pdf/1908.03265.pdf

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
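What distinguishes RAdam from Adam is the variance rectification term from the referenced paper. The sketch below computes it for a given step in plain Python; the helper name `radam_rectification` is an assumption for illustration, and the threshold of 4 follows the paper rather than any particular implementation.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Return the rectification factor at step t (t >= 1), or None if the
    variance of the adaptive learning rate is not tractable yet."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Too few steps: the paper falls back to an un-adapted momentum update
        return None
    num = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    den = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(num / den)
```

With the default `betas`, the factor is undefined for the very first steps, stays small early in training, and approaches 1 as the number of steps grows, at which point the update behaves like Adam.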

class holocron.optim.RaLars(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, force_adaptive_momentum=False, scale_clip=None)[source]

Implements the RAdam optimizer from https://arxiv.org/pdf/1908.03265.pdf with optional Layer-wise adaptive Scaling from https://arxiv.org/pdf/1708.03888.pdf

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • force_adaptive_momentum (bool, optional) – use adaptive momentum if variance is not tractable (default: False)

  • scale_clip (float, optional) – the maximal upper bound for the scale factor of LARS

Optimizer wrappers

holocron.optim implements optimizer wrappers.

A base optimizer should always be passed to the wrapper; e.g., you should write your code this way:

>>> optimizer = ...
>>> optimizer = wrapper(optimizer)

class holocron.optim.wrapper.Lookahead(base_optimizer, sync_rate=0.5, sync_period=6)[source]

Implements the Lookahead optimizer wrapper from https://arxiv.org/pdf/1907.08610.pdf

Parameters:
  • base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer

  • sync_rate (float, optional) – rate of weight synchronization

  • sync_period (int, optional) – number of steps performed on fast weights before weight synchronization
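The interplay between the two parameters can be illustrated with a toy loop. This is a minimal sketch of the Lookahead idea from the referenced paper, assuming `sync_rate` plays the role of the slow-weight interpolation coefficient and `sync_period` the number of fast steps between synchronizations; `lookahead_sync` is a hypothetical helper, and the constant increment stands in for a real base optimizer step.

```python
def lookahead_sync(slow, fast, sync_rate=0.5):
    """Move the slow weights toward the fast weights, then restart the
    fast weights from the updated slow weights."""
    slow = [s + sync_rate * (f - s) for s, f in zip(slow, fast)]
    return slow, list(slow)

slow, fast = [0.0], [0.0]
sync_period = 6
for step in range(1, 13):
    # Stand-in for one update of the base optimizer on the fast weights
    fast = [f + 1.0 for f in fast]
    if step % sync_period == 0:
        slow, fast = lookahead_sync(slow, fast, sync_rate=0.5)
```

In this toy run the fast weights advance by 6 between synchronizations, and each synchronization keeps half of that progress.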

class holocron.optim.wrapper.Scout(base_optimizer, sync_rate=0.5, sync_period=6)[source]

Implements a new optimizer wrapper based on the initial Lookahead paper https://arxiv.org/pdf/1907.08610.pdf

Parameters:
  • base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer

  • sync_rate (float, optional) – rate of weight synchronization

  • sync_period (int, optional) – number of steps performed on fast weights before weight synchronization

Learning rate schedulers

holocron.optim.lr_scheduler provides methods to adjust the learning rate over the course of training. holocron.optim.lr_scheduler.OneCycleScheduler varies the learning rate, and optionally the momentum, at each iteration following a one cycle policy.

Learning rate scheduling should be applied after optimizer’s update; e.g., you should write your code this way:

>>> scheduler = ...
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()

class holocron.optim.lr_scheduler.OneCycleScheduler(optimizer, total_size, max_lr=None, warmup_ratio=0.3, phases=None, base_ratio=0.2, final_ratio=None, cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)[source]

Implements the One Cycle scheduler from https://arxiv.org/pdf/1803.09820.pdf

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • total_size (int) – Number of training iterations to be performed

  • max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any point in the cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached, depending on the scaling function.

  • warmup_ratio (float) – ratio of iterations used to reach max_lr

  • phases (tuple) – specify the scaling mode of both phases (possible values: ‘linear’, ‘cosine’)

  • base_ratio (float) – ratio between base_lr and max_lr during warmup phase

  • final_ratio (float) – ratio between base_lr and max_lr during last phase

  • cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True

  • base_momentum (float or list) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’. Default: 0.8

  • max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any point in the cycle is the difference of max_momentum and some scaling of the amplitude; therefore base_momentum may not actually be reached, depending on the scaling function. Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’. Default: 0.9

  • last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1
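To visualize how these parameters shape the learning rate, the following pure-Python sketch computes a two-phase schedule. It is an illustration under stated assumptions, not the scheduler's actual code: `one_cycle_lr` and `_interpolate` are hypothetical helpers, the warmup is taken to last `warmup_ratio * total_size` iterations starting at `base_ratio * max_lr`, the last phase is taken to end at `final_ratio * max_lr`, and the value 0.01 for `final_ratio` is arbitrary.

```python
import math

def _interpolate(start, end, pct, mode):
    """Interpolate from start (pct=0) to end (pct=1)."""
    if mode == "linear":
        return start + (end - start) * pct
    if mode == "cosine":
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))
    raise ValueError(f"unsupported scaling mode: {mode}")

def one_cycle_lr(step, total_size, max_lr, warmup_ratio=0.3, base_ratio=0.2,
                 final_ratio=0.01, phases=("linear", "cosine")):
    """Learning rate at a given iteration of an assumed two-phase cycle."""
    warmup_steps = round(total_size * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Phase 1: rise from base_ratio * max_lr to max_lr
        pct = step / warmup_steps
        return _interpolate(base_ratio * max_lr, max_lr, pct, phases[0])
    # Phase 2: anneal from max_lr down to final_ratio * max_lr
    pct = (step - warmup_steps) / max(1, total_size - warmup_steps)
    return _interpolate(max_lr, final_ratio * max_lr, min(pct, 1.0), phases[1])
```

Calling the function once per batch index, from 0 to `total_size`, traces the full cycle, which matches the note above that `step()` is meant to be invoked after each batch.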