holocron.optim

To use holocron.optim, you have to construct an optimizer object that will hold the current state and update the parameters based on the computed gradients.
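
For instance, assuming these optimizers follow the standard torch.optim.Optimizer interface, a minimal sketch could look like this (the module and the dummy loss are illustrative placeholders):

>>> import torch
>>> from holocron.optim import Lamb
>>> model = torch.nn.Linear(10, 2)          # placeholder module
>>> optimizer = Lamb(model.parameters(), lr=1e-3)
>>> loss = model(torch.rand(4, 10)).sum()   # dummy forward pass
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()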

Optimizers

Implementations of recent parameter optimizers for PyTorch modules.

class holocron.optim.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, scale_clip=None)[source]

Implements the Lamb optimizer from “Large batch optimization for deep learning: training BERT in 76 minutes”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – beta coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • scale_clip (tuple, optional) – lower and upper bounds for the weight norm used in the LARS local learning rate
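
A construction sketch, assuming Lamb exposes the usual torch.optim.Optimizer stepping interface (model is a placeholder and the hyper-parameter values are illustrative only):

>>> from holocron.optim import Lamb
>>> optimizer = Lamb(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2)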

class holocron.optim.Lars(params, lr=0.001, momentum=0, dampening=0, weight_decay=0, nesterov=False, scale_clip=None)[source]

Implements the LARS optimizer from “Large batch training of convolutional networks”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

  • scale_clip (tuple, optional) – lower and upper bounds for the weight norm used in the LARS local learning rate
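
A construction sketch with momentum, under the same assumption of a standard optimizer interface (model is a placeholder, values are illustrative):

>>> from holocron.optim import Lars
>>> optimizer = Lars(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)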

class holocron.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]

Implements the RAdam optimizer from “On the variance of the Adaptive Learning Rate and Beyond”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
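
A construction sketch following the documented signature (model is a placeholder, values are illustrative):

>>> from holocron.optim import RAdam
>>> optimizer = RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2)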

class holocron.optim.RaLars(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, force_adaptive_momentum=False, scale_clip=None)[source]

Implements the RAdam optimizer from “On the variance of the Adaptive Learning Rate and Beyond” with optional Layer-wise Adaptive Rate Scaling from “Large Batch Training of Convolutional Networks”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate

  • betas (Tuple[float, float], optional) – coefficients used for running averages (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • force_adaptive_momentum (bool, optional) – use adaptive momentum if the variance is not tractable (default: False)

  • scale_clip (float, optional) – upper bound for the LARS scaling factor
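
A construction sketch, again assuming the standard optimizer interface (model is a placeholder; the scale_clip value is purely illustrative):

>>> from holocron.optim import RaLars
>>> optimizer = RaLars(model.parameters(), lr=1e-3, weight_decay=1e-4, scale_clip=10.0)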

Optimizer wrappers

holocron.optim also implements optimizer wrappers.

A base optimizer should always be passed to the wrapper; e.g., you should write your code this way:

>>> optimizer = ...
>>> optimizer = wrapper(optimizer)
class holocron.optim.wrapper.Lookahead(base_optimizer, sync_rate=0.5, sync_period=6)[source]

Implements the Lookahead optimizer wrapper from “Lookahead Optimizer: k steps forward, 1 step back”.

Parameters:
  • base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer

  • sync_rate (float, optional) – interpolation rate used for weight synchronization (default: 0.5)

  • sync_period (int, optional) – number of steps performed on the fast weights before each synchronization (default: 6)
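
A wrapping sketch, assuming the wrapper behaves like a regular optimizer afterwards (model is a placeholder):

>>> from torch.optim import Adam
>>> from holocron.optim.wrapper import Lookahead
>>> base_optimizer = Adam(model.parameters(), lr=1e-3)
>>> optimizer = Lookahead(base_optimizer, sync_rate=0.5, sync_period=6)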

class holocron.optim.wrapper.Scout(base_optimizer, sync_rate=0.5, sync_period=6)[source]

Implements a new optimizer wrapper based on “Lookahead Optimizer: k steps forward, 1 step back”.

Parameters:
  • base_optimizer (torch.optim.optimizer.Optimizer) – base parameter optimizer

  • sync_rate (float, optional) – interpolation rate used for weight synchronization (default: 0.5)

  • sync_period (int, optional) – number of steps performed on the fast weights before each synchronization (default: 6)
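
A wrapping sketch, under the same assumptions as for Lookahead (model is a placeholder):

>>> from torch.optim import SGD
>>> from holocron.optim.wrapper import Scout
>>> optimizer = Scout(SGD(model.parameters(), lr=0.1, momentum=0.9), sync_rate=0.5, sync_period=6)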

Learning rate schedulers

holocron.optim.lr_scheduler provides several methods to adjust the learning rate as training progresses. holocron.optim.lr_scheduler.OneCycleScheduler schedules the learning rate over a single cycle spanning the whole run, warming it up towards max_lr before annealing it over the remaining iterations.

Learning rate scheduling should be applied after the optimizer’s update; e.g., you should write your code this way:

>>> scheduler = ...
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
class holocron.optim.lr_scheduler.OneCycleScheduler(optimizer, total_size, max_lr=None, warmup_ratio=0.3, phases=None, base_ratio=0.2, final_ratio=None, cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1)[source]

Implements the One Cycle scheduler from “A disciplined approach to neural network hyper-parameters”. Please note that this implementation predates PyTorch’s own torch.optim.lr_scheduler.OneCycleLR; using the official PyTorch implementation is advised.

Parameters:
  • optimizer (Optimizer) – Wrapped optimizer.

  • total_size (int) – Number of training iterations to be performed

  • max_lr (float or list) – Upper learning rate boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on scaling function.

  • warmup_ratio (float) – ratio of iterations used to reach max_lr

  • phases (tuple) – specify the scaling mode of both phases (possible values: ‘linear’, ‘cosine’)

  • base_ratio (float) – ratio between base_lr and max_lr during warmup phase

  • final_ratio (float) – ratio between base_lr and max_lr during last phase

  • cycle_momentum (bool) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’. Default: True

  • base_momentum (float or list) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’. Default: 0.8

  • max_momentum (float or list) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). The momentum at any cycle is the difference of max_momentum and some scaling of the amplitude; therefore base_momentum may not actually be reached depending on scaling function. Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’. Default: 0.9

  • last_epoch (int) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning. Default: -1
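
A per-batch training-loop sketch, assuming the scheduler follows the stepping convention described for last_epoch above (train_loader, model, criterion and optimizer are placeholders for your own objects):

>>> from holocron.optim.lr_scheduler import OneCycleScheduler
>>> num_epochs = 10
>>> scheduler = OneCycleScheduler(optimizer, total_size=num_epochs * len(train_loader), max_lr=1e-2)
>>> for epoch in range(num_epochs):
>>>     for x, target in train_loader:
>>>         optimizer.zero_grad()
>>>         loss = criterion(model(x), target)
>>>         loss.backward()
>>>         optimizer.step()
>>>         scheduler.step()  # invoked after each batch, not after each epoch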