holocron.optim¶
To use holocron.optim, you have to construct an optimizer object that will hold the current state and update the parameters based on the computed gradients.
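For instance, a minimal training step could look like the following sketch (the model, data, and hyperparameter values are placeholders, and the import assumes the optimizers are exposed at the package level, as documented on this page):

```python
import torch
from torch import nn
from holocron.optim import LARS  # any optimizer documented below is constructed the same way

model = nn.Linear(16, 2)  # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = LARS(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)

x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))  # placeholder batch
optimizer.zero_grad()          # clear previously accumulated gradients
loss = criterion(model(x), y)  # forward pass
loss.backward()                # compute gradients
optimizer.step()               # update the parameters
```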
Optimizers¶
Implementations of recent parameter optimizers for PyTorch modules.
LARS¶
LARS(params: Iterable[Parameter], lr: float = 0.001, momentum: float = 0.0, dampening: float = 0.0, weight_decay: float = 0.0, nesterov: bool = False, scale_clip: tuple[float, float] | None = None)
Bases: Optimizer
Implements the LARS optimizer from "Large batch training of convolutional networks".
The estimation of global and local learning rates is described as follows, \(\forall t \geq 1\):
where \(\theta_t\) is the parameter value at step \(t\) (\(\theta_0\) being the initialization value), \(g_t\) is the gradient of \(\theta_t\), \(T\) is the total number of steps, \(\alpha\) is the learning rate, and \(\lambda \geq 0\) is the weight decay.
Then we estimate the momentum using:
where \(m\) is the momentum and \(v_0 = 0\).
And finally the update step is performed using the following rule:
| PARAMETER | DESCRIPTION |
|---|---|
| params | iterable of parameters to optimize or dicts defining parameter groups. TYPE: `Iterable[Parameter]` |
| lr | learning rate. TYPE: `float` |
| momentum | momentum factor. TYPE: `float` |
| weight_decay | weight decay (L2 penalty). TYPE: `float` |
| dampening | dampening for momentum. TYPE: `float` |
| nesterov | enables Nesterov momentum. TYPE: `bool` |
| scale_clip | the lower and upper bounds for the weight norm in local LR of LARS. TYPE: `tuple[float, float] \| None` |
Source code in holocron/optim/lars.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |
Source code in holocron/optim/lars.py
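As with the built-in torch optimizers, `step` may be given a closure that recomputes the loss; a short sketch, reusing the placeholder `model`, `criterion`, `x`, `y`, and `optimizer` names from the introduction above:

```python
def closure():
    # re-evaluate the model and return the loss, as expected by step()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss

loss = optimizer.step(closure)
```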
LAMB¶
LAMB(params: Iterable[Parameter], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, scale_clip: tuple[float, float] | None = None)
Bases: Optimizer
Implements the LAMB optimizer from "Large batch optimization for deep learning: training BERT in 76 minutes".
The estimation of momentums is described as follows, \(\forall t \geq 1\):
where \(g_t\) is the gradient of \(\theta_t\), \(\beta_1, \beta_2 \in [0, 1]^2\) are the exponential average smoothing coefficients, \(m_0 = 0,\ v_0 = 0\).
Then we correct their biases using:
And finally the update step is performed using the following rule:
where \(\theta_t\) is the parameter value at step \(t\) (\(\theta_0\) being the initialization value), \(\phi\) is a clipping function, \(\alpha\) is the learning rate, \(\lambda \geq 0\) is the weight decay, \(\epsilon > 0\).
| PARAMETER | DESCRIPTION |
|---|---|
| params | iterable of parameters to optimize or dicts defining parameter groups. TYPE: `Iterable[Parameter]` |
| lr | learning rate. TYPE: `float` |
| betas | beta coefficients used for running averages. TYPE: `tuple[float, float]` |
| eps | term added to the denominator to improve numerical stability. TYPE: `float` |
| weight_decay | weight decay (L2 penalty). TYPE: `float` |
| scale_clip | the lower and upper bounds for the weight norm in local LR of LARS. TYPE: `tuple[float, float] \| None` |
Source code in holocron/optim/lamb.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | if the optimizer does not support sparse gradients |
Source code in holocron/optim/lamb.py
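Since `params` also accepts dicts defining parameter groups, weight decay can for instance be disabled on biases and normalization parameters; a sketch assuming `LAMB` is importable from `holocron.optim` and `model` is any `nn.Module`:

```python
from holocron.optim import LAMB

decay, no_decay = [], []
for name, param in model.named_parameters():
    # common heuristic: no weight decay on biases and 1-D (normalization) parameters
    (no_decay if param.ndim == 1 or name.endswith(".bias") else decay).append(param)

optimizer = LAMB(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
    betas=(0.9, 0.999),
)
```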
RaLars¶
RaLars(params: Iterable[Parameter], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, force_adaptive_momentum: bool = False, scale_clip: tuple[float, float] | None = None)
Bases: Optimizer
Implements the RAdam optimizer from "On the Variance of the Adaptive Learning Rate and Beyond", with optional layer-wise adaptive scaling from "Large Batch Training of Convolutional Networks".
| PARAMETER | DESCRIPTION |
|---|---|
| params | iterable of parameters to optimize or dicts defining parameter groups. TYPE: `Iterable[Parameter]` |
| lr | learning rate. TYPE: `float` |
| betas | coefficients used for running averages. TYPE: `tuple[float, float]` |
| eps | term added to the denominator to improve numerical stability. TYPE: `float` |
| weight_decay | weight decay (L2 penalty). TYPE: `float` |
| force_adaptive_momentum | use adaptive momentum if variance is not tractable. TYPE: `bool` |
| scale_clip | the maximal upper bound for the scale factor of LARS. TYPE: `tuple[float, float] \| None` |
Source code in holocron/optim/ralars.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | if the optimizer does not support sparse gradients |
Source code in holocron/optim/ralars.py
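A construction sketch with the RaLars-specific arguments (the values are illustrative, `model` is any `nn.Module`, and the import assumes package-level exposure):

```python
from holocron.optim import RaLars

optimizer = RaLars(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=1e-4,
    force_adaptive_momentum=False,  # set to True to use adaptive momentum when the variance is not tractable
    scale_clip=(0.0, 10.0),         # illustrative bounds on the LARS scale factor
)
```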
TAdam¶
TAdam(params: Iterable[Parameter], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, amsgrad: bool = False, dof: float | None = None)
Bases: Optimizer
Implements the TAdam optimizer from "TAdam: A Robust Stochastic Gradient Optimizer".
The estimation of momentums is described as follows, \(\forall t \geq 1\):
where \(g_t\) is the gradient of \(\theta_t\), \(\beta_1, \beta_2 \in [0, 1]^2\) are the exponential average smoothing coefficients, \(m_0 = 0,\ v_0 = 0,\ W_0 = \frac{\beta_1}{1 - \beta_1}\); \(\nu\) is the degrees of freedom and \(d\) is the number of dimensions of the parameter gradient.
Then we correct their biases using:
And finally the update step is performed using the following rule:
where \(\theta_t\) is the parameter value at step \(t\) (\(\theta_0\) being the initialization value), \(\alpha\) is the learning rate, \(\epsilon > 0\).
| PARAMETER | DESCRIPTION |
|---|---|
| params | iterable of parameters to optimize or dicts defining parameter groups. TYPE: `Iterable[Parameter]` |
| lr | learning rate. TYPE: `float` |
| betas | coefficients used for running averages. TYPE: `tuple[float, float]` |
| eps | term added to the denominator to improve numerical stability. TYPE: `float` |
| weight_decay | weight decay (L2 penalty). TYPE: `float` |
| dof | degrees of freedom. TYPE: `float \| None` |
Source code in holocron/optim/tadam.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | if the optimizer does not support sparse gradients |
Source code in holocron/optim/tadam.py
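A construction sketch showing the `dof` argument (values are illustrative, `model` is any `nn.Module`, and the import assumes package-level exposure):

```python
from holocron.optim import TAdam

# dof=None defers to the optimizer's default choice of degrees of freedom;
# per the TAdam paper, smaller values make the updates more robust to outlier gradients
optimizer = TAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), dof=None)
```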
AdaBelief¶
Bases: Adam
Implements the AdaBelief optimizer from "AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients".
The estimation of momentums is described as follows, \(\forall t \geq 1\):
where \(g_t\) is the gradient of \(\theta_t\), \(\beta_1, \beta_2 \in [0, 1]^2\) are the exponential average smoothing coefficients, \(m_0 = 0,\ s_0 = 0\), \(\epsilon > 0\).
Then we correct their biases using:
And finally the update step is performed using the following rule:
where \(\theta_t\) is the parameter value at step \(t\) (\(\theta_0\) being the initialization value), \(\alpha\) is the learning rate, \(\epsilon > 0\).
| PARAMETER | DESCRIPTION |
|---|---|
| params | iterable of parameters to optimize or dicts defining parameter groups |
| lr | learning rate |
| betas | coefficients used for running averages |
| eps | term added to the denominator to improve numerical stability |
| weight_decay | weight decay (L2 penalty) |
| amsgrad | whether to use the AMSGrad variant |
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | if the optimizer does not support sparse gradients |
Source code in holocron/optim/adabelief.py
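As the class derives from `Adam`, it is constructed like `torch.optim.Adam`; a sketch with illustrative hyperparameters (`model` is any `nn.Module`, and the import assumes package-level exposure):

```python
from holocron.optim import AdaBelief

# drop-in replacement for Adam with the belief-based second-moment estimate
optimizer = AdaBelief(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
```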
AdamP¶
AdamP(params: Iterable[Parameter], lr: float = 0.001, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0, amsgrad: bool = False, delta: float = 0.1)
Bases: Adam
Implements the AdamP optimizer from "AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights".
The estimation of momentums is described as follows, \(\forall t \geq 1\):
where \(g_t\) is the gradient of \(\theta_t\), \(\beta_1, \beta_2 \in [0, 1]^2\) are the exponential average smoothing coefficients, \(m_0 = g_0,\ v_0 = 0\).
Then we correct their biases using:
And finally the update step is performed using the following rule:
where \(\theta_t\) is the parameter value at step \(t\) (\(\theta_0\) being the initialization value), \(\prod_{\theta_t}(p_t)\) is the projection of \(p_t\) onto the tangent space of \(\theta_t\), \(cos(\theta_t, g_t)\) is the cosine similarity between \(\theta_t\) and \(g_t\), \(\alpha\) is the learning rate, \(\delta > 0\), \(\epsilon > 0\).
| PARAMETER | DESCRIPTION |
|---|---|
| params | iterable of parameters to optimize or dicts defining parameter groups. TYPE: `Iterable[Parameter]` |
| lr | learning rate. TYPE: `float` |
| betas | coefficients used for running averages. TYPE: `tuple[float, float]` |
| eps | term added to the denominator to improve numerical stability. TYPE: `float` |
| weight_decay | weight decay (L2 penalty). TYPE: `float` |
| amsgrad | whether to use the AMSGrad variant. TYPE: `bool` |
| delta | delta threshold for projection. TYPE: `float` |
Source code in holocron/optim/adamp.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | if the optimizer does not support sparse gradients |
Source code in holocron/optim/adamp.py
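A construction sketch showing the `delta` argument (values are illustrative, `model` is any `nn.Module`, and the import assumes package-level exposure):

```python
from holocron.optim import AdamP

# delta is the threshold on the cosine similarity used to decide when the projection is applied
optimizer = AdamP(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2, delta=0.1)
```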
Adan¶
Adan(params: Iterable[Parameter], lr: float = 0.001, betas: tuple[float, float, float] = (0.98, 0.92, 0.99), eps: float = 1e-08, weight_decay: float = 0.0, amsgrad: bool = False)
Bases: Adam
Implements the Adan optimizer from "Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models".
The estimation of momentums is described as follows, \(\forall t \geq 1\):
where \(g_t\) is the gradient of \(\theta_t\), \(\beta_1, \beta_2, \beta_3 \in [0, 1]^3\) are the exponential average smoothing coefficients, \(m_0 = g_0,\ v_0 = 0,\ n_0 = g_0^2\).
Then we correct their biases using:
And finally the update step is performed using the following rule:
where \(\theta_t\) is the parameter value at step \(t\) (\(\theta_0\) being the initialization value), \(\alpha\) is the learning rate, \(\lambda \geq 0\) is the weight decay, \(\epsilon > 0\).
| PARAMETER | DESCRIPTION |
|---|---|
| params | iterable of parameters to optimize or dicts defining parameter groups. TYPE: `Iterable[Parameter]` |
| lr | learning rate. TYPE: `float` |
| betas | coefficients used for running averages. TYPE: `tuple[float, float, float]` |
| eps | term added to the denominator to improve numerical stability. TYPE: `float` |
| weight_decay | weight decay (L2 penalty). TYPE: `float` |
| amsgrad | whether to use the AMSGrad variant. TYPE: `bool` |
Source code in holocron/optim/adan.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | if the optimizer does not support sparse gradients |
Source code in holocron/optim/adan.py
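A construction sketch; note that `betas` takes three coefficients (values are illustrative, `model` is any `nn.Module`, and the import assumes package-level exposure):

```python
from holocron.optim import Adan

# following the paper's notation, the three betas drive the EMAs of the gradient,
# of the gradient difference, and of the squared gradient respectively
optimizer = Adan(model.parameters(), lr=1e-3, betas=(0.98, 0.92, 0.99), weight_decay=0.02)
```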
AdEMAMix¶
AdEMAMix(params: Iterable[Parameter], lr: float = 0.001, betas: tuple[float, float, float] = (0.9, 0.999, 0.9999), alpha: float = 5.0, eps: float = 1e-08, weight_decay: float = 0.0)
Bases: Optimizer
Implements the AdEMAMix optimizer from "The AdEMAMix Optimizer: Better, Faster, Older".
The estimation of momentums is described as follows, \(\forall t \geq 1\):
where \(g_t\) is the gradient of \(\theta_t\), \(\beta_1, \beta_2, \beta_3 \in [0, 1]^3\) are the exponential average smoothing coefficients, \(m_{1,0} = 0,\ m_{2,0} = 0,\ s_0 = 0\), \(\epsilon > 0\).
Then we correct their biases using:
And finally the update step is performed using the following rule:
where \(\theta_t\) is the parameter value at step \(t\) (\(\theta_0\) being the initialization value), \(\eta\) is the learning rate, \(\alpha > 0\), \(\epsilon > 0\).
| PARAMETER | DESCRIPTION |
|---|---|
| params | iterable of parameters to optimize or dicts defining parameter groups. TYPE: `Iterable[Parameter]` |
| lr | learning rate. TYPE: `float` |
| betas | coefficients used for running averages. TYPE: `tuple[float, float, float]` |
| alpha | the exponential decay rate of the second moment estimates. TYPE: `float` |
| eps | term added to the denominator to improve numerical stability. TYPE: `float` |
| weight_decay | weight decay (L2 penalty). TYPE: `float` |
Source code in holocron/optim/ademamix.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | if the optimizer does not support sparse gradients |
Source code in holocron/optim/ademamix.py
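A construction sketch; `betas` takes three coefficients and, per the AdEMAMix paper, `alpha` weights the slower gradient EMA in the update (values are illustrative, `model` is any `nn.Module`, and the import assumes package-level exposure):

```python
from holocron.optim import AdEMAMix

# the update mixes a fast and a slow gradient EMA, with the slow one scaled by alpha
optimizer = AdEMAMix(model.parameters(), lr=1e-3, betas=(0.9, 0.999, 0.9999), alpha=5.0)
```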
Optimizer wrappers¶
holocron.optim also implements optimizer wrappers.
A base optimizer should always be passed to the wrapper; e.g., you should write your code this way:
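For instance, mirroring the wrapper examples below (the `AdamW` hyperparameters are only illustrative):

```python
from torch.optim import AdamW
from holocron.optim.wrapper import Lookahead  # or Scout

model = ...
base_optimizer = AdamW(model.parameters(), lr=3e-4)  # any torch optimizer can serve as the base
optimizer = Lookahead(base_optimizer)                # the wrapper is then used like a regular optimizer
```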
Lookahead¶
Bases: Optimizer
Implements the Lookahead optimizer wrapper from ["Lookahead Optimizer: k steps forward, 1 step back"](https://arxiv.org/pdf/1907.08610.pdf).

```python
from torch.optim import AdamW
from holocron.optim.wrapper import Lookahead

model = ...
opt = AdamW(model.parameters(), lr=3e-4)
opt_wrapper = Lookahead(opt)
```
| PARAMETER | DESCRIPTION |
|---|---|
| base_optimizer | base parameter optimizer |
| sync_rate | rate of weight synchronization |
| sync_period | number of steps performed on fast weights before weight synchronization |
Source code in holocron/optim/wrapper.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |
Source code in holocron/optim/wrapper.py
add_param_group¶
Adds a parameter group to the base optimizer (fast weights) and its corresponding slow version.

| PARAMETER | DESCRIPTION |
|---|---|
| param_group | parameter group |
Source code in holocron/optim/wrapper.py
sync_params¶
sync_params(sync_rate: float = 0.0) -> None
Synchronize parameters as follows: `slow_param <- slow_param + sync_rate * (fast_param - slow_param)`

| PARAMETER | DESCRIPTION |
|---|---|
| sync_rate | synchronization rate of parameters. TYPE: `float` |
Source code in holocron/optim/wrapper.py
Scout¶
Bases: Optimizer
Implements a new optimizer wrapper based on "Lookahead Optimizer: k steps forward, 1 step back".
Example

```python
from torch.optim import AdamW
from holocron.optim.wrapper import Scout

model = ...
opt = AdamW(model.parameters(), lr=3e-4)
opt_wrapper = Scout(opt)
```
| PARAMETER | DESCRIPTION |
|---|---|
| base_optimizer | base parameter optimizer |
| sync_rate | rate of weight synchronization |
| sync_period | number of steps performed on fast weights before weight synchronization |
Source code in holocron/optim/wrapper.py
step¶
Performs a single optimization step.

| PARAMETER | DESCRIPTION |
|---|---|
| closure | A closure that reevaluates the model and returns the loss. |

| RETURNS | DESCRIPTION |
|---|---|
| `float \| None` | loss value |
Source code in holocron/optim/wrapper.py
add_param_group¶
Adds a parameter group to the base optimizer (fast weights) and its corresponding slow version.

| PARAMETER | DESCRIPTION |
|---|---|
| param_group | parameter group |
Source code in holocron/optim/wrapper.py
sync_params¶
sync_params(sync_rate: float = 0.0) -> None
Synchronize parameters as follows: `slow_param <- slow_param + sync_rate * (fast_param - slow_param)`

| PARAMETER | DESCRIPTION |
|---|---|
| sync_rate | synchronization rate of parameters. TYPE: `float` |