flint.optim

class flint.optim.Optimizer(params=None, lr: float = 0.01, weight_decay: float = 0.0)[source]

Bases: object

Base class for all optimizers.

Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=0.01) – Learning rate

  • weight_decay (float, optional, default=0.) – Weight decay (L2 penalty)

step()[source]
zero_grad()[source]

Set the gradients of all parameters to zero.
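
The contract this base class describes (parameters carrying data and gradients, zero_grad() clearing gradients, step() applying an update) can be sketched with plain NumPy arrays standing in for flint Tensors. The ToyParam and ToyOptimizer names below are illustrative assumptions, not part of flint, and folding the L2 penalty into the gradient is one common convention, not necessarily flint's:

    import numpy as np

    class ToyParam:
        """Stand-in for a flint Tensor: holds .data and an accumulated .grad."""
        def __init__(self, data):
            self.data = np.asarray(data, dtype=float)
            self.grad = np.zeros_like(self.data)

    class ToyOptimizer:
        """Mimics the Optimizer contract: store params, zero_grad(), step()."""
        def __init__(self, params, lr=0.01, weight_decay=0.0):
            self.params = list(params)
            self.lr = lr
            self.weight_decay = weight_decay

        def zero_grad(self):
            # Set the gradients of all parameters to zero.
            for p in self.params:
                p.grad = np.zeros_like(p.data)

        def step(self):
            # Plain gradient descent with the L2 penalty folded into the gradient.
            for p in self.params:
                g = p.grad + self.weight_decay * p.data
                p.data -= self.lr * g

    w = ToyParam([1.0, -2.0])
    opt = ToyOptimizer([w], lr=0.1)
    w.grad = np.array([0.5, -0.5])   # pretend a backward pass produced this
    opt.step()                       # w.data is now [0.95, -1.95]
    opt.zero_grad()                  # gradients are reset to zero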

class flint.optim.SGD(params=None, lr: float = 0.01, momentum: float = 0.0, nesterov: bool = False, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of Stochastic Gradient Descent (optionally with momentum).

\[v_{t+1} = \mu \cdot v_t + g_{t+1} \]
\[\theta_{t+1} = \theta_t - \text{lr} \cdot v_{t+1} \]

where \(\theta\), \(g\), \(v\) and \(\mu\) denote the parameters, gradient, velocity, and momentum factor, respectively.

Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=0.01) – Learning rate

  • momentum (float, optional, default=0.) – Momentum factor

  • nesterov (bool, optional, default=False) – Whether to enable Nesterov momentum

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

step()[source]
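
The update rule above translates into a few lines of NumPy. This is a sketch of the formulas, not flint's code; the Nesterov branch follows the common convention of applying the momentum-adjusted gradient, and folding the L2 penalty into the gradient is an assumption:

    import numpy as np

    def sgd_step(theta, grad, velocity, lr=0.01, momentum=0.0,
                 nesterov=False, weight_decay=0.0):
        # One SGD update per the formulas above (NumPy sketch).
        g = grad + weight_decay * theta      # L2 penalty folded into the gradient
        velocity = momentum * velocity + g   # v_{t+1} = mu * v_t + g_{t+1}
        update = (g + momentum * velocity) if nesterov else velocity
        theta = theta - lr * update          # theta_{t+1} = theta_t - lr * v_{t+1}
        return theta, velocity

    theta = np.array([1.0, -2.0])
    v = np.zeros_like(theta)
    theta, v = sgd_step(theta, np.array([0.5, -0.5]), v, lr=0.1, momentum=0.9)
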
class flint.optim.Adadelta(params=None, rho: float = 0.99, eps: float = 1e-06, lr: float = 1.0, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of Adadelta algorithm proposed in [1].

\[h_t = \rho h_{t-1} + (1 - \rho) g_t^2 \]
\[g'_t = \sqrt{\frac{\Delta \theta_{t-1} + \epsilon}{h_t + \epsilon}} \cdot g_t \]
\[\Delta \theta_t = \rho \Delta \theta_{t-1} + (1 - \rho) (g'_t)^2 \]
\[\theta_t = \theta_{t-1} - g'_t \]

where \(h\) is the moving average of the squared gradients, \(\Delta \theta\) is the moving average of the squared parameter updates, and \(\epsilon\) is a term added for numerical stability.

Parameters
  • params (iterable) – An iterable of Tensor

  • rho (float, optional, default=0.99) – Coefficient used for computing a running average of squared gradients

  • eps (float, optional, default=1e-6) – Term added to the denominator to improve numerical stability

  • lr (float, optional, default=1.0) – Coefficient that scales the delta before it is applied to the parameters

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

References

  1. “ADADELTA: An Adaptive Learning Rate Method.” Matthew D. Zeiler. arXiv 2012.

step()[source]
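
A NumPy sketch of the four formulas above (illustrative only, not flint's implementation; the lr scaling and the weight-decay handling are assumptions based on the parameter descriptions):

    import numpy as np

    def adadelta_step(theta, grad, h, delta, rho=0.99, eps=1e-6,
                      lr=1.0, weight_decay=0.0):
        # One Adadelta update per the formulas above (NumPy sketch).
        g = grad + weight_decay * theta
        h = rho * h + (1 - rho) * g**2                  # running avg of squared gradients
        g_adj = np.sqrt((delta + eps) / (h + eps)) * g  # rescaled gradient g'_t
        delta = rho * delta + (1 - rho) * g_adj**2      # running avg of squared updates
        theta = theta - lr * g_adj                      # lr scales the delta (default 1.0)
        return theta, h, delta

    theta = np.array([1.0, -2.0])
    h = np.zeros_like(theta)
    delta = np.zeros_like(theta)
    theta, h, delta = adadelta_step(theta, np.array([0.5, -0.5]), h, delta)
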
class flint.optim.Adagrad(params=None, lr: float = 0.01, eps: float = 1e-10, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of Adagrad algorithm proposed in [1].

\[h_t = h_{t-1} + g_t^2 \]
\[\theta_{t+1} = \theta_t - \frac{\text{lr}}{\sqrt{h_t + \epsilon}} \cdot g_t \]
Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=0.01) – Learning rate

  • eps (float, optional, default=1e-10) – Term added to the denominator to improve numerical stability

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

References

  1. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” John Duchi et al. JMLR 2011.

step()[source]
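
The two formulas above translate directly into NumPy (illustrative sketch only; the weight-decay handling is an assumption):

    import numpy as np

    def adagrad_step(theta, grad, h, lr=0.01, eps=1e-10, weight_decay=0.0):
        # One Adagrad update per the formulas above (NumPy sketch).
        g = grad + weight_decay * theta
        h = h + g**2                               # h_t = h_{t-1} + g_t^2
        theta = theta - lr / np.sqrt(h + eps) * g  # adaptive per-coordinate step
        return theta, h

    theta = np.array([1.0, -2.0])
    h = np.zeros_like(theta)
    theta, h = adagrad_step(theta, np.array([0.5, -0.5]), h)
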
class flint.optim.Adam(params=None, lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of Adam algorithm proposed in [1].

\[v_t = \beta_1 v_{t-1} + (1 - \beta_1) g_t \]
\[h_t = \beta_2 h_{t-1} + (1 - \beta_2) g_t^2 \]

Bias correction:

\[\hat{v}_t = \frac{v_t}{1 - \beta_1^t} \]
\[\hat{h}_t = \frac{h_t}{1 - \beta_2^t} \]

Update parameters:

\[\theta_t = \theta_{t-1} - \text{lr} \cdot \frac{\hat{v}_t}{\sqrt{\hat{h}_t + \epsilon}} \]
Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=1e-3) – Learning rate

  • betas (Tuple[float, float], optional, default=(0.9, 0.999)) – Coefficients used for computing running averages of gradient and its square

  • eps (float, optional, default=1e-8) – Term added to the denominator to improve numerical stability

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

References

  1. “Adam: A Method for Stochastic Optimization.” Diederik P. Kingma and Jimmy Ba. ICLR 2015.

step()[source]
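
A NumPy sketch of the Adam update, including the bias correction (illustrative only, not flint's implementation; the weight-decay handling is an assumption and the step counter t starts at 1):

    import numpy as np

    def adam_step(theta, grad, v, h, t, lr=1e-3, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=0.0):
        # One Adam update per the formulas above (NumPy sketch); t is the step count.
        beta1, beta2 = betas
        g = grad + weight_decay * theta
        v = beta1 * v + (1 - beta1) * g      # first moment (running avg of gradients)
        h = beta2 * h + (1 - beta2) * g**2   # second moment (running avg of squared gradients)
        v_hat = v / (1 - beta1**t)           # bias correction
        h_hat = h / (1 - beta2**t)
        theta = theta - lr * v_hat / np.sqrt(h_hat + eps)
        return theta, v, h

    theta = np.array([1.0, -2.0])
    v = np.zeros_like(theta)
    h = np.zeros_like(theta)
    theta, v, h = adam_step(theta, np.array([0.5, -0.5]), v, h, t=1)
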
class flint.optim.RMSprop(params=None, lr: float = 0.01, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0.0)[source]

Bases: flint.optim.optimizer.Optimizer

Implementation of RMSprop algorithm proposed in [1].

\[h_t = \alpha h_{t-1} + (1 - \alpha) g_t^2 \]
\[\theta_{t+1} = \theta_t - \frac{\text{lr}}{\sqrt{h_t + \epsilon}} \cdot g_t \]
Parameters
  • params (iterable) – An iterable of Tensor

  • lr (float, optional, default=0.01) – Learning rate

  • alpha (float, optional, default=0.99) – Coefficient used for computing a running average of squared gradients

  • eps (float, optional, default=1e-8) – Term added to the denominator to improve numerical stability

  • weight_decay (float, optional, default=0) – Weight decay (L2 penalty)

References

  1. “Neural Networks for Machine Learning, Lecture 6e - rmsprop: Divide the gradient by a running average of its recent magnitude.” Geoffrey Hinton.

step()[source]
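
And the RMSprop rule as a NumPy sketch (again illustrative, with the weight-decay handling as an assumption):

    import numpy as np

    def rmsprop_step(theta, grad, h, lr=0.01, alpha=0.99, eps=1e-8, weight_decay=0.0):
        # One RMSprop update per the formulas above (NumPy sketch).
        g = grad + weight_decay * theta
        h = alpha * h + (1 - alpha) * g**2         # running avg of squared gradients
        theta = theta - lr / np.sqrt(h + eps) * g  # scale the step by the running RMS
        return theta, h

    theta = np.array([1.0, -2.0])
    h = np.zeros_like(theta)
    theta, h = rmsprop_step(theta, np.array([0.5, -0.5]), h)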